Learning Plan: AI/LLM Practical Applications for Solution Architecture
Goal: Develop a detailed conceptual understanding of AI solution architectures to evaluate technical and economic viability, maintain technical credibility as a sanity-check resource, and position yourself strategically in an AI-transformed landscape.
Target Depth: "AI-literate decision-maker and architect" - sufficient understanding to evaluate whether proposed solutions make sense before they're built, identify architectural limitations vs. implementation problems, and translate between technical possibilities and business requirements.
Time Commitment: 1 hour/day, sustained learning
Background: 15 years in education data/tech consulting, familiar with Karpathy's LLM content, regular Claude/ChatGPT user, data engineering background
Note on Structure: Phase 1 is designed to be completable in ~15 days before your strategy meeting. It front-loads actionable architectural knowledge. Phases 2-5 build deeper foundations and expand into specialized topics.
Phase 1: Fast Track to AI Architecture Credibility (Weeks 1-2, ~15 days)
Purpose: Equip you to evaluate RAG proposals, understand token economics, and assess tool usage architectures before your meeting. You'll learn just enough about embeddings and LLM mechanics to ask intelligent questions and spot architectural red flags.
Days 1-3: Embeddings & Vector Search Fundamentals
Primary Resource:
- MIT Lecture: "Embeddings, RAG, and Vector Databases" (Mike Cafarella, March 2024)
- PDF: https://dsg.csail.mit.edu/6.S079/lectures/lec10-s24.pdf
- This is a ~50-slide academic lecture deck
- Read through systematically, ~20 minutes per session
- Focus on slides covering: what embeddings are, how similarity search works, trade-offs in embedding model selection
Supplementary Video:
- "What are Embeddings" by StatQuest (search YouTube, ~15 min)
- Watch at 1.25x speed
- Provides intuitive visual explanation of how words become vectors
Why this matters: When someone proposes "embedding all your documents in a vector database," you need to understand: (1) why vectors enable semantic search vs. keyword search, (2) what "distance" means and why it matters, (3) embedding model implications (size, cost, vocabulary), and (4) retrieval precision limitations. This is the foundation for evaluating RAG proposals.
Key concepts to master:
- Embeddings as semantic representations (king - man + woman ≈ queen)
- Vector similarity metrics (cosine similarity, euclidean distance)
- Embedding model trade-offs (performance vs. cost vs. dimensionality)
- Why embeddings can't be reversed to recover the original text
- Vocabulary limitations and out-of-vocabulary handling
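If it helps to make "distance" concrete, here is a minimal sketch of cosine similarity using NumPy and made-up 4-dimensional vectors (real embedding models output hundreds to thousands of dimensions); the phrases in the comments are illustrative only:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity compares direction, not magnitude: closer to 1.0 = more similar meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models produce 384-3,072 dimensions.
refund_policy = np.array([0.9, 0.1, 0.0, 0.2])   # e.g. "tuition refund policy"
get_fees_back = np.array([0.8, 0.2, 0.1, 0.3])   # e.g. "how do I get my fees back?"
parking_map   = np.array([0.1, 0.9, 0.7, 0.0])   # e.g. "campus parking map"

print(cosine_similarity(refund_policy, get_fees_back))  # high: related meaning, no shared keywords
print(cosine_similarity(refund_policy, parking_map))    # low: unrelated topic
```

The first pair shares no keywords yet sits close together in vector space; that is the "search by meaning" behavior that keyword search cannot provide.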
Hands-on exercise (1-2 hours):
Option A: Use OpenAI's Embeddings API playground
- Create embeddings for 5-10 related phrases
- Visualize similarity scores
- Change one word and observe embedding changes
Option B: Follow Microsoft's "Generate Embeddings" tutorial
- https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-generate-embeddings
- Read through code examples without necessarily running them
- Focus on understanding the pipeline: text → chunks → embeddings → storage
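For reference while reading, here is a minimal sketch of that pipeline in Python. It assumes the openai package and an OPENAI_API_KEY in your environment; the chunk sizes and the text-embedding-3-small model are illustrative choices, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap; production systems often chunk semantically."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "Embeddings map text to vectors so that similar meanings land near each other. " * 40
chunks = chunk_text(document)

# One API call can embed a batch of chunks.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)

# "Storage": in production this goes into a vector database; a list is enough to see the shape.
store = [{"text": c, "embedding": d.embedding} for c, d in zip(chunks, response.data)]
print(f"{len(store)} chunks embedded, {len(store[0]['embedding'])} dimensions each")
```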
Daily breakdown:
- Day 1: MIT lecture slides 1-25 (embeddings fundamentals), StatQuest video
- Day 2: MIT lecture slides 26-50 (vector databases, similarity search)
- Day 3: Hands-on exercise + review notes
Success metric: Can you explain to a colleague why "search by meaning" works differently from "search by keywords" and what the cost/performance implications are?
Days 4-7: RAG Architecture End-to-End
Primary Resource:
- Microsoft Generative AI for Beginners: RAG & Vector Databases
- https://github.com/microsoft/generative-ai-for-beginners/blob/main/15-rag-and-vector-databases/README.md
- Free, comprehensive, includes code examples
- Read thoroughly, ~45 minutes per session
- Skip the detailed implementation code, focus on architecture diagrams and decision points
Supplementary:
"Beyond Vector Databases: RAG Without Embeddings" (DigitalOcean, Aug 2025)
- https://www.digitalocean.com/community/tutorials/beyond-vector-databases-rag-without-embeddings
- Read sections on BM25, GraphRAG, Prompt-RAG
- ~30 minutes
- Helps you understand when RAG without embeddings makes sense
Why this matters: RAG is the most common proposal you'll evaluate. You need to understand: (1) the full pipeline (ingestion → chunking → embedding → retrieval → generation), (2) where costs accumulate, (3) common failure modes (retrieval precision, context window limits), (4) when RAG is appropriate vs. fine-tuning or long-context models, and (5) architectural alternatives.
Key concepts:
- RAG pipeline stages: prepare → embed → store → retrieve → augment → generate
- Chunking strategies (fixed size vs. semantic, overlap considerations)
- Vector database options (Pinecone, Chroma, Weaviate, Qdrant, pgvector) and selection criteria
- Retrieval strategies: similarity search, hybrid search (keyword + semantic), reranking
- Cost structure: embedding API calls, vector storage, retrieval latency, LLM inference
- Failure modes: poor chunking loses context, retrieval misses relevant docs, hallucination despite grounding
- When to use: dynamic knowledge bases, need for citations, cost-prohibitive to fine-tune
- When NOT to use: stable/small knowledge base, need deterministic responses, ultra-low latency requirements
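To make the retrieve → augment → generate stages concrete, here is a minimal sketch that builds on the `store` list of {"text", "embedding"} dicts from the Day 1-3 embedding sketch. It assumes the openai package and an API key; the model names are illustrative, and a real system would add hybrid search, reranking, and citation handling:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, store: list[dict], k: int = 3) -> list[str]:
    """Embed the query, rank stored chunks by cosine similarity, return the top k."""
    q = client.embeddings.create(model="text-embedding-3-small", input=[query]).data[0].embedding
    def score(embedding: list[float]) -> float:
        return float(np.dot(q, embedding) / (np.linalg.norm(q) * np.linalg.norm(embedding)))
    ranked = sorted(store, key=lambda item: score(item["embedding"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def answer(query: str, store: list[dict]) -> str:
    """Augment the prompt with retrieved chunks, then generate a grounded answer."""
    context = "\n\n".join(retrieve(query, store))
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer only from the provided context. If the answer is not there, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content

# Usage, given a populated store: print(answer("What is the refund window?", store))
```

Even in this toy version, several independent failure points are visible: chunk quality, retrieval ranking, the k cutoff, and whether the model actually respects the "answer only from context" instruction.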
Analytical exercise (2-3 hours total across days):
Create a cost/architecture analysis document for a hypothetical RAG system:
Scenario: An education company wants to build an "AI tutor" that answers questions using 10,000 textbook PDFs
Your analysis should include:
- Chunking approach recommendation and rationale
- Embedding model selection (OpenAI vs. open-source, dimension count)
- Vector database choice and why
- Estimated costs: embedding generation (one-time), storage (ongoing), retrieval+generation (per query)
- Identified failure modes specific to this use case
- When this should be RAG vs. fine-tuning vs. long-context Claude
This exercise forces you to make real architectural decisions with trade-off justifications.
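A few lines of Python (or a spreadsheet) are enough to structure the estimate. Every price and usage figure below is a placeholder assumption to be replaced with current provider pricing and your own measurements; the point is the shape of the calculation, not the totals:

```python
# --- Corpus and usage assumptions (all placeholders) ---
NUM_PDFS = 10_000
TOKENS_PER_PDF = 50_000            # average extractable text per textbook PDF (assumption)
QUERIES_PER_MONTH = 50_000
CONTEXT_TOKENS_PER_QUERY = 4_000   # retrieved chunks + system prompt + question
OUTPUT_TOKENS_PER_QUERY = 400

# --- Placeholder prices in $ per million tokens; check current pricing pages ---
EMBED_PRICE = 0.02
LLM_INPUT_PRICE = 3.00
LLM_OUTPUT_PRICE = 15.00

corpus_tokens = NUM_PDFS * TOKENS_PER_PDF
one_time_embedding = corpus_tokens * EMBED_PRICE / 1_000_000

per_query = (CONTEXT_TOKENS_PER_QUERY * LLM_INPUT_PRICE
             + OUTPUT_TOKENS_PER_QUERY * LLM_OUTPUT_PRICE) / 1_000_000
monthly_inference = per_query * QUERIES_PER_MONTH

print(f"One-time corpus embedding: ${one_time_embedding:,.2f}")
print(f"Cost per query:            ${per_query:.4f}")
print(f"Monthly inference cost:    ${monthly_inference:,.0f}")
# Not modeled here: vector storage, re-embedding on content updates, retries, evaluation runs.
```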
Daily breakdown:
- Day 4: Microsoft tutorial sections 1-3 (RAG overview, architecture, ingestion)
- Day 5: Microsoft tutorial sections 4-5 (retrieval, generation), begin cost exercise
- Day 6: DigitalOcean "Beyond Vector Databases" article, continue cost exercise
- Day 7: Complete cost exercise, create reference architecture diagram
Days 8-10: Token Economics & Context Management
Primary Resource:
Karpathy's "Let's build the GPT Tokenizer" (YouTube, ~2 hours)
- https://www.youtube.com/watch?v=zduSFxRajkE
- Watch at 1.25-1.5x speed
- Focus on: why tokenization matters, BPE algorithm intuition, token count implications
- Skip the detailed coding sections unless you find them clarifying
Supplementary:
Anthropic's Prompt Caching Documentation
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- ~15 minutes of reading
- Focus on cost reduction strategies and when caching helps
OpenAI's Token Counting Tool
- https://platform.openai.com/tokenizer
- Experiment with different text types (code, prose, non-English)
- Notice token count variations
Why this matters: Token economics drive cost and feasibility. You need to understand: (1) why tokens ≠ words, (2) how context window size affects what's possible, (3) prompt caching mechanics and cost implications, (4) when to use smaller vs. larger context windows, and (5) cost optimization strategies (batching, caching, model selection). This knowledge is critical for evaluating whether proposed solutions are economically viable at scale.
Key concepts:
- Tokenization fundamentals: Byte Pair Encoding (BPE), subword tokens
- Token count implications: code uses more tokens than prose, non-English text varies
- Context window vs. useful context (just because you can fit 200k tokens doesn't mean you should)
- Prompt caching: how it works, when it saves money, what doesn't cache
- Cost structures: input tokens, output tokens, cached tokens, rate limits
- Conversation design for efficiency: minimize redundant context, use caching strategically
- Model selection based on task: when to use Haiku vs. Sonnet vs. Opus
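If you want to run the same experiment locally rather than in the web tokenizer, this minimal sketch uses OpenAI's tiktoken library (pip install tiktoken); cl100k_base is one common encoding, and the sample strings are arbitrary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose":       "The quick brown fox jumps over the lazy dog.",
    "code":        "def squares(x):\n    return {k: v ** 2 for k, v in x.items()}",
    "non-English": "La tokenisation découpe le texte en sous-mots, pas en mots entiers.",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label:12s} {len(text.split()):3d} words -> {len(tokens):3d} tokens")
```

Code and non-English text typically cost noticeably more tokens per word than English prose, which feeds directly into the cost model in the exercise below.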
Practical exercise (1-2 hours):
Build a token economics spreadsheet:
Scenario: Customer service chatbot handling 100k conversations/month
Variables to model:
- Average conversation length (turns)
- Context needed per turn (customer history, product docs)
- With/without prompt caching
- Model selection (Haiku, Sonnet, Opus)
Calculate:
- Monthly token consumption (input, output, cached)
- Monthly costs per model choice
- Break-even point for caching implementation
- ROI of conversation design optimization
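If a spreadsheet feels clumsy, the same model fits in a short script. The prices, cache hit rate, and turn counts below are placeholder assumptions, and the model names are generic stand-ins rather than real SKUs:

```python
CONVERSATIONS_PER_MONTH = 100_000
TURNS_PER_CONVERSATION = 6
INPUT_TOKENS_PER_TURN = 1_500      # customer history + product docs + question (assumption)
OUTPUT_TOKENS_PER_TURN = 300
CACHE_HIT_RATE = 0.8               # fraction of input tokens served from the prompt cache

PRICES = {                         # $ per million tokens -- placeholder numbers only
    "small-model":  {"input": 1.00, "cached": 0.10, "output": 5.00},
    "medium-model": {"input": 3.00, "cached": 0.30, "output": 15.00},
}

def monthly_cost(model: str, use_cache: bool) -> float:
    p = PRICES[model]
    turns = CONVERSATIONS_PER_MONTH * TURNS_PER_CONVERSATION
    input_tokens = turns * INPUT_TOKENS_PER_TURN
    output_tokens = turns * OUTPUT_TOKENS_PER_TURN
    cached = input_tokens * CACHE_HIT_RATE if use_cache else 0
    uncached = input_tokens - cached
    return (uncached * p["input"] + cached * p["cached"] + output_tokens * p["output"]) / 1_000_000

for model in PRICES:
    for use_cache in (False, True):
        print(f"{model:13s} caching={use_cache!s:5s} ${monthly_cost(model, use_cache):>10,.0f}/month")
```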
Daily breakdown:
- Day 8: Karpathy tokenizer video (first half)
- Day 9: Karpathy tokenizer video (second half), Anthropic caching docs
- Day 10: Token economics spreadsheet exercise, experiment with OpenAI tokenizer
Days 11-13: Tool Usage & Agentic Patterns
Primary Resource:
Anthropic's "Tool Use (Function Calling)" Documentation
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
- Comprehensive official docs
- Read all sections, ~1 hour total
- Focus on: how models decide to use tools, tool call formatting, error handling, multi-tool orchestration
Supplementary Video:
"Function Calling is All You Need -- Full Day Workshop with Ilan Bigio of OpenAI" (YouTube, ~1.75 hr)
- https://www.youtube.com/watch?v=KUEmEb71vzQ
- Workshop from someone who built OpenAI's 2024 AI phone ordering demo and led technical development for Swarm
- Covers: function calling history, agent loop architecture, memory management, delegation patterns, async delegation, and dynamic agent-written tools
- Includes hands-on demonstrations with the Swarm framework
Why this matters: "AI agents" proposals often boil down to tool usage patterns. You need to understand: (1) how LLMs decide when to use tools vs. respond directly, (2) tool orchestration complexity, (3) error handling and retry logic, (4) security/safety considerations, and (5) realistic team requirements for building and maintaining agentic systems. This helps you evaluate "we'll build an AI agent that..." proposals.
Key concepts:
- Tool use mechanics: LLM outputs structured tool call, environment executes, result fed back to LLM
- Decision logic: when models choose to use tools (based on training + instructions)
- Tool definition patterns: clear descriptions, parameter schemas, example use cases
- Multi-tool orchestration: sequential vs. parallel, dependency handling
- Error handling: tool failures, malformed responses, retry strategies
- Security: tool access controls, rate limiting, sandboxing
- Trade-offs: tool use vs. in-context information (latency, complexity, reliability)
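The shape of the loop matters more than the details, so here is a minimal sketch of a single tool-use round trip with the anthropic Python SDK. It assumes an ANTHROPIC_API_KEY in the environment; the model name, tool definition, and LMS lookup stub are illustrative, not taken from the docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_student_record",
    "description": "Look up a student's submission history by student ID.",
    "input_schema": {
        "type": "object",
        "properties": {"student_id": {"type": "string"}},
        "required": ["student_id"],
    },
}]

def lookup_in_your_lms(student_id: str) -> str:
    return "Assignment 3: submitted 2024-11-02, on time."  # hypothetical stand-in for a real LMS call

messages = [{"role": "user", "content": "Has student S-1042 submitted assignment 3?"}]
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# If the model decided a tool is needed, it returns a structured tool call;
# your code executes the tool and feeds the result back for the final answer.
if response.stop_reason == "tool_use":
    call = next(block for block in response.content if block.type == "tool_use")
    result = lookup_in_your_lms(call.input["student_id"])
    messages += [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": call.id, "content": result},
        ]},
    ]
    final = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024, tools=tools, messages=messages,
    )
    print(final.content[0].text)
```

Everything outside the two API calls -- executing the tool, handling its failures, deciding when to retry -- is ordinary software that your team owns and maintains.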
Architectural analysis exercise (2-3 hours):
Evaluate a proposed "AI agent" architecture:
Scenario: "We'll build an AI agent that monitors student assignment submissions, checks for plagiarism using TurnItIn API, analyzes writing quality, and posts feedback to our LMS"
Your evaluation should cover:
- Tool requirements: what tools does the LLM need?
- Orchestration complexity: sequential dependencies, error cases
- Developer requirements: junior vs. senior, full-stack vs. specialized
- Testing strategy: unit tests for tools, integration tests for orchestration
- Failure modes: API downtime, ambiguous cases, hallucinated tool calls
- Cost structure: LLM calls, tool API costs, error retries
- Alternative approaches: could this be simpler without "agentic" framing?
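For the "could this be simpler?" question, it can help to sketch the non-agentic version: a fixed pipeline in which the LLM is one step rather than the orchestrator. Every helper below is a hypothetical stub standing in for a real integration:

```python
from dataclasses import dataclass

@dataclass
class Submission:
    student_id: str
    text: str

def check_plagiarism(sub: Submission) -> float:
    return 0.03  # stub: would call the plagiarism-check API and return a similarity score

def analyze_writing(sub: Submission) -> str:
    return "Clear thesis; citations need work."  # stub: the one step where an LLM call earns its cost

def post_feedback(sub: Submission, similarity: float, quality: str) -> None:
    print(f"[LMS] {sub.student_id}: similarity={similarity:.0%}; {quality}")  # stub: LMS API call

def process_submission(sub: Submission) -> None:
    # Fixed sequence, explicit error surface, no model deciding which tool to call next.
    post_feedback(sub, check_plagiarism(sub), analyze_writing(sub))

process_submission(Submission("S-1042", "Sample essay text"))
```

If the steps and their order are known in advance, the deterministic version is usually cheaper to run, easier to test, and easier to staff.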
Daily breakdown:
- Day 11: Anthropic tool use docs (introduction, basics, advanced patterns)
- Day 12: Video supplement, start architectural analysis exercise
- Day 13: Complete architectural analysis exercise, create decision framework
Days 14-15: Critical Evaluation & BS Detection
Primary Resources:
"Hallucination Detection Strategies" (multiple sources to triangulate)
- Search for: "LLM hallucination detection 2024 2025"
- Read 2-3 recent articles/papers (30-45 min total)
- Focus on practical detection methods, not just definitions
LLM Benchmarks & Leaderboards
- Browse: LMSYS Chatbot Arena (https://chat.lmsys.org/?leaderboard)
- Read: How to interpret Elo ratings, what benchmarks actually measure
- ~30 minutes
Anthropic's "Evaluating Claude" Documentation
- https://platform.claude.com/docs/en/test-and-evaluate/develop-tests
- Practical guide to output validation
- ~20 minutes
Why this matters: The most valuable skill is knowing when AI is appropriate vs. traditional approaches, and detecting when proposals make impossible claims. You need to: (1) recognize hallucination patterns, (2) understand output validation techniques, (3) know what benchmarks actually measure vs. what they claim, and (4) identify when proposals contradict technical constraints.
Key concepts:
- Hallucination types: factual errors, fabricated citations, confident incorrectness
- Detection strategies: fact-checking, consistency checks, citation verification, ensemble methods
- When to use AI: ambiguous/creative tasks, natural language interface valuable, acceptable error rate
- When NOT to use AI: need 100% accuracy, deterministic logic required, liability for errors
- Benchmark limitations: what MMLU/HumanEval/etc. actually test, distribution shift
- Impossible claims: "perfect accuracy on subjective tasks," "eliminates need for human review," "real-time without lag"
- Output validation: structured outputs, confidence scoring, human-in-the-loop
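As one concrete flavor of output validation, here is a minimal grounding check: verify that every source the model cites was actually in the retrieved context. The citation format and document IDs are made up for illustration:

```python
import re

retrieved_doc_ids = {"doc-101", "doc-204", "doc-309"}  # what the retriever actually returned

model_answer = (
    "Refunds are issued within 30 days [doc-204]. "
    "International students must also file form F-12 [doc-887]."
)

cited = set(re.findall(r"\[(doc-\d+)\]", model_answer))
fabricated = cited - retrieved_doc_ids

if fabricated:
    print(f"Flag for human review -- cited sources not in retrieved context: {sorted(fabricated)}")
```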
Synthesis exercise (2-3 hours):
Create a "BS Detection Checklist" for AI proposals:
Section 1: Technical Red Flags
- Claims that violate known constraints (e.g., "real-time analysis of video with no latency")
- Misunderstanding of model capabilities ("it understands like humans")
- Benchmark misinterpretation
Section 2: Economic Red Flags
- Underestimated token costs at scale
- Ignored API rate limits
- Missing ongoing maintenance costs
Section 3: Architectural Red Flags
- Over-complicated "agent" when simple prompt would work
- RAG when fine-tuning makes more sense
- No validation/testing strategy
Section 4: Evaluation Questions to Ask
- How will you measure success?
- What's your validation strategy?
- What happens when it's wrong?
- What's the cost at 10x scale?
- Why AI instead of traditional approach?
Daily breakdown:
- Day 14: Read hallucination detection articles, explore LLM leaderboards, Anthropic eval docs
- Day 15: Create BS Detection Checklist, review all Phase 1 notes
Phase 1 Summary & Pre-Meeting Prep
After Day 15, you should be able to:
- Explain how RAG works and when it's appropriate vs. alternatives
- Estimate token costs and identify economic red flags in proposals
- Evaluate tool usage architectures and spot over-complicated "agent" designs
- Ask intelligent questions about validation, failure modes, and scalability
- Recognize claims that contradict technical constraints
Pre-meeting review (30 minutes):
- Skim your notes from Days 1-15
- Review your RAG cost analysis exercise
- Review your BS Detection Checklist
- Prepare 3-5 clarifying questions you might ask about AI proposals
Reference Materials (Keep Accessible)
Essential Documentation
| Resource | Purpose | URL |
|---|---|---|
| Anthropic API Docs | Tool use, caching, models | https://docs.anthropic.com |
| OpenAI Platform Docs | Embeddings, fine-tuning | https://platform.openai.com/docs |
| MCP Specification | Protocol details | https://modelcontextprotocol.io |
| Pinecone RAG Guide | RAG best practices | https://www.pinecone.io/learn/ |
Video Resources
- Karpathy's "Deep Dive into LLMs" (3.5 hours): https://youtube.com/watch?v=7xTGNNLPyMI
- Karpathy's "Let's Build GPT Tokenizer" (2 hours): https://youtube.com/watch?v=zduSFxRajkE
- 3Blue1Brown Neural Networks Playlist: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- StatQuest Machine Learning Playlist: https://www.youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF
Cost Calculators & Tools
- OpenAI Tokenizer: https://platform.openai.com/tokenizer
- Anthropic Pricing: https://www.anthropic.com/pricing
- Model comparison (LMSYS): https://chat.lmsys.org/?leaderboard
Your Created Materials
Keep these in an accessible reference folder:
- RAG Cost Analysis Exercise (Phase 1, Days 4-7)
- Token Economics Spreadsheet (Phase 1, Days 8-10)
- BS Detection Checklist (Phase 1, Days 14-15)
- MCP Implementation Assessment (Phase 3, Days 1-3)
- Production RAG Checklist (Phase 4, Week 6)
- All Phase 5 Decision Frameworks
Pacing Notes & Adjustments
If you're moving faster:
- Deep dive into Karpathy's full "Neural Networks: Zero to Hero" course
- Implement actual RAG system (LangChain + Chroma + OpenAI embeddings)
- Take fast.ai full Practical Deep Learning course
- Build actual MCP server for a real use case
If you're moving slower:
- Phase 1 is the priority—extend it to 3 weeks if needed
- Phase 2 (foundations) can be compressed or skipped if time-pressured
- Phases 4-5 can be done "on-demand" when you encounter those specific needs
- Focus on exercises over reading—hands-on builds intuition faster
The key metric: Can you evaluate an AI solution proposal and write a 1-page technical assessment covering viability, cost structure, failure modes, alternative approaches, and team requirements? That's the goal.
Cost Summary
| Resource | Cost |
|---|---|
| All video courses (YouTube, fast.ai, Coursera auditing) | Free |
| Documentation (Anthropic, OpenAI, Microsoft, etc.) | Free |
| API experimentation (OpenAI, Anthropic playgrounds) | ~$5-10 (optional) |
| Optional: Coursera verified certificates | ~$49 each |
| Optional: Hands-on RAG implementation | ~$20 (API credits) |
Minimum cost: $0 (all core resources are free; API experimentation is optional)
Success Indicators by Phase
After Phase 1 (Pre-meeting):
- You can explain RAG to a non-technical executive and identify when it's appropriate
- You can estimate token costs for a proposed AI solution and spot economic red flags
- You can distinguish between genuine architectural complexity and unnecessary "agentic" framing
- You have a checklist of questions to ask about any AI proposal
After Phase 2 (Foundations):
- You understand why fine-tuning differs from RAG at a mechanical level
- You can explain when more training data helps vs. when it doesn't
- You understand model behavior (sampling, temperature) well enough to configure systems appropriately
After Phase 3 (MCP & Claude Tooling):
- You can evaluate MCP server proposals and estimate implementation effort
- You understand when to use system prompts/skills vs. RAG for knowledge injection
- You know what's possible with computer use and what requires custom infrastructure
After Phase 4 (Production RAG):
- You can design evaluation frameworks for RAG systems
- You understand production considerations beyond MVP (monitoring, iteration, cost optimization)
- You can recommend specific architectural patterns for RAG use cases
After Phase 5 (Decision Frameworks):
- You have reusable frameworks for rapid evaluation of AI proposals
- You can generate technical assessments of proposals in under 30 minutes
- You can confidently recommend offshore-suitable vs. senior-required work
- You maintain technical credibility while translating between technical and business stakeholders
Meta Notes on Learning Approach
Why this structure:
- Front-loaded actionability: Phase 1 gets you to "credible evaluator" in 15 days, even though putting applications before foundations is pedagogically backwards
- Foundations when they're most useful: After seeing practical applications, foundations make more sense
- Exercise-heavy: Each phase includes hands-on work because concepts without application don't stick
- Reference-optimized: Materials chosen for ongoing utility, not just one-time reading
- Economic focus: Unusual for learning plans, but critical for your role as solution architect
Learning philosophy: You're not trying to become an ML engineer—you're building "informed buyer" expertise. The goal is knowing enough to ask the right questions, spot impossible claims, and translate between technical possibilities and business requirements. This requires deeper understanding than typical "intro to AI" content, but different depth than an implementer needs.
