Learning Plan: AI/LLM Practical Applications for Solution Architecture
Goal: Develop a detailed conceptual understanding of AI solution architectures to evaluate technical and economic viability, maintain technical credibility as a sanity-check resource, and position yourself strategically in an AI-transformed landscape.
Target Depth: "AI-literate decision-maker and architect" - sufficient understanding to evaluate whether proposed solutions make sense before they're built, identify architectural limitations vs. implementation problems, and translate between technical possibilities and business requirements.
Time Commitment: 1 hour/day, sustained learning
Background: 15 years in education data/tech consulting, familiar with Karpathy's LLM content, regular Claude/ChatGPT user, data engineering background
Note on Structure: Phase 1 is designed to be completable in ~15 days before your strategy meeting. It front-loads actionable architectural knowledge. Phases 2-5 build deeper foundations and expand into specialized topics.
Phase 1: Fast Track to AI Architecture Credibility (Weeks 1-2, ~15 days)
Purpose: Equip you to evaluate RAG proposals, understand token economics, and assess tool usage architectures before your meeting. You'll learn just enough about embeddings and LLM mechanics to ask intelligent questions and spot architectural red flags.
Days 1-3: Embeddings & Vector Search Fundamentals
Primary Resource:
- MIT Lecture: "Embeddings, RAG, and Vector Databases" (Mike Cafarella, March 2024)
- PDF: https://dsg.csail.mit.edu/6.S079/lectures/lec10-s24.pdf
- This is a ~50-slide academic lecture deck
- Read through systematically, ~20 minutes per session
- Focus on slides covering: what embeddings are, how similarity search works, trade-offs in embedding model selection
Supplementary Video:
- "What are Embeddings" by StatQuest (search YouTube, ~15 min)
- Watch at 1.25x speed
- Provides intuitive visual explanation of how words become vectors
Why this matters: When someone proposes "embedding all your documents in a vector database," you need to understand: (1) why vectors enable semantic search vs. keyword search, (2) what "distance" means and why it matters, (3) embedding model implications (size, cost, vocabulary), and (4) retrieval precision limitations. This is the foundation for evaluating RAG proposals.
Key concepts to master:
- Embeddings as semantic representations (king - man + woman ≈ queen)
- Vector similarity metrics (cosine similarity, euclidean distance)
- Embedding model trade-offs (performance vs. cost vs. dimensionality)
- Why embeddings can't be reversed to recover the original text
- Vocabulary limitations and out-of-vocabulary handling
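If it helps to make "distance" concrete, here is a minimal sketch of cosine similarity using NumPy and made-up 4-dimensional vectors (real embedding models output hundreds to thousands of dimensions); the phrases in the comments are illustrative only:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity compares direction, not magnitude: closer to 1.0 = more similar meaning."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings"; real models produce 384-3,072 dimensions.
refund_policy = np.array([0.9, 0.1, 0.0, 0.2])   # e.g. "tuition refund policy"
get_fees_back = np.array([0.8, 0.2, 0.1, 0.3])   # e.g. "how do I get my fees back?"
parking_map   = np.array([0.1, 0.9, 0.7, 0.0])   # e.g. "campus parking map"

print(cosine_similarity(refund_policy, get_fees_back))  # high: related meaning, no shared keywords
print(cosine_similarity(refund_policy, parking_map))    # low: unrelated topic
```

The first pair shares no keywords yet sits close together in vector space; that is the "search by meaning" behavior that keyword search cannot provide.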
Hands-on exercise (1-2 hours):
Option A: Use OpenAI's Embeddings API playground
- Create embeddings for 5-10 related phrases
- Visualize similarity scores
- Change one word and observe embedding changes
Option B: Follow Microsoft's "Generate Embeddings" tutorial
- https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/rag/rag-generate-embeddings
- Read through code examples without necessarily running them
- Focus on understanding the pipeline: text → chunks → embeddings → storage
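For reference while reading, here is a minimal sketch of that pipeline in Python. It assumes the openai package and an OPENAI_API_KEY in your environment; the chunk sizes and the text-embedding-3-small model are illustrative choices, not recommendations:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size chunking with overlap; production systems often chunk semantically."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "Embeddings map text to vectors so that similar meanings land near each other. " * 40
chunks = chunk_text(document)

# One API call can embed a batch of chunks.
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)

# "Storage": in production this goes into a vector database; a list is enough to see the shape.
store = [{"text": c, "embedding": d.embedding} for c, d in zip(chunks, response.data)]
print(f"{len(store)} chunks embedded, {len(store[0]['embedding'])} dimensions each")
```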
Daily breakdown:
- Day 1: MIT lecture slides 1-25 (embeddings fundamentals), StatQuest video
- Day 2: MIT lecture slides 26-50 (vector databases, similarity search)
- Day 3: Hands-on exercise + review notes
Success metric: Can you explain to a colleague why "search by meaning" works differently from "search by keywords" and what the cost/performance implications are?
Days 4-7: RAG Architecture End-to-End
Primary Resource:
- Microsoft Generative AI for Beginners: RAG & Vector Databases
- https://github.com/microsoft/generative-ai-for-beginners/blob/main/15-rag-and-vector-databases/README.md
- Free, comprehensive, includes code examples
- Read thoroughly, ~45 minutes per session
- Skip the detailed implementation code, focus on architecture diagrams and decision points
Supplementary:
"Beyond Vector Databases: RAG Without Embeddings" (DigitalOcean, Aug 2025)
- https://www.digitalocean.com/community/tutorials/beyond-vector-databases-rag-without-embeddings
- Read sections on BM25, GraphRAG, Prompt-RAG
- ~30 minutes
- Helps you understand when RAG without embeddings makes sense
Why this matters: RAG is the most common proposal you'll evaluate. You need to understand: (1) the full pipeline (ingestion → chunking → embedding → retrieval → generation), (2) where costs accumulate, (3) common failure modes (retrieval precision, context window limits), (4) when RAG is appropriate vs. fine-tuning or long-context models, and (5) architectural alternatives.
Key concepts:
- RAG pipeline stages: prepare → embed → store → retrieve → augment → generate
- Chunking strategies (fixed size vs. semantic, overlap considerations)
- Vector database options (Pinecone, Chroma, Weaviate, Qdrant, pgvector) and selection criteria
- Retrieval strategies: similarity search, hybrid search (keyword + semantic), reranking
- Cost structure: embedding API calls, vector storage, retrieval latency, LLM inference
- Failure modes: poor chunking loses context, retrieval misses relevant docs, hallucination despite grounding
- When to use: dynamic knowledge bases, need for citations, cost-prohibitive to fine-tune
- When NOT to use: stable/small knowledge base, need deterministic responses, ultra-low latency requirements
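To make the retrieve → augment → generate stages concrete, here is a minimal sketch that builds on the `store` list of {"text", "embedding"} dicts from the Day 1-3 embedding sketch. It assumes the openai package and an API key; the model names are illustrative, and a real system would add hybrid search, reranking, and citation handling:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def retrieve(query: str, store: list[dict], k: int = 3) -> list[str]:
    """Embed the query, rank stored chunks by cosine similarity, return the top k."""
    q = client.embeddings.create(model="text-embedding-3-small", input=[query]).data[0].embedding
    def score(embedding: list[float]) -> float:
        return float(np.dot(q, embedding) / (np.linalg.norm(q) * np.linalg.norm(embedding)))
    ranked = sorted(store, key=lambda item: score(item["embedding"]), reverse=True)
    return [item["text"] for item in ranked[:k]]

def answer(query: str, store: list[dict]) -> str:
    """Augment the prompt with retrieved chunks, then generate a grounded answer."""
    context = "\n\n".join(retrieve(query, store))
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer only from the provided context. If the answer is not there, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return completion.choices[0].message.content

# Usage, given a populated store: print(answer("What is the refund window?", store))
```

Even in this toy version, several independent failure points are visible: chunk quality, retrieval ranking, the k cutoff, and whether the model actually respects the "answer only from context" instruction.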
Analytical exercise (2-3 hours total across days):
Create a cost/architecture analysis document for a hypothetical RAG system:
Scenario: An education company wants to build an "AI tutor" that answers questions using 10,000 textbook PDFs
Your analysis should include:
- Chunking approach recommendation and rationale
- Embedding model selection (OpenAI vs. open-source, dimension count)
- Vector database choice and why
- Estimated costs: embedding generation (one-time), storage (ongoing), retrieval+generation (per query)
- Identified failure modes specific to this use case
- When this should be RAG vs. fine-tuning vs. long-context Claude
This exercise forces you to make real architectural decisions with trade-off justifications.
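A few lines of Python (or a spreadsheet) are enough to structure the estimate. Every price and usage figure below is a placeholder assumption to be replaced with current provider pricing and your own measurements; the point is the shape of the calculation, not the totals:

```python
# --- Corpus and usage assumptions (all placeholders) ---
NUM_PDFS = 10_000
TOKENS_PER_PDF = 50_000            # average extractable text per textbook PDF (assumption)
QUERIES_PER_MONTH = 50_000
CONTEXT_TOKENS_PER_QUERY = 4_000   # retrieved chunks + system prompt + question
OUTPUT_TOKENS_PER_QUERY = 400

# --- Placeholder prices in $ per million tokens; check current pricing pages ---
EMBED_PRICE = 0.02
LLM_INPUT_PRICE = 3.00
LLM_OUTPUT_PRICE = 15.00

corpus_tokens = NUM_PDFS * TOKENS_PER_PDF
one_time_embedding = corpus_tokens * EMBED_PRICE / 1_000_000

per_query = (CONTEXT_TOKENS_PER_QUERY * LLM_INPUT_PRICE
             + OUTPUT_TOKENS_PER_QUERY * LLM_OUTPUT_PRICE) / 1_000_000
monthly_inference = per_query * QUERIES_PER_MONTH

print(f"One-time corpus embedding: ${one_time_embedding:,.2f}")
print(f"Cost per query:            ${per_query:.4f}")
print(f"Monthly inference cost:    ${monthly_inference:,.0f}")
# Not modeled here: vector storage, re-embedding on content updates, retries, evaluation runs.
```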
Daily breakdown:
- Day 4: Microsoft tutorial sections 1-3 (RAG overview, architecture, ingestion)
- Day 5: Microsoft tutorial sections 4-5 (retrieval, generation), begin cost exercise
- Day 6: DigitalOcean "Beyond Vector Databases" article, continue cost exercise
- Day 7: Complete cost exercise, create reference architecture diagram
Days 8-10: Token Economics & Context Management
Primary Resource:
Karpathy's "Let's build the GPT Tokenizer" (YouTube, ~2 hours)
- https://www.youtube.com/watch?v=zduSFxRajkE
- Watch at 1.25-1.5x speed
- Focus on: why tokenization matters, BPE algorithm intuition, token count implications
- Skip the detailed coding sections unless you find them clarifying
Supplementary:
Anthropic's Prompt Caching Documentation
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- ~15 minutes of reading
- Focus on cost reduction strategies and when caching helps
OpenAI's Token Counting Tool
- https://platform.openai.com/tokenizer
- Experiment with different text types (code, prose, non-English)
- Notice token count variations
Why this matters: Token economics drive cost and feasibility. You need to understand: (1) why tokens ≠ words, (2) how context window size affects what's possible, (3) prompt caching mechanics and cost implications, (4) when to use smaller vs. larger context windows, and (5) cost optimization strategies (batching, caching, model selection). This knowledge is critical for evaluating whether proposed solutions are economically viable at scale.
Key concepts:
- Tokenization fundamentals: Byte Pair Encoding (BPE), subword tokens
- Token count implications: code uses more tokens than prose, non-English text varies
- Context window vs. useful context (just because you can fit 200k tokens doesn't mean you should)
- Prompt caching: how it works, when it saves money, what doesn't cache
- Cost structures: input tokens, output tokens, cached tokens, rate limits
- Conversation design for efficiency: minimize redundant context, use caching strategically
- Model selection based on task: when to use Haiku vs. Sonnet vs. Opus
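If you want to run the same experiment locally rather than in the web tokenizer, this minimal sketch uses OpenAI's tiktoken library (pip install tiktoken); cl100k_base is one common encoding, and the sample strings are arbitrary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "prose":       "The quick brown fox jumps over the lazy dog.",
    "code":        "def squares(x):\n    return {k: v ** 2 for k, v in x.items()}",
    "non-English": "La tokenisation découpe le texte en sous-mots, pas en mots entiers.",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label:12s} {len(text.split()):3d} words -> {len(tokens):3d} tokens")
```

Code and non-English text typically cost noticeably more tokens per word than English prose, which feeds directly into the cost model in the exercise below.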
Practical exercise (1-2 hours):
Build a token economics spreadsheet:
Scenario: Customer service chatbot handling 100k conversations/month
Variables to model:
- Average conversation length (turns)
- Context needed per turn (customer history, product docs)
- With/without prompt caching
- Model selection (Haiku, Sonnet, Opus)
Calculate:
- Monthly token consumption (input, output, cached)
- Monthly costs per model choice
- Break-even point for caching implementation
- ROI of conversation design optimization
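If a spreadsheet feels clumsy, the same model fits in a short script. The prices, cache hit rate, and turn counts below are placeholder assumptions, and the model names are generic stand-ins rather than real SKUs:

```python
CONVERSATIONS_PER_MONTH = 100_000
TURNS_PER_CONVERSATION = 6
INPUT_TOKENS_PER_TURN = 1_500      # customer history + product docs + question (assumption)
OUTPUT_TOKENS_PER_TURN = 300
CACHE_HIT_RATE = 0.8               # fraction of input tokens served from the prompt cache

PRICES = {                         # $ per million tokens -- placeholder numbers only
    "small-model":  {"input": 1.00, "cached": 0.10, "output": 5.00},
    "medium-model": {"input": 3.00, "cached": 0.30, "output": 15.00},
}

def monthly_cost(model: str, use_cache: bool) -> float:
    p = PRICES[model]
    turns = CONVERSATIONS_PER_MONTH * TURNS_PER_CONVERSATION
    input_tokens = turns * INPUT_TOKENS_PER_TURN
    output_tokens = turns * OUTPUT_TOKENS_PER_TURN
    cached = input_tokens * CACHE_HIT_RATE if use_cache else 0
    uncached = input_tokens - cached
    return (uncached * p["input"] + cached * p["cached"] + output_tokens * p["output"]) / 1_000_000

for model in PRICES:
    for use_cache in (False, True):
        print(f"{model:13s} caching={use_cache!s:5s} ${monthly_cost(model, use_cache):>10,.0f}/month")
```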
Daily breakdown:
- Day 8: Karpathy tokenizer video (first half)
- Day 9: Karpathy tokenizer video (second half), Anthropic caching docs
- Day 10: Token economics spreadsheet exercise, experiment with OpenAI tokenizer
Days 11-13: Tool Usage & Agentic Patterns
Primary Resource:
Anthropic's "Tool Use (Function Calling)" Documentation
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/overview
- Comprehensive official docs
- Read all sections, ~1 hour total
- Focus on: how models decide to use tools, tool call formatting, error handling, multi-tool orchestration
Supplementary Video:
"Function Calling is All You Need -- Full Day Workshop with Ilan Bigio of OpenAI" (YouTube, ~1.75 hr)
- https://www.youtube.com/watch?v=KUEmEb71vzQ
- Workshop from someone who built OpenAI's 2024 AI phone ordering demo and led technical development for Swarm
- Covers: function calling history, agent loop architecture, memory management, delegation patterns, async delegation, and dynamic agent-written tools
- Includes hands-on demonstrations with the Swarm framework
Why this matters: "AI agents" proposals often boil down to tool usage patterns. You need to understand: (1) how LLMs decide when to use tools vs. respond directly, (2) tool orchestration complexity, (3) error handling and retry logic, (4) security/safety considerations, and (5) realistic team requirements for building and maintaining agentic systems. This helps you evaluate "we'll build an AI agent that..." proposals.
Key concepts:
- Tool use mechanics: LLM outputs structured tool call, environment executes, result fed back to LLM
- Decision logic: when models choose to use tools (based on training + instructions)
- Tool definition patterns: clear descriptions, parameter schemas, example use cases
- Multi-tool orchestration: sequential vs. parallel, dependency handling
- Error handling: tool failures, malformed responses, retry strategies
- Security: tool access controls, rate limiting, sandboxing
- Trade-offs: tool use vs. in-context information (latency, complexity, reliability)
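The shape of the loop matters more than the details, so here is a minimal sketch of a single tool-use round trip with the anthropic Python SDK. It assumes an ANTHROPIC_API_KEY in the environment; the model name, tool definition, and LMS lookup stub are illustrative, not taken from the docs:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_student_record",
    "description": "Look up a student's submission history by student ID.",
    "input_schema": {
        "type": "object",
        "properties": {"student_id": {"type": "string"}},
        "required": ["student_id"],
    },
}]

def lookup_in_your_lms(student_id: str) -> str:
    return "Assignment 3: submitted 2024-11-02, on time."  # hypothetical stand-in for a real LMS call

messages = [{"role": "user", "content": "Has student S-1042 submitted assignment 3?"}]
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # illustrative model name
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# If the model decided a tool is needed, it returns a structured tool call;
# your code executes the tool and feeds the result back for the final answer.
if response.stop_reason == "tool_use":
    call = next(block for block in response.content if block.type == "tool_use")
    result = lookup_in_your_lms(call.input["student_id"])
    messages += [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [
            {"type": "tool_result", "tool_use_id": call.id, "content": result},
        ]},
    ]
    final = client.messages.create(
        model="claude-sonnet-4-20250514", max_tokens=1024, tools=tools, messages=messages,
    )
    print(final.content[0].text)
```

Everything outside the two API calls -- executing the tool, handling its failures, deciding when to retry -- is ordinary software that your team owns and maintains.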
Architectural analysis exercise (2-3 hours):
Evaluate a proposed "AI agent" architecture:
Scenario: "We'll build an AI agent that monitors student assignment submissions, checks for plagiarism using TurnItIn API, analyzes writing quality, and posts feedback to our LMS"
Your evaluation should cover:
- Tool requirements: what tools does the LLM need?
- Orchestration complexity: sequential dependencies, error cases
- Developer requirements: junior vs. senior, full-stack vs. specialized
- Testing strategy: unit tests for tools, integration tests for orchestration
- Failure modes: API downtime, ambiguous cases, hallucinated tool calls
- Cost structure: LLM calls, tool API costs, error retries
- Alternative approaches: could this be simpler without "agentic" framing?
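For the "could this be simpler?" question, it can help to sketch the non-agentic version: a fixed pipeline in which the LLM is one step rather than the orchestrator. Every helper below is a hypothetical stub standing in for a real integration:

```python
from dataclasses import dataclass

@dataclass
class Submission:
    student_id: str
    text: str

def check_plagiarism(sub: Submission) -> float:
    return 0.03  # stub: would call the plagiarism-check API and return a similarity score

def analyze_writing(sub: Submission) -> str:
    return "Clear thesis; citations need work."  # stub: the one step where an LLM call earns its cost

def post_feedback(sub: Submission, similarity: float, quality: str) -> None:
    print(f"[LMS] {sub.student_id}: similarity={similarity:.0%}; {quality}")  # stub: LMS API call

def process_submission(sub: Submission) -> None:
    # Fixed sequence, explicit error surface, no model deciding which tool to call next.
    post_feedback(sub, check_plagiarism(sub), analyze_writing(sub))

process_submission(Submission("S-1042", "Sample essay text"))
```

If the steps and their order are known in advance, the deterministic version is usually cheaper to run, easier to test, and easier to staff.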
Daily breakdown:
- Day 11: Anthropic tool use docs (introduction, basics, advanced patterns)
- Day 12: Video supplement, start architectural analysis exercise
- Day 13: Complete architectural analysis exercise, create decision framework
Days 14-15: Critical Evaluation & BS Detection
Primary Resources:
"Hallucination Detection Strategies" (multiple sources to triangulate)
- Search for: "LLM hallucination detection 2024 2025"
- Read 2-3 recent articles/papers (30-45 min total)
- Focus on practical detection methods, not just definitions
LLM Benchmarks & Leaderboards
- Browse: LMSYS Chatbot Arena (https://chat.lmsys.org/?leaderboard)
- Read: How to interpret Elo ratings, what benchmarks actually measure
- ~30 minutes
Anthropic's "Evaluating Claude" Documentation
- https://platform.claude.com/docs/en/test-and-evaluate/develop-tests
- Practical guide to output validation
- ~20 minutes
Why this matters: The most valuable skill is knowing when AI is appropriate vs. traditional approaches, and detecting when proposals make impossible claims. You need to: (1) recognize hallucination patterns, (2) understand output validation techniques, (3) know what benchmarks actually measure vs. what they claim, and (4) identify when proposals contradict technical constraints.
Key concepts:
- Hallucination types: factual errors, fabricated citations, confident incorrectness
- Detection strategies: fact-checking, consistency checks, citation verification, ensemble methods
- When to use AI: ambiguous/creative tasks, natural language interface valuable, acceptable error rate
- When NOT to use AI: need 100% accuracy, deterministic logic required, liability for errors
- Benchmark limitations: what MMLU/HumanEval/etc. actually test, distribution shift
- Impossible claims: "perfect accuracy on subjective tasks," "eliminates need for human review," "real-time without lag"
- Output validation: structured outputs, confidence scoring, human-in-the-loop
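As one concrete flavor of output validation, here is a minimal grounding check: verify that every source the model cites was actually in the retrieved context. The citation format and document IDs are made up for illustration:

```python
import re

retrieved_doc_ids = {"doc-101", "doc-204", "doc-309"}  # what the retriever actually returned

model_answer = (
    "Refunds are issued within 30 days [doc-204]. "
    "International students must also file form F-12 [doc-887]."
)

cited = set(re.findall(r"\[(doc-\d+)\]", model_answer))
fabricated = cited - retrieved_doc_ids

if fabricated:
    print(f"Flag for human review -- cited sources not in retrieved context: {sorted(fabricated)}")
```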
Synthesis exercise (2-3 hours):
Create a "BS Detection Checklist" for AI proposals:
Section 1: Technical Red Flags
- Claims that violate known constraints (e.g., "real-time analysis of video with no latency")
- Misunderstanding of model capabilities ("it understands like humans")
- Benchmark misinterpretation
Section 2: Economic Red Flags
- Underestimated token costs at scale
- Ignored API rate limits
- Missing ongoing maintenance costs
Section 3: Architectural Red Flags
- Over-complicated "agent" when simple prompt would work
- RAG when fine-tuning makes more sense
- No validation/testing strategy
Section 4: Evaluation Questions to Ask
- How will you measure success?
- What's your validation strategy?
- What happens when it's wrong?
- What's the cost at 10x scale?
- Why AI instead of traditional approach?
Daily breakdown:
- Day 14: Read hallucination detection articles, explore LLM leaderboards, Anthropic eval docs
- Day 15: Create BS Detection Checklist, review all Phase 1 notes
Phase 1 Summary & Pre-Meeting Prep
After Day 15, you should be able to:
- Explain how RAG works and when it's appropriate vs. alternatives
- Estimate token costs and identify economic red flags in proposals
- Evaluate tool usage architectures and spot over-complicated "agent" designs
- Ask intelligent questions about validation, failure modes, and scalability
- Recognize claims that contradict technical constraints
Pre-meeting review (30 minutes):
- Skim your notes from Days 1-15
- Review your RAG cost analysis exercise
- Review your BS Detection Checklist
- Prepare 3-5 clarifying questions you might ask about AI proposals
Reference Materials (Keep Accessible)
Essential Documentation
| Resource | Purpose | URL |
|---|---|---|
| Anthropic API Docs | Tool use, caching, models | https://docs.anthropic.com |
| OpenAI Platform Docs | Embeddings, fine-tuning | https://platform.openai.com/docs |
| MCP Specification | Protocol details | https://modelcontextprotocol.io |
| Pinecone RAG Guide | RAG best practices | https://www.pinecone.io/learn/ |
Video Resources
- Karpathy's "Deep Dive into LLMs" (3.5 hours): https://youtube.com/watch?v=7xTGNNLPyMI
- Karpathy's "Let's Build GPT Tokenizer" (2 hours): https://youtube.com/watch?v=zduSFxRajkE
- 3Blue1Brown Neural Networks Playlist: https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
- StatQuest Machine Learning Playlist: https://www.youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF
Cost Calculators & Tools
- OpenAI Tokenizer: https://platform.openai.com/tokenizer
- Anthropic Pricing: https://www.anthropic.com/pricing
- Model comparison (LMSYS): https://chat.lmsys.org/?leaderboard
Your Created Materials
Keep these in an accessible reference folder:
- RAG Cost Analysis Exercise (Phase 1, Days 4-7)
- Token Economics Spreadsheet (Phase 1, Days 8-10)
- BS Detection Checklist (Phase 1, Days 14-15)
- MCP Implementation Assessment (Phase 3, Days 1-3)
- Production RAG Checklist (Phase 4, Week 6)
- All Phase 5 Decision Frameworks
Pacing Notes & Adjustments
If you're moving faster:
- Deep dive into Karpathy's full "Neural Networks: Zero to Hero" course
- Implement actual RAG system (LangChain + Chroma + OpenAI embeddings)
- Take fast.ai full Practical Deep Learning course
- Build actual MCP server for a real use case
If you're moving slower:
- Phase 1 is the priority—extend it to 3 weeks if needed
- Phase 2 (foundations) can be compressed or skipped if time-pressured
- Phases 4-5 can be done "on-demand" when you encounter those specific needs
- Focus on exercises over reading—hands-on builds intuition faster
The key metric: Can you evaluate an AI solution proposal and write a 1-page technical assessment covering viability, cost structure, failure modes, alternative approaches, and team requirements? That's the goal.
Cost Summary
| Resource | Cost |
|---|---|
| All video courses (YouTube, fast.ai, Coursera auditing) | Free |
| Documentation (Anthropic, OpenAI, Microsoft, etc.) | Free |
| API experimentation (OpenAI, Anthropic playgrounds) | ~$5-10 (optional) |
| Optional: Coursera verified certificates | ~$49 each |
| Optional: Hands-on RAG implementation | ~$20 (API credits) |
Minimum cost: $0 (all core resources are free; API experimentation is optional)
Success Indicators by Phase
After Phase 1 (Pre-meeting):
- You can explain RAG to a non-technical executive and identify when it's appropriate
- You can estimate token costs for a proposed AI solution and spot economic red flags
- You can distinguish between genuine architectural complexity and unnecessary "agentic" framing
- You have a checklist of questions to ask about any AI proposal
After Phase 2 (Foundations):
- You understand why fine-tuning differs from RAG at a mechanical level
- You can explain when more training data helps vs. when it doesn't
- You understand model behavior (sampling, temperature) well enough to configure systems appropriately
After Phase 3 (MCP & Claude Tooling):
- You can evaluate MCP server proposals and estimate implementation effort
- You understand when to use system prompts/skills vs. RAG for knowledge injection
- You know what's possible with computer use and what requires custom infrastructure
After Phase 4 (Production RAG):
- You can design evaluation frameworks for RAG systems
- You understand production considerations beyond MVP (monitoring, iteration, cost optimization)
- You can recommend specific architectural patterns for RAG use cases
After Phase 5 (Decision Frameworks):
- You have reusable frameworks for rapid evaluation of AI proposals
- You can generate technical assessments of proposals in under 30 minutes
- You can confidently recommend offshore-suitable vs. senior-required work
- You maintain technical credibility while translating between technical and business stakeholders
Meta Notes on Learning Approach
Why this structure:
- Front-loaded actionability: Phase 1 gets you to "credible evaluator" in 15 days, even though putting applications before foundations is pedagogically backwards
- Foundations when they're most useful: After seeing practical applications, foundations make more sense
- Exercise-heavy: Each phase includes hands-on work because concepts without application don't stick
- Reference-optimized: Materials chosen for ongoing utility, not just one-time reading
- Economic focus: Unusual for learning plans, but critical for your role as solution architect
Learning philosophy: You're not trying to become an ML engineer—you're building "informed buyer" expertise. The goal is knowing enough to ask the right questions, spot impossible claims, and translate between technical possibilities and business requirements. This requires deeper understanding than typical "intro to AI" content, but different depth than an implementer needs.
