The Rise of AI Evals
Two years ago, nobody talked about "evals." Today, they're the most critical skill for AI product builders. Hamel Husain and Shreya Shankar have transformed this mysterious subject into essential knowledge, teaching over 2,000 professionals across 500 companies, including teams at OpenAI and Anthropic.
What Are Evals?
Systematic Measurement
Evals are data analytics for your AI application - a systematic way to measure and improve performance beyond "vibe checks."
Beyond Unit Tests
While similar to unit tests, evals cover the full spectrum: monitoring, error analysis, and quality metrics for open-ended AI tasks.
Confidence in Iteration
Create feedback signals to iterate with confidence, knowing you won't break existing functionality while improving your AI product.
Real-World Example: Property Management AI
Nurture Boss helps property managers with inbound leads, customer service, and booking appointments across chat, text, and voice channels. When users ask about apartment availability, the AI should provide helpful information - not just say "we don't have that, have a nice day."
This real estate assistant demonstrates typical AI application complexity: tool calls for booking appointments, RAG retrieval for customer data, and multi-channel interactions requiring sophisticated evaluation approaches.
Step 1: Error Analysis Through Data
1. Look at Your Traces
Start by examining detailed logs of user interactions. Don't guess what's wrong - look at actual conversations and system responses (a minimal annotation sketch follows this list).
2. Write Quick Notes
Capture the first thing you see that's wrong. Be specific: "should have handed off to human," not just "janky."
3. The Benevolent Dictator
Assign one domain expert to do this analysis. Don't get bogged down in committee decisions - move fast with trusted judgment.
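As a concrete illustration of this step, here is a minimal annotation sketch in Python. It assumes traces have already been exported to a `traces.jsonl` file with an `id` and a `messages` list; those field names, and the `open_code` note field, are illustrative rather than a prescribed schema.

```python
import json

def load_traces(path: str) -> list[dict]:
    """Load exported traces, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def annotate(traces: list[dict]) -> list[dict]:
    """Show each trace to the reviewer and capture a short open-coded note."""
    notes = []
    for i, trace in enumerate(traces, start=1):
        print(f"\n--- Trace {i}/{len(traces)} ---")
        for turn in trace.get("messages", []):
            print(f"{turn['role']}: {turn['content']}")
        note = input("First thing that looks wrong (blank if none): ").strip()
        if note:
            notes.append({"trace_id": trace.get("id", i), "open_code": note})
    return notes

if __name__ == "__main__":
    notes = annotate(load_traces("traces.jsonl")[:100])  # review ~100 traces in one pass
    with open("open_codes.jsonl", "w") as f:
        for n in notes:
            f.write(json.dumps(n) + "\n")
```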
Step 2: From Chaos to Categories
1. Open Coding
Raw notes from 100+ traces: "conversation flow janky," "offered virtual tour," "didn't confirm transfer"
2. AI Categorization
Use LLMs to group similar issues into "axial codes" - actionable failure categories like "Human Handoff Issues"
3. Priority Matrix
Count occurrences to identify the biggest problems: 17 conversational flow issues, 8 handoff problems, 5 formatting errors
This process transforms overwhelming data into clear priorities. The most powerful analytical technique is basic counting: it reveals which problems matter most to your product's success.
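Here is a minimal sketch of the categorization-and-counting step. The keyword-based `to_axial_code` mapping is a hypothetical stand-in for the LLM grouping described above; in practice you would prompt a model (or do a quick manual pass), then count exactly as shown.

```python
import json
from collections import Counter

# Hypothetical mapping from open codes to axial codes; in practice this step
# is done with an LLM prompt or a quick manual pass over the notes.
AXIAL_KEYWORDS = {
    "handoff": "Human Handoff Issues",
    "transfer": "Human Handoff Issues",
    "flow": "Conversational Flow Issues",
    "janky": "Conversational Flow Issues",
    "format": "Formatting Errors",
}

def to_axial_code(open_code: str) -> str:
    text = open_code.lower()
    for keyword, category in AXIAL_KEYWORDS.items():
        if keyword in text:
            return category
    return "Uncategorized"

with open("open_codes.jsonl") as f:
    notes = [json.loads(line) for line in f]

counts = Counter(to_axial_code(n["open_code"]) for n in notes)

# Basic counting is the priority matrix: the biggest buckets come first.
for category, count in counts.most_common():
    print(f"{count:3d}  {category}")
```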
Step 3: Building LLM-as-Judge Evaluators
Binary Decisions Only
Create pass/fail evaluators, not 1-5 scales. "Should this conversation have been handed off to a human? True or False." Clear criteria prevent confusion and enable action.
Specific Failure Modes
Each evaluator targets one specific problem: explicit human requests, policy violations, sensitive issues, or tool unavailability requiring human intervention.
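As one illustration, here is a minimal sketch of a binary judge targeting the human-handoff failure mode. It assumes the OpenAI Python SDK; the prompt wording, model choice, and `judge_handoff` name are illustrative, not the course's exact evaluator.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any LLM client works

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a property-management assistant.

Question: Should this conversation have been handed off to a human?
Answer True if the user explicitly asked for a human, raised a policy or
sensitive issue, or needed a tool the assistant does not have. Otherwise
answer False.

Conversation:
{conversation}

Answer with exactly one word: True or False."""

def judge_handoff(conversation: str) -> bool:
    """Binary pass/fail judgment for one specific failure mode."""
    response = client.chat.completions.create(
        model="gpt-4o",  # model choice is illustrative
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(conversation=conversation)}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer.startswith("true")
```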

Never skip validation: Test your LLM judge against human judgment using confusion matrices. Agreement percentage alone can be misleading if errors are rare.
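Here is a minimal sketch of that validation step, comparing judge labels against human labels on a small placeholder set. It also shows why agreement alone misleads when real failures are rare: report per-class rates as well.

```python
def confusion_matrix(human: list[bool], judge: list[bool]) -> dict[str, int]:
    """Compare LLM-judge labels against human labels (True = should hand off)."""
    cells = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["TP"] += 1
        elif not h and j:
            cells["FP"] += 1
        elif h and not j:
            cells["FN"] += 1
        else:
            cells["TN"] += 1
    return cells

# Placeholder labels: 2 real failures out of 20 traces.
human = [True, True] + [False] * 18
judge = [True, False] + [False] * 18  # judge misses one of the two real failures

m = confusion_matrix(human, judge)
agreement = (m["TP"] + m["TN"]) / len(human)
recall = m["TP"] / (m["TP"] + m["FN"])        # true positive rate
specificity = m["TN"] / (m["TN"] + m["FP"])   # true negative rate

print(f"Agreement:   {agreement:.0%}")   # 95% looks great...
print(f"Recall:      {recall:.0%}")      # ...but only 50% of real failures are caught
print(f"Specificity: {specificity:.0%}")
```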
The Eval Ecosystem: More Than Tests
Error Analysis
Manual review of traces to discover failure patterns and unexpected user behaviors
Code-Based Evals
Simple checks for format, length, and required fields - cheaper than LLM judges when possible (see the sketch after this list)
LLM Judges
AI evaluators for complex, subjective quality measures that require human-like judgment
Production Monitoring
Real-time evaluation of live user interactions, not just pre-deployment testing
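As an illustration of the code-based checks above, here is a minimal sketch; the field names, length limit, and markdown heuristic are assumptions for a text/SMS channel, not fixed requirements.

```python
import re

REQUIRED_FIELDS = ["unit_type", "availability_date", "next_step"]  # illustrative
MAX_CHARS = 600  # illustrative limit for an SMS-friendly reply

def code_based_eval(reply: str, extracted: dict) -> dict[str, bool]:
    """Cheap deterministic checks; run these before any LLM judge."""
    return {
        "within_length": len(reply) <= MAX_CHARS,
        "no_markdown": not re.search(r"[*_#`]", reply),  # plain text for SMS
        "has_required_fields": all(extracted.get(f) for f in REQUIRED_FIELDS),
    }

checks = code_based_eval(
    reply="We have a 2-bedroom available on June 1. Want to book a tour?",
    extracted={"unit_type": "2BR", "availability_date": "2025-06-01", "next_step": "tour"},
)
print(checks)  # {'within_length': True, 'no_markdown': True, 'has_required_fields': True}
```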
Debunking Common Misconceptions
Myth: "Just Buy a Tool"
No tool can automatically eval your specific product without domain expertise. Generic hallucination scores don't capture your unique failure modes.
Myth: "Skip the Data"
Looking at individual traces is the highest ROI activity. You'll always discover unexpected problems that no upfront planning could predict.
Myth: "Evals vs AB Tests"
AB tests ARE evals - they're complementary tools. Ground your experiments in actual error analysis, not hypothetical improvements.
Getting Started: Your First Week
1. Days 1-2: Data Collection
Set up trace logging and export 100+ user interactions. Choose your benevolent dictator - the domain expert who understands quality.
2. Days 3-4: Error Analysis
Review traces, write open codes, and use AI to categorize them into axial codes. Count occurrences to identify your top 3-5 failure modes.
3. Days 5-7: Build Evaluators
Create 4-7 targeted evaluators for your biggest problems. Validate LLM judges against human judgment before deployment.
After this initial investment, expect to spend just 30 minutes a week maintaining and expanding your eval suite. The ROI is immediate and compounds over time.
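To make the weekly routine concrete, here is a minimal sketch of a suite runner; the evaluator names and trace fields are placeholders tied to the handoff examples above, not a prescribed setup.

```python
from collections import Counter

# Registry of targeted evaluators: each takes a trace dict and returns pass/fail.
# The evaluator names are illustrative; real ones come from your error analysis.
EVALUATORS = {
    "handoff_when_needed": lambda t: not t.get("needed_handoff") or t.get("handed_off"),
    "confirmed_transfer": lambda t: not t.get("handed_off") or t.get("transfer_confirmed"),
    "reply_under_600_chars": lambda t: len(t.get("reply", "")) <= 600,
}

def run_suite(traces: list[dict]) -> Counter:
    """Run every evaluator over every trace and tally failures per evaluator."""
    failures = Counter()
    for trace in traces:
        for name, check in EVALUATORS.items():
            if not check(trace):
                failures[name] += 1
    return failures

sample_traces = [
    {"needed_handoff": True, "handed_off": False, "reply": "We don't have that."},
    {"needed_handoff": False, "handed_off": False, "reply": "A 2BR opens June 1."},
]
for name, count in run_suite(sample_traces).most_common():
    print(f"{name}: {count} failing trace(s)")
```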
The Future of AI Product Development
2,000+ Students Trained: PMs and engineers across 500 companies, including major AI labs
4-7 Typical Eval Count: most products need only a handful of targeted evaluators
30 min Weekly Maintenance: time needed after initial setup to keep improving your product
Evals aren't just about catching bugs - they're the systematic approach to building better AI products. Companies doing this well have a sharp competitive advantage through superior product quality and user experience.