Why Evaluation Matters More Than You Think
You've fine-tuned your model, optimized your prompts, and everything looks great in testing. Then you deploy—and support tickets start flooding in.
The problem? Your evaluation didn't test what matters. Here's how to build evaluation frameworks that catch real issues.
The Evaluation Pyramid
Level 1: Unit Tests for LLMs
Yes, you can unit test LLM outputs:
```python
import json

# Both tests assume an `llm` client that exposes a synchronous .generate() method.
def test_sentiment_extraction():
    response = llm.generate("Extract sentiment: 'I love this product!'")
    assert "positive" in response.lower()
    assert "negative" not in response.lower()

def test_json_output():
    response = llm.generate("Return user data as JSON: name=John, age=30")
    data = json.loads(response)
    assert data["name"] == "John"
    assert data["age"] == 30
```
Level 2: Behavioral Testing
Test specific behaviors across variations:
```python
import pytest

@pytest.mark.parametrize("input,expected_behavior", [
    ("What's 2+2?", "contains_number"),
    ("Summarize in 3 bullets", "has_bullet_points"),
    ("Translate to Spanish", "is_spanish"),
])
def test_llm_behaviors(input, expected_behavior):
    response = llm.generate(input)
    assert BEHAVIOR_CHECKS[expected_behavior](response)
```
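`BEHAVIOR_CHECKS` is assumed above but never defined; a minimal sketch of that registry (the regexes and word lists are illustrative only):

```python
import re

# Hypothetical predicate registry backing test_llm_behaviors.
# Each check takes the raw response string and returns True/False.
BEHAVIOR_CHECKS = {
    "contains_number": lambda r: bool(re.search(r"\d", r)),
    "has_bullet_points": lambda r: any(
        line.lstrip().startswith(("-", "*", "•")) for line in r.splitlines()
    ),
    # Crude language check; real code would use a language-detection library.
    "is_spanish": lambda r: any(w in r.lower().split() for w in ("el", "la", "es", "de")),
}
```

Keeping the checks as plain callables makes it cheap to add a new behavior: write one predicate, add one parametrize row.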
Level 3: Golden Dataset Evaluation
Maintain a curated set of inputs with expected outputs:
```python
class GoldenDatasetEvaluator:
    def __init__(self, dataset_path: str):
        self.dataset = load_golden_dataset(dataset_path)

    def evaluate(self, model) -> EvalResults:
        results = []
        for item in self.dataset:
            response = model.generate(item.input)
            score = self.score_response(response, item.expected)
            results.append(EvalResult(item.id, score, response))
        return EvalResults(results)
```
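The evaluator leans on helpers that aren't shown (`load_golden_dataset`, `EvalResult`, `EvalResults`; `score_response` is also left to the reader, where exact match or fuzzy match both work). A minimal sketch of the data types, assuming the golden set is stored as JSONL with `id`, `input`, and `expected` fields:

```python
import json
from dataclasses import dataclass

@dataclass
class GoldenItem:
    id: str
    input: str
    expected: str

@dataclass
class EvalResult:
    item_id: str
    score: float
    response: str

@dataclass
class EvalResults:
    results: list[EvalResult]

    @property
    def mean_score(self) -> float:
        return sum(r.score for r in self.results) / len(self.results)

def load_golden_dataset(path: str) -> list[GoldenItem]:
    # One JSON object per line: {"id": ..., "input": ..., "expected": ...}
    with open(path) as f:
        return [GoldenItem(**json.loads(line)) for line in f if line.strip()]
```

With something like this in place, evaluating a candidate model yields a single regression number you can gate releases on.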
Level 4: LLM-as-Judge
Use a stronger model to evaluate outputs:
```python
JUDGE_PROMPT = """
Rate the following response on a scale of 1-5 for:
- Accuracy: Does it contain factual errors?
- Relevance: Does it address the question?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide ratings and brief justifications.
"""

async def llm_judge(question: str, response: str) -> JudgeResult:
    judgment = await judge_model.generate(
        JUDGE_PROMPT.format(question=question, response=response)
    )
    return parse_judgment(judgment)
```
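`judge_model`, `JudgeResult`, and `parse_judgment` are assumed above. One simple approach is to scrape the numeric ratings out of the judge's free-text answer; a deliberately basic sketch (a real system would more likely ask the judge to return JSON):

```python
import re
from dataclasses import dataclass

@dataclass
class JudgeResult:
    accuracy: int | None
    relevance: int | None
    completeness: int | None
    raw_judgment: str

def parse_judgment(judgment: str) -> JudgeResult:
    def rating(dimension: str) -> int | None:
        # Match lines like "Accuracy: 4" or "Accuracy - 4/5".
        match = re.search(rf"{dimension}\s*[:\-]\s*(\d)", judgment, re.IGNORECASE)
        return int(match.group(1)) if match else None

    return JudgeResult(
        accuracy=rating("Accuracy"),
        relevance=rating("Relevance"),
        completeness=rating("Completeness"),
        raw_judgment=judgment,
    )
```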
Catching Hallucinations
Hallucination detection requires multiple strategies:
1. Self-Consistency Checking
Generate multiple responses and check for agreement:
```python
import asyncio

async def check_consistency(prompt: str, n: int = 5) -> float:
    # Sample the same prompt several times
    responses = await asyncio.gather(*[
        llm.generate(prompt) for _ in range(n)
    ])
    # Check semantic similarity between responses
    embeddings = embed_responses(responses)
    similarities = compute_pairwise_similarity(embeddings)
    return similarities.mean()
```
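`embed_responses` and `compute_pairwise_similarity` are placeholders. One possible implementation, assuming sentence-transformers is available as the embedding backend:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_responses(responses: list[str]) -> np.ndarray:
    # Returns an (n, d) matrix of sentence embeddings.
    return _encoder.encode(responses)

def compute_pairwise_similarity(embeddings: np.ndarray) -> np.ndarray:
    # Cosine similarity for every distinct pair of responses; the upper
    # triangle only, so self-similarity doesn't inflate the average.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    return sims[np.triu_indices(len(sims), k=1)]
```

A mean pairwise similarity well below your task's baseline is a signal worth flagging, but the threshold has to be tuned per task and per model.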
2. Source Grounding
For RAG systems, verify claims against sources:
```python
def verify_grounding(response: str, sources: list[str]) -> GroundingScore:
    claims = extract_claims(response)
    grounded_claims = 0
    for claim in claims:
        if any(supports_claim(source, claim) for source in sources):
            grounded_claims += 1
    return GroundingScore(
        total_claims=len(claims),
        grounded_claims=grounded_claims,
        # Treat a claim-free response as trivially grounded to avoid
        # dividing by zero.
        score=grounded_claims / len(claims) if claims else 1.0,
    )
```
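`extract_claims` and `supports_claim` carry most of the weight here; production systems typically use an entailment (NLI) model for the latter. The sketch below substitutes a sentence split and a token-overlap heuristic purely to make the interface concrete:

```python
import re

def extract_claims(response: str) -> list[str]:
    # Naive claim extraction: treat each sentence as one claim.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]

def supports_claim(source: str, claim: str, threshold: float = 0.5) -> bool:
    # Token-overlap heuristic: what fraction of the claim's content words
    # appear in the source? A real system would use an entailment model.
    claim_tokens = {t for t in re.findall(r"\w+", claim.lower()) if len(t) > 3}
    if not claim_tokens:
        return True
    source_tokens = set(re.findall(r"\w+", source.lower()))
    overlap = len(claim_tokens & source_tokens) / len(claim_tokens)
    return overlap >= threshold
```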
Continuous Evaluation Pipeline
Evaluation isn't a one-time activity. Build it into your CI/CD pipeline (see the sketch after this list):
- Pre-commit: Run unit tests on prompt changes
- PR checks: Run behavioral tests and golden dataset evaluation
- Staging: Full evaluation suite with LLM-as-judge
- Production: Monitor real-time quality metrics
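One lightweight way to wire the stages together, assuming pytest is the test runner: tag each evaluation level with a marker and let each pipeline stage select what it can afford to run. The marker names and thresholds below are illustrative, and the golden test reuses the evaluator sketched earlier:

```python
import pytest

# Illustrative marker scheme; register the markers under
# [tool.pytest.ini_options] in pyproject.toml. Each CI stage selects a subset:
#   pre-commit : pytest -m unit
#   PR checks  : pytest -m "unit or behavioral or golden"
#   staging    : pytest -m "unit or behavioral or golden or judge"

@pytest.mark.unit
def test_sentiment_is_positive():
    response = llm.generate("Extract sentiment: 'I love this product!'")
    assert "positive" in response.lower()

@pytest.mark.golden
def test_no_regression_on_golden_set():
    results = GoldenDatasetEvaluator("golden/v1.jsonl").evaluate(llm)
    assert results.mean_score >= 0.85  # illustrative threshold
```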
Metrics That Matter
Track these metrics in production (an aggregation sketch follows the list):
- Response latency: p50, p95, p99
- Token efficiency: Output tokens / input tokens
- User satisfaction: Thumbs up/down, regeneration rate
- Task completion: Did the user achieve their goal?
- Hallucination rate: Verified incorrect statements
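A minimal sketch of how these could be aggregated from logged request events; the `RequestLog` fields are illustrative rather than tied to any particular observability stack:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RequestLog:
    latency_ms: float
    input_tokens: int
    output_tokens: int
    thumbs_up: bool | None       # None when the user gave no feedback
    regenerated: bool
    hallucination_flagged: bool  # e.g. set by the grounding check above

def production_metrics(logs: list[RequestLog]) -> dict[str, float]:
    latencies = np.array([log.latency_ms for log in logs])
    rated = [log for log in logs if log.thumbs_up is not None]
    total_in = sum(log.input_tokens for log in logs)
    total_out = sum(log.output_tokens for log in logs)
    return {
        "latency_p50_ms": float(np.percentile(latencies, 50)),
        "latency_p95_ms": float(np.percentile(latencies, 95)),
        "latency_p99_ms": float(np.percentile(latencies, 99)),
        "token_efficiency": total_out / max(total_in, 1),
        "satisfaction_rate": sum(log.thumbs_up for log in rated) / max(len(rated), 1),
        "regeneration_rate": sum(log.regenerated for log in logs) / len(logs),
        "hallucination_rate": sum(log.hallucination_flagged for log in logs) / len(logs),
    }
```

Computing these over a rolling window and alerting on deltas, rather than absolute values, tends to catch regressions introduced by prompt or model changes.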
Conclusion
Good evaluation is the difference between "it works in testing" and "it works in production." Invest in comprehensive evaluation frameworks early—your users will thank you.