LLM Evaluation: Building Frameworks That Catch Real Problems

Amit Patel


ML Platform Lead

January 5, 2025 · 9 min read
LLM · Evaluation · Testing · Quality Assurance

Why Evaluation Matters More Than You Think

You've fine-tuned your model, optimized your prompts, and everything looks great in testing. Then you deploy—and support tickets start flooding in.

The problem? Your evaluation didn't test what matters. Here's how to build evaluation frameworks that catch real issues.

The Evaluation Pyramid

Level 1: Unit Tests for LLMs

Yes, you can unit test LLM outputs:

import json

def test_sentiment_extraction():
    response = llm.generate("Extract sentiment: 'I love this product!'")
    assert "positive" in response.lower()
    assert "negative" not in response.lower()

def test_json_output():
    response = llm.generate("Return user data as JSON: name=John, age=30")
    data = json.loads(response)
    assert data["name"] == "John"
    assert data["age"] == 30
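These examples assume a thin `llm` client with a single `generate(prompt) -> str` method. The post doesn't define one, but a minimal sketch, assuming an OpenAI-backed client and a placeholder model name, looks like this:

# Hypothetical adapter so the tests above have an `llm.generate()` to call.
# Backed by the OpenAI chat API purely as an example; any client works.
from openai import OpenAI

class LLMClient:
    def __init__(self, model: str = "gpt-4o-mini"):
        self._client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self._model = model

    def generate(self, prompt: str) -> str:
        completion = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content

llm = LLMClient()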

Level 2: Behavioral Testing

Test specific behaviors across variations:

import pytest

@pytest.mark.parametrize("input,expected_behavior", [
    ("What's 2+2?", "contains_number"),
    ("Summarize in 3 bullets", "has_bullet_points"),
    ("Translate to Spanish", "is_spanish"),
])
def test_llm_behaviors(input, expected_behavior):
    response = llm.generate(input)
    assert BEHAVIOR_CHECKS[expected_behavior](response)
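`BEHAVIOR_CHECKS` isn't defined above. A minimal sketch, assuming each behavior maps to a plain predicate over the response text (the Spanish check is a deliberately naive placeholder):

import re

# Hypothetical check registry: each behavior is a predicate over the raw
# response text. Swap in real detectors as your behaviors get more subtle.
BEHAVIOR_CHECKS = {
    "contains_number": lambda text: bool(re.search(r"\d", text)),
    "has_bullet_points": lambda text: any(
        line.lstrip().startswith(("-", "*", "•")) for line in text.splitlines()
    ),
    # Naive placeholder: look for common Spanish stopwords instead of running
    # a real language detector.
    "is_spanish": lambda text: any(
        word in text.lower().split() for word in ("el", "la", "que", "de", "es")
    ),
}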

Level 3: Golden Dataset Evaluation

Maintain a curated set of inputs with expected outputs:

class GoldenDatasetEvaluator:
    def __init__(self, dataset_path: str):
        self.dataset = load_golden_dataset(dataset_path)

    def evaluate(self, model) -> EvalResults:
        results = []
        for item in self.dataset:
            response = model.generate(item.input)
            score = self.score_response(response, item.expected)
            results.append(EvalResult(item.id, score, response))
        return EvalResults(results)
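The evaluator leaves `EvalResult`, `EvalResults`, and `score_response` open. One possible sketch, assuming a simple exact-match-then-containment scoring rule (in a real system `score_response` would be a method on the evaluator class):

from dataclasses import dataclass

@dataclass
class EvalResult:
    item_id: str
    score: float
    response: str

@dataclass
class EvalResults:
    results: list[EvalResult]

    @property
    def mean_score(self) -> float:
        # Average score across the golden set; 0.0 for an empty run.
        if not self.results:
            return 0.0
        return sum(r.score for r in self.results) / len(self.results)

def score_response(response: str, expected: str) -> float:
    # Crude scoring rule: exact match beats containment beats miss.
    if response.strip() == expected.strip():
        return 1.0
    if expected.strip().lower() in response.lower():
        return 0.5
    return 0.0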

Level 4: LLM-as-Judge

Use a stronger model to evaluate outputs:

JUDGE_PROMPT = """
Rate the following response on a scale of 1-5 for:
- Accuracy: Does it contain factual errors?
- Relevance: Does it address the question?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide ratings and brief justifications.
"""

async def llm_judge(question: str, response: str) -> JudgeResult:
    judgment = await judge_model.generate(
        JUDGE_PROMPT.format(question=question, response=response)
    )
    return parse_judgment(judgment)
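`parse_judgment` is left undefined. If you additionally instruct the judge to answer in JSON (a tweak to the prompt above, not shown), a minimal parser might look like this; `JudgeResult` and the field names are assumptions:

import json
from dataclasses import dataclass

@dataclass
class JudgeResult:
    accuracy: int
    relevance: int
    completeness: int
    justification: str

def parse_judgment(raw: str) -> JudgeResult:
    # Assumes the judge was told to reply with a JSON object such as
    # {"accuracy": 4, "relevance": 5, "completeness": 3, "justification": "..."}.
    data = json.loads(raw)
    return JudgeResult(
        accuracy=int(data["accuracy"]),
        relevance=int(data["relevance"]),
        completeness=int(data["completeness"]),
        justification=data.get("justification", ""),
    )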

Catching Hallucinations

Hallucination detection requires multiple strategies:

1. Self-Consistency Checking

Generate multiple responses and check for agreement:

import asyncio

async def check_consistency(prompt: str, n: int = 5) -> float:
    # Sample the model several times on the same prompt (assumes an async client here).
    responses = await asyncio.gather(*[
        llm.generate(prompt) for _ in range(n)
    ])
    # Check semantic similarity between responses; high agreement suggests the
    # answer is stable rather than a one-off hallucination.
    embeddings = embed_responses(responses)
    similarities = compute_pairwise_similarity(embeddings)
    return similarities.mean()
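`embed_responses` and `compute_pairwise_similarity` could be backed by any sentence-embedding model. A sketch assuming sentence-transformers and cosine similarity:

# Hypothetical helpers for the consistency check above. The model name is an
# assumption; any sentence-embedding model works.
import numpy as np
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_responses(responses: list[str]) -> np.ndarray:
    # One L2-normalised embedding per response.
    return _encoder.encode(responses, normalize_embeddings=True)

def compute_pairwise_similarity(embeddings: np.ndarray) -> np.ndarray:
    # Cosine similarity of every response pair, excluding self-similarity.
    sims = embeddings @ embeddings.T
    mask = ~np.eye(len(embeddings), dtype=bool)
    return sims[mask]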

2. Source Grounding

For RAG systems, verify claims against sources:

def verify_grounding(response: str, sources: list[str]) -> GroundingScore:
    claims = extract_claims(response)
    grounded_claims = 0
    for claim in claims:
        if any(supports_claim(source, claim) for source in sources):
            grounded_claims += 1
    return GroundingScore(
        total_claims=len(claims),
        grounded_claims=grounded_claims,
        # Guard against division by zero: treat a claim-free response as grounded.
        score=grounded_claims / len(claims) if claims else 1.0,
    )
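`GroundingScore`, `extract_claims`, and `supports_claim` aren't defined above. In practice you'd split claims with an LLM or a sentence splitter and check support with an NLI model; the lexical-overlap placeholders below just keep the example runnable:

import re
from dataclasses import dataclass

@dataclass
class GroundingScore:
    total_claims: int
    grounded_claims: int
    score: float

def extract_claims(response: str) -> list[str]:
    # Placeholder: treat each sentence as one claim.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]

def supports_claim(source: str, claim: str, min_overlap: float = 0.6) -> bool:
    # Placeholder: a claim counts as supported if most of its content words
    # appear somewhere in the source text.
    claim_words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    if not claim_words:
        return False
    source_words = set(re.findall(r"[a-z0-9]+", source.lower()))
    return len(claim_words & source_words) / len(claim_words) >= min_overlap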

Continuous Evaluation Pipeline

Evaluation isn't a one-time thing. Build it into your CI/CD (a minimal gating sketch follows the list):

  1. Pre-commit: Run unit tests on prompt changes
  2. PR checks: Run behavioral tests and golden dataset evaluation
  3. Staging: Full evaluation suite with LLM-as-judge
  4. Production: Monitor real-time quality metrics
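One way to wire stages 2 and 3 in: mark the heavier suites with pytest markers and fail the build when the golden-set score drops below a threshold. A minimal sketch, assuming the `mean_score` property from the golden-dataset section and a made-up dataset path and threshold:

import pytest

GOLDEN_SCORE_THRESHOLD = 0.85  # assumed quality gate; tune against historical runs

# Run on pull requests with `pytest -m golden` (register the marker in pytest.ini).
# The full LLM-as-judge suite can carry its own marker and run only against staging.
@pytest.mark.golden
def test_golden_dataset_regression():
    evaluator = GoldenDatasetEvaluator("evals/golden.jsonl")  # hypothetical path
    results = evaluator.evaluate(llm)
    assert results.mean_score >= GOLDEN_SCORE_THRESHOLD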

Metrics That Matter

Track these metrics in production (a lightweight tracking sketch follows the list):

  • Response latency: p50, p95, p99
  • Token efficiency: Output tokens / input tokens
  • User satisfaction: Thumbs up/down, regeneration rate
  • Task completion: Did the user achieve their goal?
  • Hallucination rate: Verified incorrect statements

Conclusion

Good evaluation is the difference between "it works in testing" and "it works in production." Invest in comprehensive evaluation frameworks early—your users will thank you.

