Why Evaluation Matters More Than You Think
You've fine-tuned your model, optimized your prompts, and everything looks great in testing. Then you deploy—and support tickets start flooding in.
The problem? Your evaluation didn't test what matters. Here's how to build evaluation frameworks that catch real issues.
The Evaluation Pyramid
Level 1: Unit Tests for LLMs
Yes, you can unit test LLM outputs:
```python
import json

# Both tests assume an `llm` client that exposes a synchronous .generate() method.
def test_sentiment_extraction():
    response = llm.generate("Extract sentiment: 'I love this product!'")
    assert "positive" in response.lower()
    assert "negative" not in response.lower()

def test_json_output():
    response = llm.generate("Return user data as JSON: name=John, age=30")
    data = json.loads(response)
    assert data["name"] == "John"
    assert data["age"] == 30
```
Level 2: Behavioral Testing
Test specific behaviors across variations:
```python
import pytest

@pytest.mark.parametrize("input,expected_behavior", [
    ("What's 2+2?", "contains_number"),
    ("Summarize in 3 bullets", "has_bullet_points"),
    ("Translate to Spanish", "is_spanish"),
])
def test_llm_behaviors(input, expected_behavior):
    response = llm.generate(input)
    assert BEHAVIOR_CHECKS[expected_behavior](response)
```
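`BEHAVIOR_CHECKS` is assumed above but never defined; a minimal sketch of that registry (the regexes and word lists are illustrative only):

```python
import re

# Hypothetical predicate registry backing test_llm_behaviors.
# Each check takes the raw response string and returns True/False.
BEHAVIOR_CHECKS = {
    "contains_number": lambda r: bool(re.search(r"\d", r)),
    "has_bullet_points": lambda r: any(
        line.lstrip().startswith(("-", "*", "•")) for line in r.splitlines()
    ),
    # Crude language check; real code would use a language-detection library.
    "is_spanish": lambda r: any(w in r.lower().split() for w in ("el", "la", "es", "de")),
}
```

Keeping the checks as plain callables makes it cheap to add a new behavior: write one predicate, add one parametrize row.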
Level 3: Golden Dataset Evaluation
Maintain a curated set of inputs with expected outputs:
```python
class GoldenDatasetEvaluator:
    def __init__(self, dataset_path: str):
        self.dataset = load_golden_dataset(dataset_path)

    def evaluate(self, model) -> EvalResults:
        results = []
        for item in self.dataset:
            response = model.generate(item.input)
            score = self.score_response(response, item.expected)
            results.append(EvalResult(item.id, score, response))
        return EvalResults(results)
```
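The evaluator leans on helpers that aren't shown (`load_golden_dataset`, `EvalResult`, `EvalResults`; `score_response` is also left to the reader, where exact match or fuzzy match both work). A minimal sketch of the data types, assuming the golden set is stored as JSONL with `id`, `input`, and `expected` fields:

```python
import json
from dataclasses import dataclass

@dataclass
class GoldenItem:
    id: str
    input: str
    expected: str

@dataclass
class EvalResult:
    item_id: str
    score: float
    response: str

@dataclass
class EvalResults:
    results: list[EvalResult]

    @property
    def mean_score(self) -> float:
        return sum(r.score for r in self.results) / len(self.results)

def load_golden_dataset(path: str) -> list[GoldenItem]:
    # One JSON object per line: {"id": ..., "input": ..., "expected": ...}
    with open(path) as f:
        return [GoldenItem(**json.loads(line)) for line in f if line.strip()]
```

With something like this in place, evaluating a candidate model yields a single regression number you can gate releases on.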
Level 4: LLM-as-Judge
Use a stronger model to evaluate outputs:
```python
JUDGE_PROMPT = """
Rate the following response on a scale of 1-5 for:
- Accuracy: Does it contain factual errors?
- Relevance: Does it address the question?
- Completeness: Does it cover all aspects?

Question: {question}
Response: {response}

Provide ratings and brief justifications.
"""

async def llm_judge(question: str, response: str) -> JudgeResult:
    judgment = await judge_model.generate(
        JUDGE_PROMPT.format(question=question, response=response)
    )
    return parse_judgment(judgment)
```
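`judge_model`, `JudgeResult`, and `parse_judgment` are assumed above. One simple approach is to scrape the numeric ratings out of the judge's free-text answer; a deliberately basic sketch (a real system would more likely ask the judge to return JSON):

```python
import re
from dataclasses import dataclass

@dataclass
class JudgeResult:
    accuracy: int | None
    relevance: int | None
    completeness: int | None
    raw_judgment: str

def parse_judgment(judgment: str) -> JudgeResult:
    def rating(dimension: str) -> int | None:
        # Match lines like "Accuracy: 4" or "Accuracy - 4/5".
        match = re.search(rf"{dimension}\s*[:\-]\s*(\d)", judgment, re.IGNORECASE)
        return int(match.group(1)) if match else None

    return JudgeResult(
        accuracy=rating("Accuracy"),
        relevance=rating("Relevance"),
        completeness=rating("Completeness"),
        raw_judgment=judgment,
    )
```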
Catching Hallucinations
Hallucination detection requires multiple strategies:
1. Self-Consistency Checking
Generate multiple responses and check for agreement:
```python
import asyncio

async def check_consistency(prompt: str, n: int = 5) -> float:
    # Sample the same prompt several times
    responses = await asyncio.gather(*[
        llm.generate(prompt) for _ in range(n)
    ])
    # Check semantic similarity between responses
    embeddings = embed_responses(responses)
    similarities = compute_pairwise_similarity(embeddings)
    return similarities.mean()
```
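`embed_responses` and `compute_pairwise_similarity` are placeholders. One possible implementation, assuming sentence-transformers is available as the embedding backend:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_responses(responses: list[str]) -> np.ndarray:
    # Returns an (n, d) matrix of sentence embeddings.
    return _encoder.encode(responses)

def compute_pairwise_similarity(embeddings: np.ndarray) -> np.ndarray:
    # Cosine similarity for every distinct pair of responses; the upper
    # triangle only, so self-similarity doesn't inflate the average.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    return sims[np.triu_indices(len(sims), k=1)]
```

A mean pairwise similarity well below your task's baseline is a signal worth flagging, but the threshold has to be tuned per task and per model.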
2. Source Grounding
For RAG systems, verify claims against sources:
```python
def verify_grounding(response: str, sources: list[str]) -> GroundingScore:
    claims = extract_claims(response)
    grounded_claims = 0
    for claim in claims:
        if any(supports_claim(source, claim) for source in sources):
            grounded_claims += 1
    return GroundingScore(
        total_claims=len(claims),
        grounded_claims=grounded_claims,
        # Treat a claim-free response as trivially grounded to avoid
        # dividing by zero.
        score=grounded_claims / len(claims) if claims else 1.0,
    )
```
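`extract_claims` and `supports_claim` carry most of the weight here; production systems typically use an entailment (NLI) model for the latter. The sketch below substitutes a sentence split and a token-overlap heuristic purely to make the interface concrete:

```python
import re

def extract_claims(response: str) -> list[str]:
    # Naive claim extraction: treat each sentence as one claim.
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    return [s for s in sentences if s]

def supports_claim(source: str, claim: str, threshold: float = 0.5) -> bool:
    # Token-overlap heuristic: what fraction of the claim's content words
    # appear in the source? A real system would use an entailment model.
    claim_tokens = {t for t in re.findall(r"\w+", claim.lower()) if len(t) > 3}
    if not claim_tokens:
        return True
    source_tokens = set(re.findall(r"\w+", source.lower()))
    overlap = len(claim_tokens & source_tokens) / len(claim_tokens)
    return overlap >= threshold
```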
Continuous Evaluation Pipeline
Evaluation isn't a one-time activity. Build it into your CI/CD pipeline (see the sketch after this list):
- Pre-commit: Run unit tests on prompt changes
- PR checks: Run behavioral tests and golden dataset evaluation
- Staging: Full evaluation suite with LLM-as-judge
- Production: Monitor real-time quality metrics
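One lightweight way to wire the stages together, assuming pytest is the test runner: tag each evaluation level with a marker and let each pipeline stage select what it can afford to run. The marker names and thresholds below are illustrative, and the golden test reuses the evaluator sketched earlier:

```python
import pytest

# Illustrative marker scheme; register the markers under
# [tool.pytest.ini_options] in pyproject.toml. Each CI stage selects a subset:
#   pre-commit : pytest -m unit
#   PR checks  : pytest -m "unit or behavioral or golden"
#   staging    : pytest -m "unit or behavioral or golden or judge"

@pytest.mark.unit
def test_sentiment_is_positive():
    response = llm.generate("Extract sentiment: 'I love this product!'")
    assert "positive" in response.lower()

@pytest.mark.golden
def test_no_regression_on_golden_set():
    results = GoldenDatasetEvaluator("golden/v1.jsonl").evaluate(llm)
    assert results.mean_score >= 0.85  # illustrative threshold
```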
Metrics That Matter
Track these metrics in production (an aggregation sketch follows the list):
- Response latency: p50, p95, p99
- Token efficiency: Output tokens / input tokens
- User satisfaction: Thumbs up/down, regeneration rate
- Task completion: Did the user achieve their goal?
- Hallucination rate: Verified incorrect statements
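A minimal sketch of how these could be aggregated from logged request events; the `RequestLog` fields are illustrative rather than tied to any particular observability stack:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RequestLog:
    latency_ms: float
    input_tokens: int
    output_tokens: int
    thumbs_up: bool | None       # None when the user gave no feedback
    regenerated: bool
    hallucination_flagged: bool  # e.g. set by the grounding check above

def production_metrics(logs: list[RequestLog]) -> dict[str, float]:
    latencies = np.array([log.latency_ms for log in logs])
    rated = [log for log in logs if log.thumbs_up is not None]
    total_in = sum(log.input_tokens for log in logs)
    total_out = sum(log.output_tokens for log in logs)
    return {
        "latency_p50_ms": float(np.percentile(latencies, 50)),
        "latency_p95_ms": float(np.percentile(latencies, 95)),
        "latency_p99_ms": float(np.percentile(latencies, 99)),
        "token_efficiency": total_out / max(total_in, 1),
        "satisfaction_rate": sum(log.thumbs_up for log in rated) / max(len(rated), 1),
        "regeneration_rate": sum(log.regenerated for log in logs) / len(logs),
        "hallucination_rate": sum(log.hallucination_flagged for log in logs) / len(logs),
    }
```

Computing these over a rolling window and alerting on deltas, rather than absolute values, tends to catch regressions introduced by prompt or model changes.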
Conclusion
Good evaluation is the difference between "it works in testing" and "it works in production." Invest in comprehensive evaluation frameworks early—your users will thank you.