The Metrics Problem
Your model has 95% accuracy. Great! But users are complaining, adoption is flat, and leadership is asking hard questions.
Model metrics aren't product metrics. Here's how to measure what matters.
The Metrics Hierarchy
Level 1: Model Metrics (Technical)
These tell you if your model works:
- Accuracy, precision, recall, F1
- Latency (p50, p95, p99)
- Token efficiency
- Error rates
Important but insufficient. A model can ace benchmarks and still fail users.
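For example, the latency percentiles above fall straight out of logged request durations. A minimal sketch, assuming you already collect per-request latencies in milliseconds (the `latencies_ms` argument is illustrative):

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Summarize request latency the way it shows up on a model dashboard."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(latencies_ms, n=100)
    return {
        "p50": statistics.median(latencies_ms),
        "p95": q[94],
        "p99": q[98],
    }
```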
Level 2: Task Metrics (Functional)
These tell you if users complete their goals:
- Task completion rate
- Time to completion
- Error recovery rate
- Retry rate
```python
class TaskMetrics:
    def track_task(self, task_id: str, user_id: str):
        return TaskTracker(
            task_id=task_id,
            user_id=user_id,
            started_at=now()
        )

    def complete_task(self, tracker: TaskTracker, success: bool):
        tracker.completed_at = now()
        tracker.success = success
        self.emit_metric("task_completion", {
            "duration_seconds": tracker.duration,
            "success": success,
            "retry_count": tracker.retry_count
        })
```
Level 3: User Metrics (Experience)
These tell you if users are satisfied:
- User satisfaction (CSAT, NPS)
- Regeneration rate (the user rejected the response and asked again)
- Feedback sentiment
- Feature adoption
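Regeneration rate is the easiest of these to compute if you log feedback events. A sketch, assuming interactions are stored as dicts with a `feedback_type` field (the same field the tracking code below emits; names are illustrative):

```python
def regeneration_rate(interactions: list[dict]) -> float:
    """Share of interactions where the user rejected the response and retried."""
    if not interactions:
        return 0.0
    regenerated = sum(
        1 for i in interactions if i.get("feedback_type") == "regenerate"
    )
    return regenerated / len(interactions)
```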
Level 4: Business Metrics (Value)
These tell you if you're delivering ROI:
- Time saved per user
- Cost per task
- Revenue impact
- Support ticket reduction
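These usually combine telemetry with estimates. A sketch of two of them, assuming you know spend for the period and have a per-task time-savings estimate from user research (class and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class BusinessSnapshot:
    total_spend_usd: float         # model + infrastructure cost for the period
    successful_tasks: int          # tasks completed successfully in the period
    minutes_saved_per_task: float  # estimate from user research, not telemetry

def cost_per_successful_task(s: BusinessSnapshot) -> float:
    return s.total_spend_usd / max(s.successful_tasks, 1)

def estimated_hours_saved(s: BusinessSnapshot) -> float:
    return s.successful_tasks * s.minutes_saved_per_task / 60
```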
Implementing Measurement
The Feedback Loop
```python
from typing import Optional


class AIProductMetrics:
    async def track_interaction(
        self,
        user_id: str,
        query: str,
        response: str,
        context: dict
    ) -> str:
        interaction_id = generate_id()

        # Track immediately available metrics
        await self.emit({
            "interaction_id": interaction_id,
            "timestamp": now(),
            "user_id": user_id,
            "query_length": len(query),
            "response_length": len(response),
            "latency_ms": context.get("latency_ms"),
            "model_confidence": context.get("confidence"),
        })

        return interaction_id

    async def track_feedback(
        self,
        interaction_id: str,
        feedback_type: str,  # "thumbs_up", "thumbs_down", "regenerate"
        feedback_text: Optional[str] = None
    ):
        await self.emit({
            "interaction_id": interaction_id,
            "feedback_type": feedback_type,
            "feedback_text": feedback_text,
            "feedback_timestamp": now()
        })
```
Implicit Feedback
Users don't always click feedback buttons. Track behavior:
```python
class ImplicitFeedback:
    def analyze_session(self, session: Session) -> FeedbackSignals:
        return FeedbackSignals(
            # Positive signals
            task_completed=session.reached_completion_state,
            copied_response=session.had_copy_event,
            short_session=session.duration < EXPECTED_DURATION,
            # Negative signals
            regenerated=session.regeneration_count > 0,
            abandoned=not session.reached_completion_state,
            long_session=session.duration > EXPECTED_DURATION * 2,
            searched_after=session.had_external_search
        )
```
Dashboards That Drive Action
Executive Dashboard
Focus on business impact:
- Weekly active users (trend)
- Estimated time saved
- Cost per successful task
- User satisfaction trend
Product Dashboard
Focus on user experience:
- Task completion rate by type
- Regeneration rate (are responses good?)
- Feature adoption (are users finding value?)
- Error rate and types
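Feeding the completion-rate panel is a simple aggregation. A sketch, assuming task events carry a `task_type` and a `success` flag (hypothetical field names):

```python
from collections import defaultdict

def completion_rate_by_type(tasks: list[dict]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for task in tasks:
        totals[task["task_type"]] += 1
        if task["success"]:
            successes[task["task_type"]] += 1
    return {t: successes[t] / totals[t] for t in totals}
```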
Engineering Dashboard
Focus on system health:
- Latency percentiles
- Error rates by category
- Model performance metrics
- Infrastructure costs
Setting Targets
Baseline First
Before setting targets, establish baselines:
```python
def establish_baseline(metric: str, days: int = 30) -> Baseline:
    data = get_metric_data(metric, days)
    return Baseline(
        mean=data.mean(),
        median=data.median(),
        p95=data.percentile(95),
        std_dev=data.std(),
        trend=calculate_trend(data)
    )
```
Realistic Improvement Targets
- Task completion: Aim for 5-10% improvement per quarter
- Satisfaction: NPS improvement of 5-10 points is significant
- Latency: 20-30% reduction is achievable with optimization
- Cost: 10-20% reduction through efficiency improvements
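To turn those ranges into concrete numbers, apply them to the baseline you just measured. A minimal sketch, working from the baseline mean (the helper is illustrative):

```python
def quarterly_target(
    baseline_mean: float,
    relative_improvement: float,
    higher_is_better: bool = True,
) -> float:
    """Translate a relative improvement goal into an absolute metric target."""
    delta = baseline_mean * relative_improvement
    return baseline_mean + delta if higher_is_better else baseline_mean - delta

# Task completion at 72% today, aiming for the low end of 5-10%:
# quarterly_target(0.72, 0.05) -> 0.756
```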
A/B Testing AI Features
```python
import hashlib


class AIExperiment:
    def __init__(
        self,
        name: str,
        variants: dict[str, AIConfig],
        metrics: list[str],
        min_sample_size: int = 1000
    ):
        self.name = name
        self.variants = variants
        self.metrics = metrics
        self.min_sample_size = min_sample_size

    def assign_variant(self, user_id: str) -> str:
        # Consistent assignment: use a stable digest, since Python's built-in
        # hash() is salted per process
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        variant_idx = int(digest, 16) % len(self.variants)
        return list(self.variants.keys())[variant_idx]

    def analyze(self) -> ExperimentResults:
        results = {}
        for metric in self.metrics:
            control_data = self.get_metric_data("control", metric)
            treatment_data = self.get_metric_data("treatment", metric)
            results[metric] = StatisticalTest(
                control_data, treatment_data
            ).run()
        return ExperimentResults(results)
```
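`StatisticalTest` is left abstract above. For a binary metric like task completion, a reasonable stand-in is a two-proportion z-test; this sketch uses only the standard library, and the result class is illustrative:

```python
from dataclasses import dataclass
from math import erfc, sqrt

@dataclass
class ProportionTestResult:
    lift: float     # treatment rate minus control rate
    z_score: float
    p_value: float  # two-sided

def two_proportion_ztest(
    control_successes: int, control_total: int,
    treatment_successes: int, treatment_total: int,
) -> ProportionTestResult:
    """Compare completion (or any binary) rates between two variants."""
    p_c = control_successes / control_total
    p_t = treatment_successes / treatment_total
    pooled = (control_successes + treatment_successes) / (control_total + treatment_total)
    se = sqrt(pooled * (1 - pooled) * (1 / control_total + 1 / treatment_total))
    z = (p_t - p_c) / se
    # Two-sided p-value from the standard normal survival function
    return ProportionTestResult(lift=p_t - p_c, z_score=z, p_value=erfc(abs(z) / sqrt(2)))
```

Whatever test you use, don't read the results until every variant has cleared `min_sample_size`.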
Conclusion
Measuring AI products requires thinking beyond model accuracy. Track task completion, user satisfaction, and business impact. Build feedback loops. Set realistic targets. And always remember: the goal isn't a better model—it's a better product.