Figure skating judges do not give one score. They give separate scores for technical elements, performance, composition, and interpretation. Each dimension captures something different. A skater can lose on execution but gain on artistic impression. The final score is a weighted combination, not a single judgment. A judge who gives only a single number cannot explain why one skater beat another; the weighting is hidden and the dimensions are collapsed.
AI evaluation works the same way. No single metric tells you whether a model is performing well. Accuracy measures whether the model gets the right answer. Latency measures how fast it responds. Coherence measures whether the output makes sense. Faithfulness measures whether the output matches the input context. Each metric is a dimension; none is sufficient alone. A system with 99% accuracy that takes 10 minutes per query is not better than a 95% accurate system that responds in seconds, unless you have defined what trade-off you actually want.
The Aggregation Problem
When you combine multiple metrics into an overall score, you are making choices about weights. Those choices reflect values. Do you care more about correctness or speed? More about coherence or factual grounding? Different weights produce different rankings of the same models. A composite score that weights latency heavily will rank fast models higher. A composite score that weights accuracy heavily will rank accurate models higher. Neither is wrong; they reflect different priorities.
Using a single composite score without understanding its construction hides these trade-offs. A model that ranks first on your overall metric might be the worst on the one dimension that matters most for your specific application. If you are building a medical diagnosis system, accuracy matters more than latency. A composite score that weights both equally might recommend a model that is mediocre at both.
Consider two models evaluated on three dimensions. Model A scores 90 on accuracy, 70 on latency, 70 on coherence. Model B scores 75 on accuracy, 75 on latency, 75 on coherence. Which is better? It depends entirely on how you weight the dimensions. If accuracy matters most and the others are secondary, Model A wins. If consistency across dimensions matters, Model B wins. Neither answer is objectively correct; they reflect different value judgments embedded in different weighting schemes.
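To make the value judgment concrete, here is a minimal sketch of the two models under two scoring rules. The weights are illustrative, not recommendations: a weighted average with accuracy-heavy weights, and a worst-dimension rule that rewards consistency.

```python
# The two hypothetical models from the text, scored on three dimensions.
model_a = {"accuracy": 90, "latency": 70, "coherence": 70}
model_b = {"accuracy": 75, "latency": 75, "coherence": 75}

def weighted(scores, weights):
    """Weighted average; an accuracy-heavy weighting rewards Model A's strength."""
    return sum(scores[d] * w for d, w in weights.items()) / sum(weights.values())

def worst_dimension(scores):
    """Score a model by its weakest dimension; rewards Model B's consistency."""
    return min(scores.values())

accuracy_heavy = {"accuracy": 0.6, "latency": 0.2, "coherence": 0.2}

print(weighted(model_a, accuracy_heavy))   # about 82 -> Model A wins
print(weighted(model_b, accuracy_heavy))   # 75
print(worst_dimension(model_a))            # 70 -> Model B wins
print(worst_dimension(model_b))            # 75
```

Same four numbers per model, opposite rankings. The ranking flip lives entirely in the scoring rule, which is the point: the rule encodes the values.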
What Gets Measured Gets Optimized
Chasing a single metric is a reliable way to game it. Models trained on BLEU scores learn to produce outputs that score well on BLEU without necessarily being good translations. BLEU measures n-gram overlap with reference translations, so models learn to produce outputs with n-grams that match the reference style, even when the meaning is altered. The metric improved; the translation did not.
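As a toy illustration of how n-gram overlap can reward a meaning-altering output, here is a unigram-precision sketch. This is only one ingredient of BLEU (real BLEU also uses higher-order n-grams, clipping across multiple references, and a brevity penalty), and the sentences are invented:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that appear in the reference,
    with clipped counts -- the simplest ingredient of BLEU."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    matches = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    return matches / sum(cand_counts.values())

reference = "the treaty was not signed by the delegates"
faithful  = "the delegates did not sign the treaty"   # correct meaning
negated   = "the treaty was signed by the delegates"  # meaning inverted

print(unigram_precision(faithful, reference))  # lower: wording differs
print(unigram_precision(negated, reference))   # 1.0: every token matches
```

The output that inverts the meaning scores a perfect 1.0 because every one of its tokens appears in the reference; the faithful paraphrase scores lower. Optimizing this number pushes the model toward surface overlap, not fidelity.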
Models optimized for perplexity do not automatically produce useful outputs. Perplexity measures how well the model predicts held-out text; a low score means the model assigns high probability to that text, which rewards fluency and predictability, not helpfulness.
Pick metrics that are proxies for actual usefulness in your domain. If your users care about factual accuracy, measure factual accuracy, not just fluency. If they care about following instructions, measure instruction following explicitly. This is harder than it sounds. Factual accuracy is expensive to measure; it requires checking claims against authoritative sources or human reviewers. Fluency is easier to measure; it can be approximated with automated metrics. The easy metric often wins over the right metric.
The metrics you do not measure are the ones your system will neglect. If you measure accuracy but not faithfulness, your model may produce accurate but irrelevant answers. If you measure latency but not throughput, your model may be fast for single queries but degrade under load.
Metric Relationships
Metrics interact in ways that single-dimensional analysis misses. Improving one dimension sometimes degrades another. A model optimized for high factual accuracy might produce more conservative responses, reducing creativity. A model optimized for low latency might use faster, less capable inference paths that sacrifice quality. The improvement in one dimension comes at a cost in another.
Understanding these trade-offs requires measuring all dimensions simultaneously over time. A dashboard that shows accuracy trending up and latency trending down tells you more than either metric alone. A dashboard that shows accuracy trending up while latency trends down but faithfulness to source material is degrading tells you even more. The trade-off is visible only when you track all relevant dimensions.
When a metric improvement comes with correlated degradation elsewhere, you need to decide whether the improvement is worth the cost. Sometimes the degradation is acceptable. If accuracy improves by 5% and faithfulness degrades by 2%, the trade-off may be worthwhile. Sometimes the improvement is illusory, achieved by shifting errors to a dimension you are not measuring. The model learns to look better on the metrics you track by becoming worse on the metrics you ignore.
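The bookkeeping behind that decision can be sketched simply: store a metric snapshot per version, and flag any change where one tracked dimension improves while another degrades past a tolerance. The numbers and the 0.02 tolerance below are invented for illustration.

```python
# Hypothetical metric snapshots for two model versions.
snapshots = {
    "v1": {"accuracy": 0.80, "latency_s": 2.0, "faithfulness": 0.90},
    "v2": {"accuracy": 0.85, "latency_s": 1.8, "faithfulness": 0.84},
}

# Direction each metric should move; latency improves when it goes down.
higher_is_better = {"accuracy": True, "latency_s": False, "faithfulness": True}

def trade_offs(before, after, tolerance=0.02):
    """Return (improved, degraded) metric names between two snapshots."""
    improved, degraded = [], []
    for name, up in higher_is_better.items():
        delta = after[name] - before[name]
        if not up:
            delta = -delta  # for latency, a drop is an improvement
        if delta > tolerance:
            improved.append(name)
        elif delta < -tolerance:
            degraded.append(name)
    return improved, degraded

better, worse = trade_offs(snapshots["v1"], snapshots["v2"])
print(better, worse)  # accuracy and latency improved; faithfulness degraded
```

A report that says "accuracy and latency improved, faithfulness degraded" forces the trade-off decision into the open instead of letting the composite hide it.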
The Ground Truth Problem
Many metrics require ground truth: labeled examples where you know the correct answer. Accuracy is straightforward to measure when you have labeled test cases. Factual consistency is harder. Does the output agree with the source material? You need someone to verify agreement, or you need a separate system to check claims against trusted references.
Human evaluation is the gold standard but is slow and expensive. You cannot run human evaluation on every output or every model version. Automated metrics provide faster feedback but may not correlate perfectly with human judgment. BLEU scores correlate with human judgments of translation quality in some contexts but not others. ROUGE scores correlate with some summarization judgments but miss others.
The metric you cannot measure is the one you will not improve. If factual accuracy is hard to measure and you only measure fluency, you will get fluent outputs that may not be factual. Teams often optimize for measurable proxies at the expense of unmeasured qualities that actually matter. The result is a system that looks good on the dashboard but disappoints users.
Constructing ground truth for complex tasks is itself a hard problem. For a sentiment classifier, ground truth might be the consensus of human labelers. For a code generator, ground truth might be whether the code passes tests. For a question answerer, ground truth might be whether the answer is correct. Each domain requires different ground truth construction, and each is expensive.
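For the labeler-consensus case, a minimal sketch of the mechanics looks like this. The labels and the two-of-three agreement threshold are illustrative; real pipelines also track annotator agreement statistics and adjudication:

```python
from collections import Counter

# Hypothetical raw labels from three human annotators for a sentiment task.
raw_labels = {
    "ex1": ["positive", "positive", "negative"],
    "ex2": ["negative", "negative", "negative"],
    "ex3": ["positive", "negative", "neutral"],  # no majority
}

def consensus(labels, min_agreement=2):
    """Majority-vote label, or None if agreement is below the threshold."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

ground_truth = {k: consensus(v) for k, v in raw_labels.items()}
print(ground_truth)  # ex3 maps to None and goes back for adjudication
```

Items without consensus are not ground truth yet; sending them back for adjudication is part of the cost the paragraph above describes.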
Benchmark Limitations
Public benchmarks like MMLU, HumanEval, or GLUE are useful for comparing models in the abstract but may not reflect your specific use case. A model that performs well on medical board exam questions may perform poorly on legal contract analysis. The benchmark measures what it measures, not necessarily what you care about. Medical board exams test medical knowledge; contract analysis requires legal reasoning and domain knowledge.
Using benchmarks without understanding their relationship to your use case is a category error. You might select a model that performs well on benchmarks but poorly on your actual task. This happens because benchmarks are constructed from publicly available data that may not match your domain, and because benchmark tasks are often simplified versions of real-world tasks.
The right approach is to construct evaluation sets from your actual domain and use cases. If your system summarizes legal documents, your evaluation set should contain legal documents and human-evaluated summaries. If your system answers customer questions, your evaluation set should contain real customer questions and human-evaluated answers. This is more expensive than downloading a benchmark but much more informative. The evaluation set you build is the only one that actually tells you how your system will perform on your task.
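The shape of such an evaluation set can be small and concrete. A minimal sketch, with invented field names and content; the point is pairing real domain inputs with human-vetted references and slice tags:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    input_text: str   # a real input from your domain, not benchmark data
    reference: str    # human-written or human-approved target output
    tags: tuple       # slice labels for reporting, e.g. ("legal", "hard")

cases = [
    EvalCase(
        input_text="Summarize the indemnification clause in section 7.",
        reference="The supplier indemnifies the buyer against third-party claims.",
        tags=("legal", "summarization"),
    ),
]
print(len(cases))
```

Tags let you report metrics per slice (document type, difficulty), which is how you discover that an aggregate number hides a weak slice.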
Tracking Over Time
Metrics that are measured once tell you about the current state. Metrics that are tracked over time tell you about trends. A model that scores 80% accuracy today may be stable, improving, or degrading. Only longitudinal tracking reveals which.
Establishing baselines is the first step. Before you can judge whether a change improves the system, you need to know what the system looked like before the change. Baselines should be measured on representative data, not cherry-picked examples. A baseline built from easy examples will make every change look like an improvement.
Regression testing catches degradation. When you change the model, the prompt, the retrieval system, or any component, you should re-run the full evaluation suite to confirm that changes do not degrade metrics you care about. Automated regression tests that run on every change prevent degradation from accumulating silently.
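A minimal regression gate along these lines, with illustrative numbers and a single global tolerance (a real suite would set per-metric thresholds and load the baseline from storage):

```python
# Compare a fresh evaluation run against a stored baseline and block the
# change if any metric drops by more than its allowed slack.
baseline = {"accuracy": 0.85, "faithfulness": 0.90, "coherence": 0.88}
current  = {"accuracy": 0.86, "faithfulness": 0.87, "coherence": 0.88}
allowed_drop = 0.01

def regressions(baseline, current, allowed_drop):
    """Names of metrics that dropped more than allowed_drop below baseline."""
    return [m for m, b in baseline.items() if b - current[m] > allowed_drop]

failed = regressions(baseline, current, allowed_drop)
if failed:
    print(f"blocking change; regressed metrics: {failed}")  # faithfulness here
```

Wired into CI, a check like this turns "we think nothing got worse" into a test that fails loudly when something did.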
Decision Rules
Use multi-dimensional evaluation when:
- Your application has multiple quality criteria that are not reducible to one
- You need to track trade-offs explicitly over time
- Different stakeholders care about different dimensions
- The cost of getting any dimension wrong is high
Do not collapse to a single metric when:
- The components measure fundamentally different things
- The weighting is arbitrary or hidden
- You are using a metric as a proxy for something harder to measure
Define metrics that are:
- Direct measures of what users care about, not proxies for those things
- Actionable (you can change behavior based on the measurement)
- Measurable at the frequency you need to make decisions
- Comprehensive enough that degradation cannot hide in a dimension you forgot to track
Construct evaluation sets from:
- Your actual domain and use cases
- Representative samples of real inputs
- Human-generated ground truth where automated metrics fall short
- Hard cases, not just obvious ones
A scorecard with one number tells you less than a scorecard with five numbers you understand. Know what each dimension means and why it is weighted the way it is. The metrics you choose define what you optimize for; choose them deliberately.
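One way to keep the weighting visible is to store the rationale next to each weight, so the composite can always be explained. The scores, weights, and rationales below are illustrative:

```python
# A scorecard that keeps every dimension, its weight, and the reason for
# that weight visible alongside the composite.
scorecard = [
    # (dimension, score, weight, why this weight)
    ("accuracy",     0.91, 0.40, "wrong answers are costly in this domain"),
    ("faithfulness", 0.88, 0.30, "answers must stay grounded in sources"),
    ("coherence",    0.93, 0.15, "readability matters, but less than truth"),
    ("latency",      0.70, 0.15, "interactive use, but not safety-critical"),
]

composite = sum(score * weight for _, score, weight, _ in scorecard)
for dim, score, weight, why in scorecard:
    print(f"{dim:>12}: {score:.2f} (weight {weight:.2f}) - {why}")
print(f"{'composite':>12}: {composite:.3f}")
```

Anyone reading this report can see why the composite is what it is, and can argue with the weights rather than with an opaque number.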