Traditional software monitoring tracks CPU utilization, memory consumption, request rates, and error counts. These metrics tell you whether your service is running and whether it is handling load. They do not tell you whether your AI system is producing outputs that are accurate, appropriate, and valuable to users.
LLM observability is fundamentally harder because output quality is not a binary signal. A response that looks reasonable to an uncritical reader might be confidently wrong in ways that are difficult to detect. A slow response might still be correct and useful. A fast response might be confidently incorrect. Cost spikes might reflect legitimate burst traffic or an unexpected prompt injection attack. You need metrics that map to actual system behavior and user outcomes, not just infrastructure health.
This is a domain where teams either do too little or build the wrong things. Doing too little means you discover quality problems through user complaints or, worse, through embarrassing public failures. Building the wrong things means you have expensive dashboards that nobody looks at because they measure things that do not matter for your users while missing the things that do. The path to effective observability requires understanding what you are actually trying to detect and why.
The Three Observability Axes
Hallucination detection is the hardest problem in AI observability. Hallucinations are confident, plausible-sounding statements that are factually incorrect. They are not random errors with clear signals: they are smooth, confident, and wrong in ways that often pass human review on first reading, which is exactly what makes them dangerous. By the time you realize the model hallucinated, the output may have already influenced a decision or been shared with a customer.
What makes detection tractable is that hallucinations are not random. The model hallucinates more frequently on certain types of questions. It hallucinates more when it lacks sufficient context. It hallucinates more when asked about entities that are not well-represented in its training data. Understanding these patterns helps you know where to focus detection effort.
The practical approaches to hallucination detection involve consistency checking and ground truth comparison. Ask the same question multiple times and flag responses that contradict themselves. Compare outputs against authoritative data sources when those sources are accessible and the question type permits. Use uncertainty signals from the model when available, though these signals are imperfect and models are often overconfident.
One approach that works for factual queries with known answers: maintain a ground truth evaluation dataset for questions where correct answers can be determined. Periodically run the evaluation set through your production system and compare outputs against known answers. This catches degradation over time but does not catch new types of errors that are not represented in your evaluation set.
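A minimal sketch of that loop, assuming a hypothetical `call_production_system` function and a naive substring match for grading; real grading is usually fuzzier and often needs a human or judge model:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str
    expected_answer: str  # known-correct answer for this question

def call_production_system(question: str) -> str:
    """Placeholder for your real production call (hypothetical)."""
    raise NotImplementedError

def run_ground_truth_eval(cases: list[EvalCase]) -> float:
    """Run the eval set through production and return the pass rate."""
    passed = 0
    for case in cases:
        output = call_production_system(case.question)
        # Naive check: the correct answer appears somewhere in the output.
        # Real systems usually need fuzzier grading or a judge model.
        if case.expected_answer.lower() in output.lower():
            passed += 1
    return passed / len(cases)

# Run this on a schedule and alert when the pass rate drops below the
# baseline you established when the eval set was created.
```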
The evaluation set is only as good as the imagination that went into creating it. It cannot anticipate every way a model might be wrong. It can only catch errors that you specifically test for. New error types will not be caught until they appear in production and someone notices.
Another approach: cross-model consistency checking. Ask the same question to two different models and flag responses that contradict each other. This is not foolproof because both models might hallucinate in the same way, especially for obscure factual questions where neither model has reliable knowledge. But it catches errors that single-model evaluation would miss, particularly when one model has stronger knowledge in a specific domain than the other.
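One way to sketch this, assuming hypothetical `ask_model` provider calls and a judge model to decide whether two answers contradict each other; for narrow factual domains a simpler diff of dates, numbers, or entities can stand in for the judge:

```python
def ask_model(model_name: str, question: str) -> str:
    """Placeholder for a provider call (hypothetical)."""
    raise NotImplementedError

def answers_contradict(question: str, answer_a: str, answer_b: str) -> bool:
    """Ask a judge model whether two answers contradict each other."""
    verdict = ask_model(
        "judge-model",
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
        "Do these answers contradict each other? Reply YES or NO.",
    )
    return verdict.strip().upper().startswith("YES")

def cross_model_check(question: str) -> dict:
    a = ask_model("model-a", question)
    b = ask_model("model-b", question)
    return {
        "question": question,
        "flagged": answers_contradict(question, a, b),
        "answers": (a, b),
    }
```

The judge adds cost and has its own error rate, so this check is usually run on a sample of traffic or on the query types you already know are risky.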
No approach catches all hallucinations. The goal is to reduce the rate and catch the most consequential errors before they reach users. If 1% of your responses contain significant hallucinations and you catch half of them before they reach users, you have improved the user experience substantially. You will not achieve perfect detection, so focus on maximizing impact reduction rather than perfect accuracy.
Latency monitoring needs more nuance than reporting p50 and p99 response times. LLM latency varies with context length, model size, and provider-side load in ways that simple aggregations obscure. A response that takes 2 seconds for a short prompt is very different from a 2-second response that includes a 50-page document in the context. Treating them the same in your monitoring tells you nothing useful.
The key insight is that latency should be monitored against context length. Set thresholds that account for both the base latency and the per-token latency. If your median latency at 1,000 tokens is 400 milliseconds and suddenly you see 1,200 milliseconds at the same token count, that is a regression worth investigating. If latency increases proportionally with context length as expected, that might be normal behavior, not a problem.
Track latency percentiles by context length bucket. A p99 latency of 3 seconds for prompts under 1,000 tokens might be acceptable. A p99 latency of 8 seconds for prompts over 10,000 tokens might also be acceptable. But a p99 latency of 5 seconds for prompts under 1,000 tokens is definitely not acceptable and warrants investigation.
Alert on absolute latency violations that affect user experience. If your SLA is 3 seconds for 95% of requests, alert when responses exceed 3 seconds. But also alert on relative regressions: if median latency doubles without a corresponding change in your input distribution, something changed in your provider or your implementation that warrants investigation.
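A sketch of both ideas together, bucketing requests by prompt token count, computing per-bucket percentiles, and flagging buckets whose median doubled against a stored baseline; the bucket edges and the 2x ratio are assumptions to tune to your own traffic:

```python
import statistics
from collections import defaultdict

# Example bucket edges in prompt tokens; tune to your traffic.
BUCKETS = [(0, 1_000), (1_000, 10_000), (10_000, float("inf"))]

def bucket_for(tokens: int) -> tuple:
    for low, high in BUCKETS:
        if low <= tokens < high:
            return (low, high)
    return BUCKETS[-1]

def latency_by_bucket(requests: list[dict]) -> dict:
    """requests: [{"prompt_tokens": int, "latency_ms": float}, ...]"""
    grouped = defaultdict(list)
    for r in requests:
        grouped[bucket_for(r["prompt_tokens"])].append(r["latency_ms"])
    return {
        bucket: {
            "p50": statistics.median(vals),
            "p99": statistics.quantiles(vals, n=100)[98],
        }
        for bucket, vals in grouped.items()
        if len(vals) >= 100  # skip buckets too sparse for a stable p99
    }

def regression_alerts(current: dict, baseline: dict, ratio: float = 2.0) -> list:
    """Flag buckets whose median latency doubled versus the baseline."""
    return [
        bucket for bucket, stats in current.items()
        if bucket in baseline and stats["p50"] > ratio * baseline[bucket]["p50"]
    ]
```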
Cost observability requires attribution work that most teams underestimate at first. You need to know which users, which features, and which prompts are driving your AI costs. Token counts per request are the starting point, but raw counts do not tell you whether the spend is producing value.
Build dashboards that break down consumption by the dimensions that map to your business structure. If you have an AI assistant with multiple features like summarization, classification, and generation, track costs per feature. If you serve multiple business units, track costs per business unit. This attribution is essential for understanding which AI investments are paying off and which are not.
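A sketch of the aggregation, assuming each request log carries token counts and tags like `feature` or `business_unit`; the per-token prices are placeholders, not any provider's actual rates:

```python
from collections import defaultdict

# Placeholder per-token prices; substitute your provider's actual rates.
PRICE_PER_INPUT_TOKEN = 0.000003
PRICE_PER_OUTPUT_TOKEN = 0.000015

def request_cost(record: dict) -> float:
    return (record["input_tokens"] * PRICE_PER_INPUT_TOKEN
            + record["output_tokens"] * PRICE_PER_OUTPUT_TOKEN)

def cost_by(records: list[dict], dimension: str) -> dict:
    """Aggregate spend by any tag you log per request:
    'feature', 'business_unit', 'user_id', 'prompt_version', ...
    """
    totals = defaultdict(float)
    for r in records:
        totals[r.get(dimension, "untagged")] += request_cost(r)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))
```

The tagging is the hard part: if requests are not labeled with feature and user at the point where the call is made, no amount of dashboarding recovers the attribution later.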
A SaaS company we advised was seeing a 40% month-over-month cost increase in their AI assistant. Initial analysis looked at total token counts, which had indeed increased. But when they broke it down by feature, they found that a new prompt variant for the summarization feature was using 3x more tokens than expected because a developer had accidentally included the entire conversation history in each summarization request instead of just the recent messages that needed summarizing.
Fixing that one bug reduced costs by 30% without affecting output quality. The cost attribution dashboard made the problem visible. Without it, they might have spent months investigating provider pricing changes rather than finding the actual cause.
Cost observability also means detecting anomalies that might indicate abuse or bugs. If a single user account is suddenly responsible for 10% of your AI costs, that might be legitimate heavy usage or it might be a bug that is looping the AI into unintended behavior or an attack that is trying to exhaust your resources.
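A simple concentration check along those lines, assuming you already have per-user spend for the window; the 10% threshold is an assumption to adjust for your user base:

```python
def cost_concentration_alerts(cost_by_user: dict, share_threshold: float = 0.10) -> list:
    """Flag any single account responsible for more than share_threshold
    of total spend in the window. Legitimate heavy users will trip this
    too, so treat hits as investigation leads, not incidents."""
    total = sum(cost_by_user.values())
    if total == 0:
        return []
    return [
        (user, cost, cost / total)
        for user, cost in cost_by_user.items()
        if cost / total > share_threshold
    ]
```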
What to Track
Beyond the three core axes, several secondary metrics provide important signal about system health and emerging problems.
Context utilization tells you how much of your retrieved context the model actually uses. If you are passing 10,000 tokens of retrieved context and the model’s response ignores 80% of it, you are paying for retrieval that is not helping. Low utilization might mean your retrieval is noisy or your prompting is ineffective. High utilization with poor outputs might mean the model is overwhelmed by too much irrelevant context.
Measure utilization by comparing what you provided in the context against what appears in the output. This is imperfect because a response that does not mention a piece of context might still have been informed by it. But unusual patterns are detectable. If you consistently provide context about topic X and responses never reference topic X, something in the pipeline is broken.
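A crude lexical-overlap sketch of that comparison; real systems may use embeddings or citation markers instead, and the term-overlap threshold here is an assumption:

```python
import re

def _content_terms(text: str) -> set:
    # Crude tokenization; ignore very short words to cut noise.
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}

def context_utilization(context_chunks: list[str], response: str) -> float:
    """Fraction of retrieved chunks that share vocabulary with the response.

    Lexical overlap is a rough proxy: a chunk can inform a response
    without being quoted, and a quoted chunk can still be ignored
    logically. Trends matter more than the absolute number."""
    response_terms = _content_terms(response)
    if not context_chunks:
        return 0.0
    used = sum(
        1 for chunk in context_chunks
        if len(_content_terms(chunk) & response_terms) >= 3
    )
    return used / len(context_chunks)
```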
Context utilization is a diagnostic metric. Low utilization tells you something is wrong but does not tell you exactly what. High utilization with poor outputs tells you the model might be overwhelmed. Use context utilization in combination with output quality metrics to understand what is happening.
Response length distribution catches prompt injection attempts and unusual model behavior. If your median response is 200 tokens and suddenly you see responses of 2,000 tokens, something changed. That change might be a user submitting an unusually complex query, or it might be an injection attempt that is trying to exhaust your context window or inflate your costs by generating excessive output.
Monitor the distribution, not just averages. A median of 200 with a 95th percentile of 2,000 is very different from a median of 800 with a 95th percentile of 1,000. The first distribution has a long tail that warrants investigation. The second distribution is tighter and more predictable.
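A minimal sketch of tracking the distribution and flagging a shift, assuming you log output token counts per request; the 2x ratios are placeholder thresholds:

```python
import statistics

def length_distribution(output_token_counts: list[int]) -> dict:
    q = statistics.quantiles(output_token_counts, n=20)  # cut points at 5% steps
    return {"median": statistics.median(output_token_counts),
            "p95": q[18]}  # last cut point = 95th percentile

def length_shift_alert(current: dict, baseline: dict,
                       median_ratio: float = 2.0, p95_ratio: float = 2.0) -> bool:
    """Flag when the middle of the distribution or its tail moves sharply."""
    return (current["median"] > median_ratio * baseline["median"]
            or current["p95"] > p95_ratio * baseline["p95"])
```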
When you see a shift in the response length distribution, investigate. It might be legitimate new usage patterns. It might be an attack. It might be a bug that is causing the model to generate verbose responses. Only investigation will tell.
Consistency scoring requires longitudinal tracking across a session or longer. Flag responses that contradict earlier responses from the same conversation or from previous conversations with the same user. Self-contradiction is a useful hallucination signal.
If the model says “Acme Corporation is our largest customer” in one response and “We have no customers named Acme” in another response in the same session, at least one of those statements is wrong. This is a hallucination or a context confusion problem that warrants investigation.
Track consistency at the session level and at the account level. A model that contradicts itself within a single conversation is concerning. A model that contradicts statements it made six months ago might be reflecting model update behavior rather than a current hallucination. Knowing which is which matters for how you respond.
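One way to sketch session-level tracking, assuming a hypothetical `ask_judge` call and an in-memory store standing in for a real database keyed by account and session:

```python
from collections import defaultdict

# In-memory store for illustration; production would use a database
# with timestamps and the model version that produced each statement.
_assertions: dict = defaultdict(list)

def ask_judge(prompt: str) -> str:
    """Placeholder for a judge-model call (hypothetical)."""
    raise NotImplementedError

def check_consistency(account_id: str, session_id: str, response: str) -> list:
    """Return prior statements the new response appears to contradict."""
    contradictions = []
    for prior in _assertions[(account_id, session_id)]:
        verdict = ask_judge(
            f"Earlier statement: {prior}\nNew statement: {response}\n"
            "Do these contradict each other? Reply YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            contradictions.append(prior)
    _assertions[(account_id, session_id)].append(response)
    return contradictions
```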
Building Feedback Loops
Automated metrics catch regressions but they do not tell you if users are actually satisfied with the service. User feedback, even simple thumbs up or thumbs down signals, provides ground truth that automated metrics cannot replace. However, users who are satisfied are less likely to provide feedback than users who are frustrated, so raw feedback is biased toward negative signals.
Escalation rate tracks how often users abandon the AI-assisted flow and seek human help instead. Escalation is expensive, and it is the metric that ties AI quality to business outcomes. If your AI-assisted support flow has a 20% escalation rate and you reduce it to 10%, you have not just improved an AI metric. You have cut the escalation load on human agents in half.
Escalation rate is a lagging indicator. It tells you what happened after users decided the AI was not helping. It does not tell you what specifically went wrong. Combine it with other signals to understand the root cause of escalations.
Task completion tracks whether users accomplished what they came to do. A low escalation rate with low task completion means users are not getting value; they are just not complaining. They are abandoning the task entirely rather than escalating to a human. This is worse than escalating because it means your AI is failing silently.
Build feedback mechanisms that capture both satisfaction and task completion. A user who gives a thumbs down because the AI was slow has a different problem than a user who gives a thumbs down because the answer was wrong. The first is a latency problem. The second is a quality problem. Different problems require different fixes.
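A minimal data model for that kind of feedback event, separating the rating, the reason, and task completion; the field and enum names are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from enum import Enum

class FeedbackReason(Enum):
    WRONG_ANSWER = "wrong_answer"   # quality problem
    TOO_SLOW = "too_slow"           # latency problem
    OTHER = "other"

@dataclass
class FeedbackEvent:
    request_id: str
    feature: str                 # which product surface generated the response
    thumbs_up: bool | None       # None when no explicit rating was given
    reason: FeedbackReason | None
    task_completed: bool | None  # inferred or asked explicitly
    escalated: bool              # user handed off to a human
```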
Common Failure Modes
Understanding common observability failure modes helps you avoid investing in the wrong things.
Vanity metrics. Tracking total requests or total tokens per day tells you scale, not quality. A system that serves millions of confident hallucinations is worse than one that serves thousands of accurate responses, but total request counts do not tell you which you have.
Make sure your dashboards answer questions that affect user outcomes, not just questions that make the system look busy or important. If you are not using a metric to make decisions, stop tracking it.
Silent degradation. Models change under the hood without notice. A provider updates a model and behavior shifts slightly in ways that are not announced. Without regular quality audits, you do not notice. Your users notice but you do not have the signal to connect their complaints to the model change.
Set up a gold-standard evaluation set that you run through your production pipeline on a regular cadence. Track accuracy over time. When you see a drop, correlate it with provider announcements about model updates.
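A small sketch of tracking those runs over time and flagging a drop, assuming the pass rate comes from something like the `run_ground_truth_eval` sketch earlier; the 5-point drop threshold is an assumption:

```python
from datetime import datetime, timezone

history: list[dict] = []  # persist this in practice

def record_eval_run(pass_rate: float, model_version: str) -> None:
    history.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "pass_rate": pass_rate,
        "model_version": model_version,  # as reported by the provider, if exposed
    })

def degraded(min_drop: float = 0.05) -> bool:
    """True when the latest run is materially below the previous one.
    When this fires, check whether model_version changed at the same time."""
    if len(history) < 2:
        return False
    return history[-2]["pass_rate"] - history[-1]["pass_rate"] >= min_drop
```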
Feedback that is not representative. Users who are frustrated are more likely to leave feedback than users who are satisfied. Raw feedback signals are biased toward negative sentiment. Weight by usage patterns and look for systematic signals rather than individual data points.
If 80% of users who use a specific feature give negative feedback, that is a signal even if your overall feedback is 60% positive. Segment your feedback by feature to find problems that are hidden in aggregate numbers.
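A small aggregation along those lines, assuming feedback records shaped like the `FeedbackEvent` sketch above:

```python
from collections import defaultdict

def negative_rate_by_feature(events: list) -> dict:
    """Share of explicit ratings that were negative, per feature."""
    rated = defaultdict(lambda: [0, 0])  # feature -> [negative, total]
    for e in events:
        if e.thumbs_up is None:
            continue
        rated[e.feature][1] += 1
        if not e.thumbs_up:
            rated[e.feature][0] += 1
    return {f: neg / total for f, (neg, total) in rated.items() if total}
```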
Over-alerting. If every potential hallucination triggers an alert, your on-call team will learn to ignore alerts. Define alert thresholds that reflect actual business impact, not theoretical concerns.
A hallucination in a low-stakes context is not worth a page at 3am. A hallucination that could affect a medical or financial decision might be. Calibrate alerts to consequence.
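A sketch of consequence-based routing; the context names, routes, and confidence cutoff are illustrative placeholders, not recommended values:

```python
# Severity by the surface the response reaches, not by detector confidence alone.
SEVERITY_BY_CONTEXT = {
    "internal_drafting": "log_only",        # reviewed by a human before use
    "customer_support": "ticket_next_day",  # user-facing but lower stakes
    "financial_guidance": "page_oncall",    # consequential decisions
}

def route_hallucination_alert(context: str, detector_confidence: float) -> str:
    route = SEVERITY_BY_CONTEXT.get(context, "ticket_next_day")
    # Even in high-stakes contexts, only page on high-confidence detections.
    if route == "page_oncall" and detector_confidence < 0.8:
        return "ticket_next_day"
    return route
```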
Decision Rules
Monitor hallucination signals through consistency checking and ground truth comparison. Accept that you will not catch all of them. Focus on catching the high-impact ones first. Build evaluation sets for your most consequential query types and run them against production samples regularly.
Track latency against context length, not just raw response time. Alert on regressions that exceed expected variance given input size. A latency increase that is proportional to context growth is probably expected. A latency increase at constant context length is a problem.
Attribute costs to features and users. You cannot control what you cannot measure. Build dashboards that break down consumption by the dimensions that map to your business structure.
Build feedback loops that capture user satisfaction and task completion. Escalation rate is the metric that ties AI quality to business outcomes. A low escalation rate with low task completion is worse than a high escalation rate because it means users are failing silently.
The underlying principle: AI observability is not traditional infrastructure monitoring. The failure modes are different and so are the signals that detect them. Build metrics that map to output quality and user outcomes, not just system health. The goal is to know when your AI is producing wrong answers, not just when it is running slowly. Wrong answers that look confident are more damaging than slow responses that users know are slow.