Fast food chains track drive-thru latency obsessively. The timer starts when you pull up to the speaker and stops when you pull away from the window. The industry benchmark is around 90 seconds. Why? Because studies showed that latency beyond that threshold causes meaningful drops in customer satisfaction and repeat business. The timer is not a measure of how long food preparation takes. It is a measure of how long the customer waits. The metric exists because it predicts customer behavior.
AI system latency works the same way. The metric that matters is not model inference time in isolation. It is end-to-end response time as perceived by the user. Token generation speed, network overhead, retrieval time, and output parsing all collapse into the single number that determines whether the system feels fast or slow to the person waiting. Users do not experience your model’s tokens-per-second benchmark. They experience the time from pressing enter to seeing a complete answer.
What Drives Perceived Latency
For simple queries, time to first token matters more than total generation time. Users experience a streaming response as faster than a delayed block response, even if the total time is identical. The start of visible output provides feedback that progress is happening. A system that generates 500 tokens in 10 seconds but starts showing tokens at 1 second feels faster than a system that generates the same 500 tokens in 8 seconds but delivers them all at once after 8 seconds of silence.
This is why streaming token delivery is a common optimization. Even if total generation time is unchanged, showing tokens as they are generated makes the system feel faster. The user sees the response starting, sees it progressing, and perceives less wait time than a system that delivers the full response after a long silence. The perception of speed matters as much as actual speed.
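The distinction between time to first token and total time can be measured directly. The following is a minimal sketch, not a production harness: `fake_token_stream` is a hypothetical stand-in for whatever streaming client your system uses, and `measure_stream` is an illustrative helper name.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_token_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model client."""
    for i in range(n_tokens):
        time.sleep(delay_s)  # simulated per-token generation time
        yield f"tok{i} "


def measure_stream(stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (time_to_first_token_s, total_time_s, token_count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        if first is None:
            # The moment the user would see the first visible output.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    # An empty stream never produced a first token, so fall back to total.
    return (first if first is not None else total, total, count)


ttft, total, count = measure_stream(fake_token_stream(20, 0.005))
print(f"first token after {ttft:.3f}s, complete after {total:.3f}s, {count} tokens")
```

Recording both numbers per request makes it possible to tell whether a change improved the wait before feedback, the wait for the full answer, or neither.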
For complex queries, users expect longer waits and tolerate them. The danger is unexpected latency: a simple question that takes five seconds feels worse than a complex question that takes ten seconds. The expectation set by query complexity can shape perceived latency more than the measured number does.
Consistency matters too. A system that responds instantly for most queries but occasionally stalls feels slower than a system with consistent moderate latency. Variance is perceived as unreliability, and unreliability erodes trust even when average latency is acceptable. Users remember the slow responses more than the fast ones.
The Cost of Speed
Optimizing for lower latency often means accepting higher cost or lower quality. Smaller models are faster but less capable. Aggressive caching improves latency for repeated queries but consumes memory and complicates invalidation. Pre-computation of frequent paths trades storage for speed. There is no free lunch; latency optimization involves trade-offs.
Caching is particularly effective for frequently asked questions. If 40% of queries are repeats of the same questions, caching those responses eliminates inference latency entirely for that fraction of traffic. The cache hit rate determines how much latency benefit caching provides. A cache that serves 40% of traffic cuts average inference latency across all traffic by roughly 40%, assuming cache lookups are negligible next to inference; the other 60% of queries see no improvement.
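The arithmetic behind the cache benefit can be sketched in a few lines. This assumes every request pays a small cache lookup and only misses pay for inference; the function name and the example figures (5ms lookup, 2000ms inference) are illustrative assumptions, not measurements.

```python
def expected_latency_ms(hit_rate: float, cache_lookup_ms: float, inference_ms: float) -> float:
    """Average latency when every request checks the cache and misses run inference."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    # Hits pay only the lookup; misses pay the lookup plus full inference.
    return cache_lookup_ms + (1.0 - hit_rate) * inference_ms


baseline_ms = expected_latency_ms(0.0, 0.0, 2000.0)  # no cache at all
cached_ms = expected_latency_ms(0.4, 5.0, 2000.0)    # 40% hit rate (assumed)
reduction = 1.0 - cached_ms / baseline_ms

print(f"average without cache: {baseline_ms:.0f} ms")
print(f"average with cache:    {cached_ms:.0f} ms")
print(f"reduction in average:  {reduction:.1%}")
```

The reduction in the average tracks the hit rate closely only while the lookup cost stays small relative to inference. It says nothing about the latency of a miss, which is unchanged or slightly worse.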
Pre-computation works when you can predict queries in advance. A question-answering system might pre-compute answers for known questions during off-peak hours. When the question arrives, the pre-computed answer is returned immediately. This shifts computation from request time to idle time but requires accurate prediction of what questions will be asked. If prediction is wrong, pre-computation is wasted.
Quality trade-offs are often hidden. A model optimized for low latency might take shortcuts that produce less thorough answers. A retrieval system optimized for speed might sacrifice recall. The latency improvement may come with an accuracy cost that is not immediately visible. Users may not notice the degraded quality until it causes real problems.
Where Time Goes
Token generation is usually the largest component of AI latency. The speed is measured in tokens per second, and longer responses take proportionally longer. A 500-token response takes roughly twice as long as a 250-token response at the same generation speed. If your users tend to ask questions that require long answers, token generation dominates.
Different models have different generation speeds. A small model might generate 100 tokens per second. A frontier model might generate 50 tokens per second. For short responses, the difference is minor. For long responses, it compounds. A 1000-token response takes 10 seconds on the fast small model but 20 seconds on the slow frontier model.
Retrieval time is often overlooked. If the system must retrieve documents before generating a response, retrieval latency adds directly to end-to-end latency. Retrieval that takes 200ms before generation starts adds 200ms regardless of how fast generation is. If retrieval is slow, optimizing generation speed does not help.
Network overhead and serialization add fixed costs that are easy to ignore but can dominate for short queries. A system that generates responses in 100ms but adds 150ms of network and processing overhead has worse perceived latency than a system that generates in 200ms with zero overhead. The overhead is invisible in benchmarks that measure only model inference time.
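The components described in this section can be combined into a rough end-to-end estimate. This sketch assumes the stages run one after another and that generation speed is constant; `LatencyBudget` and its field names are hypothetical, and the figures reuse the illustrative numbers from the text (200ms retrieval, 150ms overhead, 100 and 50 tokens per second).

```python
from dataclasses import dataclass


@dataclass
class LatencyBudget:
    retrieval_ms: float       # time spent fetching documents before generation
    overhead_ms: float        # network, serialization, and output processing
    tokens_per_second: float  # assumed constant generation speed

    def generation_ms(self, output_tokens: int) -> float:
        return 1000.0 * output_tokens / self.tokens_per_second

    def end_to_end_ms(self, output_tokens: int) -> float:
        # Sequential stages: the costs simply add up.
        return self.retrieval_ms + self.overhead_ms + self.generation_ms(output_tokens)


small = LatencyBudget(retrieval_ms=200.0, overhead_ms=150.0, tokens_per_second=100.0)
frontier = LatencyBudget(retrieval_ms=200.0, overhead_ms=150.0, tokens_per_second=50.0)

for tokens in (20, 1000):
    for name, budget in (("small", small), ("frontier", frontier)):
        total = budget.end_to_end_ms(tokens)
        fixed_share = (budget.retrieval_ms + budget.overhead_ms) / total
        print(f"{name:8s} {tokens:5d} tokens: {total:8.0f} ms, fixed costs {fixed_share:.0%}")
```

For the 20-token response the fixed costs account for roughly half or more of the total under these assumptions, while for the 1000-token response they are a few percent. Which component is worth optimizing depends on where your typical response falls.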
The Variability Problem
Average latency tells you little. A system that consistently responds in 500ms feels faster than a system that responds in 200ms half the time and 800ms the other half, even though both average 500ms. Users experience variance as unreliability. The inconsistent system can be perceived as slower even when its average is the same or better.
P99 latency (the latency at the 99th percentile) matters for user experience. If 1% of requests take 5 seconds, users who hit that 1% have a terrible experience and may abandon the system. They also tell other users about the experience. A small percentage of very slow responses can disproportionately damage perception.
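A short sketch shows how a mean can hide a slow tail that percentiles expose. It uses the nearest-rank definition of a percentile, which is one of several common conventions; the `percentile` helper and the sample data (98 requests at 300ms, 2 at 5000ms) are made up for illustration.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic data: most requests are quick, a small tail is very slow.
latencies_ms = [300.0] * 98 + [5000.0] * 2

mean_ms = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms  P50={p50:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")
```

In this synthetic sample the mean, P50, and P95 all look healthy, and only P99 reveals the five-second responses.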
Understanding what causes high-latency outliers is more valuable than optimizing the average. Often a small percentage of requests hit edge cases that trigger retries, fallbacks, or unusual processing paths. The outlier requests are not representative of normal operation, but they define user experience for the users who encounter them.
Long-tail latency analysis breaks down the distribution to find what causes slow requests. Is it a specific model version? A specific query type? A specific time of day? Finding the cause of outliers lets you address them directly rather than optimizing for averages that obscure the problem. A 10% improvement in average latency that does not affect the 99th percentile is less valuable than a 10% improvement in the 99th percentile.
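One simple way to start a long-tail breakdown is to group requests by a candidate attribute and compare how often each group exceeds a slowness threshold. The sketch below assumes each logged request carries a tag such as query type; the function name, tags, threshold, and sample values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def slow_share_by_tag(requests: Iterable[Tuple[str, float]], threshold_ms: float) -> Dict[str, float]:
    """Fraction of requests slower than threshold_ms, per tag."""
    totals: Dict[str, int] = defaultdict(int)
    slow: Dict[str, int] = defaultdict(int)
    for tag, latency_ms in requests:
        totals[tag] += 1
        if latency_ms > threshold_ms:
            slow[tag] += 1
    return {tag: slow[tag] / totals[tag] for tag in totals}


# Made-up log entries: (query type, end-to-end latency in ms).
log = [
    ("lookup", 250.0), ("lookup", 300.0), ("lookup", 280.0), ("lookup", 310.0),
    ("summarize", 900.0), ("summarize", 4200.0), ("summarize", 5100.0), ("summarize", 1100.0),
]

shares = slow_share_by_tag(log, threshold_ms=3000.0)
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag:10s} {share:.0%} of requests over 3000 ms")
```

The same grouping can be repeated with model version or hour of day as the tag. A group whose slow share is far above the others is a candidate cause worth investigating, not proof of one.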
Setting Latency Budgets
Different applications have different latency requirements. A coding assistant that suggests completions while you type needs low latency; waiting for suggestions interrupts flow. A batch processing system that generates reports overnight has no strict latency requirement; speed is nice but not critical. A chatbot that answers customer questions can tolerate moderate latency; responses that are too slow frustrate customers, but brief waits are acceptable.
Setting latency targets requires understanding user expectations and business requirements. The 90-second drive-thru target did not come from engineering constraints; it came from customer satisfaction research. The latency target for your system should come from user research, not from engineering comfort.
When latency targets conflict with quality targets, the resolution depends on which matters more. If users want fast answers more than accurate answers, optimize for speed. If accuracy matters more, accept slower responses. The trade-off should be explicit, not hidden.
Decision Rules
Optimize latency when:
- User experience research shows latency impacts satisfaction or abandonment
- The task is interactive (chat, real-time assistance)
- Competitor systems establish a latency benchmark users expect
- The cost of latency (in user satisfaction or business impact) exceeds the cost of optimization
Do not over-optimize latency when:
- Accuracy matters more than speed
- The task is batch (no user waiting)
- Latency is already within acceptable bounds and further improvement costs more than it saves
- The trade-off would degrade quality in ways users would notice
Measure:
- Time to first token (for the streaming user experience)
- End-to-end response time (from user request to final output)
- P50, P95, P99 latency (not just average)
- Variability and outliers (what causes high-latency requests)
- Component-level latency (retrieval, inference, processing) to identify bottlenecks
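Component-level latency from the list above can be collected with a small timing helper. This is a sketch under stated assumptions: `StageTimer` is a hypothetical class, the stage names mirror the ones in the list, and the `time.sleep` calls stand in for real retrieval, inference, and processing work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StageTimer:
    """Accumulates wall-clock time per named stage of a single request."""

    def __init__(self) -> None:
        self.stages_ms: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def bottleneck(self) -> str:
        """Name of the stage that consumed the most time."""
        return max(self.stages_ms, key=lambda name: self.stages_ms[name])


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)   # placeholder for document retrieval
with timer.stage("inference"):
    time.sleep(0.05)   # placeholder for model generation
with timer.stage("processing"):
    time.sleep(0.002)  # placeholder for output parsing

for name, ms in timer.stages_ms.items():
    print(f"{name:10s} {ms:7.1f} ms")
print("bottleneck:", timer.bottleneck())
```

Logging the per-stage totals alongside the end-to-end time for each request gives the raw data needed for the percentile and outlier measurements listed above.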
The 90-second timer exists because it predicts customer behavior. Know which latency metrics actually predict behavior in your system before chasing numbers.