A mid-market e-commerce retailer with roughly $200M in annual revenue had invested eighteen months building a product recommendation engine. The models were accurate. Offline evaluation showed meaningful lifts in click-through and conversion. But the system had one problem that made the business case collapse: inference latency averaged 1.2 seconds per page load, and spiked above three seconds during peak traffic.
The recommendation engine ran on a feature pipeline that assembled user profiles and product attributes in real time. Every page request triggered a call to the feature pipeline, which pulled data from six different upstream sources — browsing history, purchase history, inventory state, pricing engine output, user segment assignments, and a content tagging service. The pipeline fetched, joined, and transformed this data before passing it to the model for inference. The model itself was fast. The data assembly was not.
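The assembly bottleneck can be made concrete with a sketch. This is not the retailer's actual pipeline; the source names and per-source latencies below are illustrative assumptions chosen so the total lands near the observed 1.2 seconds.

```python
import time

# Hypothetical sketch of the original request path: every page load
# triggers live fetches from six upstream sources, then a join, then
# inference. Source names and latencies are illustrative assumptions.
UPSTREAM_SOURCES = {
    "browsing_history": 0.18,  # simulated fetch latency in seconds
    "purchase_history": 0.15,
    "inventory_state": 0.22,
    "pricing_output": 0.20,
    "user_segments": 0.12,
    "content_tags": 0.25,
}

def fetch(source: str) -> dict:
    time.sleep(UPSTREAM_SOURCES[source])  # stand-in for a network call
    return {source: "..."}

def assemble_features(user_id: str) -> dict:
    # Sequential fetch-and-join: request latency is the *sum* of the
    # upstream latencies, so data assembly dominates even when the
    # model itself is fast.
    vector = {}
    for source in UPSTREAM_SOURCES:
        vector.update(fetch(source))
    return vector

start = time.perf_counter()
assemble_features("user-42")
elapsed = time.perf_counter() - start
print(f"assembly took {elapsed:.2f}s")
```

Even with generous per-source numbers, the sum of six serial fetches puts the floor above a second before the model runs at all.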
The business had already been told three times that the infrastructure was “almost there.” Each re-architecture promised sub-300ms response times. None delivered. The CTO brought us in to determine whether the latency problem was solvable without replacing the entire recommendation stack.
What they tried
The first instinct was to cache the model predictions. The team built a cache keyed on user ID and page type. Cache hit rates during testing looked healthy. In production, they collapsed to under fifteen percent. The reason was straightforward: product inventory, pricing, and user behavior changed frequently enough that cached predictions went stale within minutes. The cache was serving stale recommendations, and the business could measure the revenue impact of showing wrong products.
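A minimal sketch of that first attempt, assuming a simple TTL cache keyed on (user ID, page type); the class name and TTL value are hypothetical:

```python
import time

# Sketch of the first attempt: cache *final predictions* keyed on
# (user_id, page_type). The flaw is visible in the structure: the TTL
# only measures age, so the cache cannot see that inventory, pricing,
# or user behavior changed upstream a moment after the entry was written.
class PredictionCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # (user_id, page_type) -> (timestamp, predictions)

    def get(self, user_id: str, page_type: str):
        entry = self._store.get((user_id, page_type))
        if entry is None:
            return None
        ts, preds = entry
        if time.monotonic() - ts > self.ttl:
            return None  # expired by age
        return preds  # may still be stale relative to upstream changes

    def put(self, user_id: str, page_type: str, predictions: list):
        self._store[(user_id, page_type)] = (time.monotonic(), predictions)

cache = PredictionCache(ttl_seconds=300)
cache.put("u1", "home", ["sku-1", "sku-2"])
```

Shortening the TTL to fight staleness drives the hit rate down; lengthening it serves wrong products. There is no good setting at this layer.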
The second attempt was to pre-compute recommendations on a schedule. Every hour, the system would generate top-N recommendations for every active user and store them in a low-latency lookup. This worked for the top of the user distribution. For the long tail — users who had not visited in days, or who had low interaction history — the pre-computed recommendations were generic. The long tail accounted for over sixty percent of traffic.
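The long-tail failure mode can be sketched as follows; the scoring function and SKU names are hypothetical stand-ins, not the team's actual batch job:

```python
# Sketch of the second attempt: an hourly batch precomputes top-N
# recommendations per active user into a low-latency lookup table.
def precompute_top_n(active_users: list, score_fn, n: int = 10) -> dict:
    table = {}
    for user in active_users:
        scored = score_fn(user)
        ranked = sorted(scored.items(), key=lambda kv: -kv[1])
        table[user] = [sku for sku, _ in ranked][:n]
    return table

def score_fn(user: str) -> dict:
    # Personalized scores exist only for users with recent history;
    # everyone else falls back to generic popularity scores.
    history = {"u_active": {"sku-1": 0.9, "sku-2": 0.4}}
    return history.get(user, {"sku-pop-1": 0.5, "sku-pop-2": 0.4})

lookup = precompute_top_n(["u_active", "u_tail"], score_fn, n=2)
# Every long-tail user receives the same generic popularity list.
```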
The third attempt was to move the feature pipeline to a faster data store. The team migrated from PostgreSQL to Redis for the hot path. Latency improved by about 40 percent, from 1.2 seconds to roughly 700ms. That was meaningful but not enough, and the operational complexity of keeping Redis consistent with the upstream sources introduced a new class of bugs.
The pattern that worked: feature store caching with freshness contracts
The problem was not the cache. The problem was caching at the wrong layer with the wrong invalidation strategy. The team was caching either the final prediction (too high-level, too stale) or the raw feature data (too low-level, still required joins). What they needed was a cache at the feature vector level, with freshness contracts that matched the actual rate of change for each feature.
We designed a feature store layer that sat between the raw data sources and the inference engine. Each feature was tagged with a freshness requirement derived from its actual business semantics:
Browsing history changed constantly, so it had a five-minute TTL. User segment assignments changed daily, so they had a twenty-four-hour TTL. Inventory state was the most volatile, with a thirty-second TTL tied to the inventory service’s change stream. The key insight was that not all features need to be equally fresh. Treating all features with the same freshness guarantee — which is what the original real-time pipeline did — forced every request to wait for the slowest feature to resolve.
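The contracts themselves can be expressed as data. The sketch below uses the TTLs described above; the registry structure and function names are illustrative assumptions, not the team's actual schema.

```python
# Per-feature freshness contracts, using the TTLs from the text.
FRESHNESS_CONTRACTS = {
    "inventory_state": 30,          # seconds; tied to the change stream
    "browsing_history": 5 * 60,     # changes constantly: five minutes
    "user_segments": 24 * 60 * 60,  # reassigned daily: twenty-four hours
}

def is_fresh(feature: str, fetched_at: float, now: float) -> bool:
    """A cached feature value is usable while its age is within its TTL."""
    return now - fetched_at <= FRESHNESS_CONTRACTS[feature]

def stale_features(vector_timestamps: dict, now: float) -> list:
    """Only features past their own TTL need re-fetching, never the whole vector."""
    return [f for f, ts in vector_timestamps.items()
            if not is_fresh(f, ts, now)]

# A ten-minute-old browsing-history value is stale;
# ten-minute-old segment assignments are still well within contract.
ages = {"browsing_history": 400.0, "user_segments": 400.0}
print(stale_features(ages, now=1000.0))
```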
The feature store was implemented as a layered cache. The hot tier held pre-assembled feature vectors in memory, keyed on user ID. A background process continuously refreshed feature vectors for active users, using the TTL contract to determine refresh priority. When a request arrived, the inference engine read the pre-assembled vector from the hot tier. If the vector was within its freshness window, the read completed in under fifty milliseconds. If a specific feature's TTL had expired, only that feature was re-fetched — not the entire vector.
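The read path can be sketched in a few lines. The class and method names below are assumptions for illustration, not the team's actual API; the background refresher is omitted, and the cold-start fallback to full assembly is folded into the read.

```python
import time

# Hedged sketch of the hot-tier read path: serve the pre-assembled
# vector, re-fetching only features whose TTL has expired.
class FeatureStore:
    def __init__(self, contracts: dict, fetchers: dict):
        self.contracts = contracts  # feature -> TTL in seconds
        self.fetchers = fetchers    # feature -> callable hitting the source
        self.hot = {}               # user_id -> {feature: (value, fetched_at)}

    def get_vector(self, user_id: str) -> dict:
        now = time.monotonic()
        entry = self.hot.get(user_id)
        if entry is None:
            # Cold start: fall back to full real-time assembly.
            entry = {f: (get(), now) for f, get in self.fetchers.items()}
        else:
            # Partial refresh: only expired features touch upstream.
            for f, (value, ts) in entry.items():
                if now - ts > self.contracts[f]:
                    entry[f] = (self.fetchers[f](), now)
        self.hot[user_id] = entry
        return {f: value for f, (value, _) in entry.items()}

store = FeatureStore(
    contracts={"inventory": 30, "segments": 86400},
    fetchers={"inventory": lambda: "in_stock", "segments": lambda: "loyal"},
)
vec = store.get_vector("u1")
```

On a warm hit with no expired features, the read never leaves memory, which is what pushes the hot path under fifty milliseconds.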
What they gave up
The trade-off was precision for speed. With real-time assembly, every feature value was as current as the upstream source at the moment of inference. With feature store caching, some features were minutes old. For a recommendation engine, this is usually acceptable. For a pricing engine, it might not be.
The second trade-off was operational complexity. The freshness contracts had to be maintained as upstream schemas changed. When a data engineer added a new feature to the browsing history pipeline, someone had to assign a TTL and register it with the feature store. The team built tooling to automate this, but the process added a step to every feature development cycle.
The third trade-off was cold start. For users who had not visited in days, the feature store had no pre-assembled vector. The system fell back to real-time assembly for these users, which meant they still experienced the old latency profile. The team accepted this because cold users were a small percentage of peak traffic, and their latency tolerance was higher anyway.
The results
After six weeks of implementation, p50 inference latency dropped from 1.2 seconds to 110ms. p99 latency dropped from 3.4 seconds to 380ms. Cache hit rates for the feature store held above ninety-two percent during peak traffic. The recommendation engine’s conversion lift was preserved because the freshness contracts ensured that volatile features like inventory and pricing were always within acceptable staleness bounds.
The team also discovered an unexpected benefit. By separating feature assembly from inference, they could iterate on model changes without touching the data pipeline. Model swaps became configuration changes rather than deployment events. Feature additions became cache registration tasks rather than pipeline rewrites.
The decision heuristic
If your inference latency problem is dominated by data assembly rather than model computation, look at the feature layer before scaling compute. The question to ask is: which features actually need to be real-time, and which ones are being fetched in real time out of habit rather than necessity? Most feature pipelines treat all features with the same urgency. That uniformity is where the latency lives. Separate the features by their actual rate of change, cache accordingly, and the inference path gets fast without any changes to the model or the serving infrastructure.
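One way to apply the heuristic before committing to any redesign is to time data assembly and model computation separately. The functions below are placeholders standing in for a real pipeline and model; the 25-to-1 ratio is an assumption mirroring the case above.

```python
import time

def timed(fn, *args):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def assemble(user_id: str) -> dict:
    time.sleep(0.5)              # placeholder for the feature pipeline
    return {"features": "..."}

def predict(features: dict) -> list:
    time.sleep(0.02)             # placeholder for model inference
    return ["sku-1"]

features, t_assembly = timed(assemble, "u1")
_, t_model = timed(predict, features)
# If assembly dominates, fix the feature layer, not the compute tier.
print(f"assembly={t_assembly:.3f}s model={t_model:.3f}s")
```

If the split comes out the other way — model time dominating — then caching features buys nothing, and scaling or optimizing compute is the right lever.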