A regional bank’s investment research team spent 60% of their time gathering information and 40% doing analysis. Analysts had to search through regulatory filings, internal research memos, market data feeds, and analyst notes to build a picture of any investment opportunity. By the time they had gathered everything, they had little time left to think.
The bank wanted AI to shift this balance. Give analysts a single interface that could query across all their information sources, surface relevant context, and summarize key findings. The goal was to reduce research time by half.
Twelve months later, they had a system in production. The actual outcome was a 73% reduction in research time, measured across six months of production usage.
The Data Landscape
Financial services data is heterogeneous in the worst way. Understanding the data landscape upfront shaped every architectural decision.
SEC filings are structured documents with XBRL tags, but the actual investment analysis lives in the prose. The XBRL tags provide standardized financial metrics, while the narrative, the risk factors, and the management discussion live in prose sections that require natural language understanding. Filings update quarterly with earnings, annually with 10-Ks, and ad-hoc for material events. The cadence is predictable but the volume is large.
Research memos are pure prose. Investment thesis, risk analysis, peer comparisons. Written by analysts with varying styles, quality, and focus. Some are comprehensive documents running dozens of pages. Some are quick takes written in fifteen minutes. The quality variance is significant, and the knowledge layer needed to surface that variance without making judgments it could not support.
Market data is numerical and time-series. Pricing, volume, fundamentals, estimates. Updates continuously during trading hours. Requires specialized queries that text-based retrieval cannot handle. A question about revenue growth is not a semantic search problem. It is a time-series lookup problem that requires accessing structured financial data.
Analyst notes are a free-form mixture of observations, calculations, and informal commentary. Often the most valuable source for understanding why something matters, but least structured. An analyst’s note might say “mgmt guided down on margins, not surprised given commodity costs, but the tone was more cautious than last quarter.” Extracting that signal requires understanding tone, context, and implication.
Architecture Decisions
Separate Retrieval from Synthesis
The bank initially considered a single system that would take a query, retrieve relevant documents, and generate a summary in one pass. This approach is simpler to build but harder to debug and optimize. When the summary is wrong, you cannot tell whether the problem is in the retrieval or in the synthesis.
The architecture separates retrieval from synthesis. A query goes to the retrieval layer, which returns a ranked list of relevant context. The synthesis layer takes that context and generates a response. Between them, there is a visible boundary.
Analysts see what was retrieved and can adjust their query if the retrieved context is incomplete. This transparency builds trust. Analysts who can see that the system retrieved the right documents but generated a wrong summary know exactly where to direct feedback. When the system retrieves wrong documents, they know the synthesis layer is not to blame.
Separation also makes it possible to optimize retrieval and synthesis independently. The retrieval layer can be improved without changing the synthesis model. A new synthesis approach can be tested against the same retrieval layer.
This separation became important when the bank later switched synthesis models to a newer version. Because the retrieval interface was stable, the switch took days instead of months. The retrieval layer had not changed, so the bank knew any new issues were in the synthesis layer.
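The visible boundary between the two stages can be sketched as a thin orchestration function. The names and types here are hypothetical (the article does not describe the actual interface); the point is that returning the retrieved context alongside the summary is what makes the boundary inspectable:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RetrievedChunk:
    source_id: str   # e.g. a filing or memo identifier
    text: str
    score: float     # retriever's relevance score

def answer(query: str,
           retrieve: Callable[[str], list[RetrievedChunk]],
           synthesize: Callable[[str, list[RetrievedChunk]], str]) -> dict:
    """Run retrieval and synthesis as separate stages, keeping the boundary visible."""
    chunks = retrieve(query)
    summary = synthesize(query, chunks)
    # Returning both lets analysts inspect what was retrieved, so a wrong
    # summary can be attributed to the correct layer.
    return {"retrieved": chunks, "summary": summary}
```

Because `retrieve` and `synthesize` are injected, either side can be swapped or tested against the other independently, which is what made the later model switch cheap.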
Hybrid Retrieval Pipeline
No single retrieval mechanism handles all of these data types, so we built a hybrid system that routes queries to the appropriate mechanism.
Vector search handles semantic queries. An analyst asking “how has this company’s competitive position evolved” gets relevant prose from multiple sources even if the exact words do not appear. The competitive position analysis from this year’s research memo, relevant passages from the competitor’s filing, context from analyst notes about market share, these all come back because vector search understands that competitive position, market share, and competitive dynamics are related.
Knowledge graph handles relationship queries. Asking “what companies in this sector have similar capital structures” traverses structured relationships in the investment database. The knowledge graph knows that companies are tagged by sector, that capital structure is a measurable property, and that similarity can be computed. The answer comes back from graph traversal, not from semantic search.
Structured queries handle factual lookup. Asking “what was the revenue growth last quarter” queries the time-series database directly. The model translates the natural language question into a structured query, executes it, and returns the number with proper attribution. This is not a retrieval problem. It is a database lookup.
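The routing step that sends each query to vector search, the knowledge graph, or the structured store can be sketched as a minimal dispatcher. The article does not detail how routing actually worked, so the keyword lists below are purely illustrative; a production router would more plausibly be a learned classifier or an LLM call:

```python
def route(query: str) -> str:
    """Illustrative heuristic router over the three retrieval mechanisms."""
    q = query.lower()
    # Factual lookups of specific numbers go to the time-series store.
    if any(k in q for k in ("what was", "how much", "revenue", "eps")):
        return "structured"
    # Relationship questions traverse the knowledge graph.
    if any(k in q for k in ("similar", "related", "which companies", "sector")):
        return "graph"
    # Everything else defaults to semantic (vector) search over prose.
    return "vector"
```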
Results from all three mechanisms feed into a reranking stage that considers relevance, recency, and source authority. A recent research memo ranks higher than an old one for most queries. An official SEC filing ranks higher than an informal analyst note for factual questions.
The reranking stage was tuned based on analyst feedback. Initially, the system weighted recency too heavily. Analysts wanted older authoritative research to appear when the question called for historical context. After adjusting the weights based on feedback, retrieval quality improved measurably.
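The reranking described above, combining relevance, recency, and source authority with tunable weights, might look like the following sketch. The weight values, the exponential recency decay, and the authority scale are all assumptions for illustration, not the bank's actual formula:

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    relevance: float   # 0..1 score from the retriever
    age_days: float    # document age
    authority: float   # e.g. 1.0 SEC filing, 0.5 research memo, 0.3 analyst note

def rerank(cands, w_rel=0.6, w_rec=0.15, w_auth=0.25, half_life=365.0):
    """Rank candidates by a weighted blend of relevance, recency, and authority."""
    def score(c):
        recency = math.exp(-c.age_days / half_life)  # decays smoothly with age
        return w_rel * c.relevance + w_rec * recency + w_auth * c.authority
    return sorted(cands, key=score, reverse=True)
```

Exposing the weights as parameters is what makes the feedback loop in the paragraph above workable: lowering the recency weight lets older authoritative research surface for historical questions.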
Ground Truth Anchoring
Financial analysis must be accurate. A hallucinated fact in an investment memo could cause real financial harm. An analyst who acts on false information makes bad decisions.
We built ground truth anchoring into the system. Every factual claim in a generated summary links back to its source. The summary might say “revenue grew 12% year-over-year.” The citation shows that this claim came from the latest 10-Q filing, from the revenue section. Analysts can verify claims by clicking through to the underlying document.
This required building citation extraction into the synthesis stage and linking to the retrieval system. When the synthesis layer generates text, it tracks which retrieved context supported each claim. The citation extraction identifies the specific source and location that supports the claim.
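One small but essential piece of this machinery is a verification pass over the generated citations: every claim's cited source must actually be present in the retrieved context. A minimal sketch, with a hypothetical claim schema:

```python
def unsupported_claims(claims: list[dict], retrieved_ids: set[str]) -> list[dict]:
    """Return claims whose cited source was not in the retrieved context.

    Each claim is assumed to carry the text and the source_id the synthesis
    layer attributed it to; any claim citing an unknown source is flagged
    rather than shown to analysts as verified.
    """
    return [c for c in claims if c["source_id"] not in retrieved_ids]
```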
It added significant complexity. The synthesis model needed to be prompted to generate citations. The retrieval layer needed to return sources with enough context for citation to work. The presentation layer needed to render citations in a way that analysts found useful.
It also made the system trustworthy enough that analysts actually used it. Without ground truth anchoring, analysts spent as much time verifying AI outputs as they would have spent doing the research manually. With anchoring, they could verify outputs quickly and trust the ones that checked out.
The citation accuracy rate of 94% was acceptable for production. Most errors were in edge cases involving compound queries, where a single question combined multiple sub-questions, or ambiguous references, where a phrase could refer to multiple entities. We built workarounds for common edge cases, reducing the error rate over time.
Data Ingestion Pipeline
Different sources required different processing. This was where the most unexpected work lived.
SEC filings went through XBRL parsing and section extraction. The XBRL tags provide structure, but we needed to extract both the tagged data and the prose sections. The filing parser identified sections by type: risk factors, management discussion, financial statements. Each section went into the appropriate retrieval mechanism. Tagged data went to structured storage. Prose sections went to the vector store with metadata about the filing and section type.
The XBRL parsing proved more complex than expected. The SEC’s XBRL taxonomy evolves, and filings sometimes use non-standard tags. We built validation that flagged filings with unusual tags for manual review. This caught errors in both our parsing and in some filings themselves, where companies had tagged data incorrectly.
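The validation that routed unusually tagged filings to manual review reduces to a set difference against the known taxonomy. The tag set below is a tiny illustrative subset; a real implementation would load the full US-GAAP taxonomy for the relevant year:

```python
# Illustrative subset only; the real US-GAAP taxonomy has thousands of elements
# and evolves across annual releases.
KNOWN_TAGS = {"us-gaap:Revenues", "us-gaap:NetIncomeLoss", "us-gaap:Assets"}

def flag_unusual_tags(filing_tags: list[str], known: set[str] = KNOWN_TAGS) -> list[str]:
    """Return tags outside the known taxonomy; a nonempty result sends the
    filing to manual review instead of straight into the pipeline."""
    return sorted(set(filing_tags) - known)
```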
Research memos went through document chunking with preservation of document structure. A research memo has sections, subsections, and sometimes appendices. Chunking by fixed size would have split sections arbitrarily. We chunked by document structure, keeping paragraphs and sections intact. The chunk metadata included the document ID, section type, and position in document.
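Structure-preserving chunking of this kind can be sketched as follows. This is a simplified version, splitting on blank lines within pre-identified sections, with a hypothetical metadata schema; the actual chunker presumably handled subsections and appendices as well:

```python
def chunk_by_structure(doc_id: str, sections: list[tuple[str, str]]) -> list[dict]:
    """Chunk a memo by its own structure rather than by fixed size.

    sections: list of (section_type, section_text) pairs.
    Paragraphs (blank-line separated) are kept intact, and each chunk
    carries document ID, section type, and position metadata.
    """
    chunks = []
    for section_type, text in sections:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, para in enumerate(paragraphs):
            chunks.append({
                "doc_id": doc_id,
                "section": section_type,
                "position": position,
                "text": para,
            })
    return chunks
```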
Market data loaded into a specialized time-series store with financial query capabilities. This was not a document problem at all. The data came from a market data vendor in a standard financial format. We built a connector that ingested the feed and stored it in a time-series database optimized for financial queries.
Analyst notes were the messiest source. They came in various formats, with varying levels of structure. Some were formal documents. Some were email threads. Some were instant message logs. We applied different extraction strategies based on format, then normalized the extracted content into a common schema.
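The per-format extraction with a common output schema amounts to a dispatch table. The extractors and schema below are hypothetical stand-ins; real email and chat-log parsing is considerably messier:

```python
from typing import Callable

# Hypothetical per-format extractors; each maps raw input to plain body text.
EXTRACTORS: dict[str, Callable[[str], str]] = {
    "email": lambda raw: raw.split("\n\n", 1)[-1],  # drop the header block
    "plain": lambda raw: raw.strip(),
}

def normalize_note(raw: str, fmt: str, extractors=EXTRACTORS) -> dict:
    """Dispatch on source format, then map into one common schema."""
    if fmt not in extractors:
        raise ValueError(f"no extractor for format: {fmt}")
    return {"source_format": fmt, "body": extractors[fmt](raw)}
```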
The knowledge graph was built incrementally. We started with a basic entity graph covering companies, executives, and sectors. As we learned what relationships analysts actually queried, we expanded. Queries about capital structure came early. Queries about supply chain relationships came later. The graph grew to match actual usage patterns.
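An incrementally grown entity graph needs very little machinery at the start; something like the sketch below, with relations added only as new query patterns demand them. The entity and relation names are illustrative:

```python
class EntityGraph:
    """Minimal incremental entity graph: relations are added only when
    actual analyst queries create a need for them."""

    def __init__(self):
        # (source_entity, relation) -> set of target entities
        self._edges: dict[tuple[str, str], set[str]] = {}

    def add(self, src: str, relation: str, dst: str) -> None:
        self._edges.setdefault((src, relation), set()).add(dst)

    def neighbors(self, src: str, relation: str) -> set[str]:
        return self._edges.get((src, relation), set())
```

Starting this small means the graph's shape is dictated by usage: a "contains" relation for sectors exists from day one, while something like supply-chain edges appears only once analysts start asking for them.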
Results and Measurements
Measured over six months of production usage, the outcomes exceeded the original goal.
Time savings were the primary metric. Average research time per analyst decreased from 4.2 hours to 1.1 hours, a 73% reduction against the 50% target. The gap between the achieved and targeted reductions came partly from efficiency gains beyond what was anticipated and partly from analysts finding additional uses for the system that further reduced manual research.
Coverage improvement was an unexpected benefit. Analysts reported accessing more information sources per research project. The system made it easier to check secondary sources because retrieval was fast. Previously, checking a secondary source meant another search session. Now it meant another query in the same interface. This led to more comprehensive research, even though that was not the original goal.
User adoption exceeded projections. Initial skepticism was high. Analysts had seen AI demos before and were unimpressed. They expected a toy that would waste their time. After six months, daily active users exceeded initial projections by 40%. The adoption came from the analysts who tried it, found it useful, and told their colleagues.
The adoption pattern was noteworthy. The first cohort of users was small and skeptical. They used the system because their manager asked them to. After a few weeks, they started telling colleagues. The second cohort was larger and less skeptical because they had heard positive reviews from peers. By month three, the system had a waitlist of analysts who wanted access.
Error rate was acceptable for production. Citation accuracy held at 94% across the six-month measurement period, with the same edge-case profile described earlier: compound queries and ambiguous references accounted for most errors, and the workarounds we built continued to reduce the error rate over time.
What We Got Wrong
The first major misstep was underestimating ingestion complexity. Initial estimates for data pipeline development were off by a factor of three. Financial data formats are idiosyncratic and poorly documented. The SEC filing format has evolved over decades and has quirks that are not obvious until you parse thousands of filings. The market data vendor changed their format twice during development without notice.
The lesson: budget for data engineering at least twice what you think it will cost. The technical complexity of parsing diverse financial data formats is systematically underestimated because it is invisible in demos and prototypes.
The second misstep was overestimating model reliability. Early testing showed impressive results. The synthesis model generated fluent, confident summaries that reviewers rated highly. Production data revealed edge cases that did not appear in testing. A query about a company going through bankruptcy proceedings produced confident nonsense. A query about a highly technical acquisition structure produced plausible but wrong analysis. Ongoing monitoring caught issues that initial testing missed, but it caught them later than it should have.
The lesson: test with production data before production launch. Testing with curated data produces unrepresentative results. Find ways to test with real data, even if that means running the system in shadow mode where it answers queries but the answers are not used.
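A shadow-mode harness is mostly plumbing: answer real production queries, record everything, serve nothing. A minimal sketch, with a hypothetical `system` callable standing in for the full pipeline:

```python
def shadow_mode(system, queries: list[str]) -> list[dict]:
    """Run the system against real queries, logging results without serving them.

    Failures are captured rather than raised, since surfacing the edge cases
    that crash or mislead the system is the whole point of the exercise.
    """
    log = []
    for q in queries:
        try:
            log.append({"query": q, "answer": system(q), "error": None})
        except Exception as exc:
            log.append({"query": q, "answer": None, "error": repr(exc)})
    return log
```

The resulting log can then be reviewed against what analysts actually produced for the same queries, which is how problems like the bankruptcy-query failures would have surfaced before launch.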
The third misstep was ignoring change management. Technical success did not guarantee adoption. Some analysts resisted because they felt the system was trying to replace them. They worried that the bank would use the system to eliminate analyst positions. Addressing this required making the system clearly helpful without threatening analyst roles. The system was positioned as giving analysts more time for the high-value analysis work, not as replacing the analysts who did that work.
This resistance came from a real concern. The bank’s leadership had made noises about AI reducing headcount. Analysts had heard these noises. They were protecting their jobs. Only after the bank’s leadership explicitly committed to no AI-driven layoffs did the resistance fade.
The lesson: organizational concerns are as significant as technical challenges. Technical systems exist in organizational contexts. Resistance that seems irrational often has rational origins in fears about job security or organizational trust. Address those fears directly, not just through technical demonstrations.
The Infrastructure Choices That Mattered
Several infrastructure decisions proved more important than expected.
The choice to separate retrieval from synthesis was the most important architectural decision. It made debugging possible. When the synthesis model generated a wrong answer, the retrieval layer could show exactly what context it had provided. This made it possible to fix both retrieval and synthesis problems efficiently.
The choice to build ground truth anchoring from the start was the second most important decision. It would have been easier to defer. Ground truth anchoring added significant development time. But without it, analysts would not have trusted the system enough to use it daily. The citation feature was not a feature. It was the foundation of trust.
The choice to build incremental knowledge graph population was the third most important decision. Trying to build a comprehensive knowledge graph upfront would have taken years and would have required understanding queries that nobody had asked yet. Instead, we started with basic entities and expanded based on actual query patterns. The knowledge graph that existed after twelve months was one we would have built differently if we had tried to design it upfront, but it was the knowledge graph that matched how the system was actually used.
Lessons for Similar Projects
The bank project offers several lessons for organizations attempting similar work.
First, budget for data engineering at least twice what you think. The work of getting data from source systems into AI-ready form is consistently underestimated because it is invisible in demos and prototypes.
Second, separate retrieval from synthesis. The ability to debug is foundational. Without it, you cannot improve the system systematically.
Third, build trust features from the start. Ground truth anchoring, citation links, source attribution. These are not nice-to-haves. They are what make analysts confident enough to use the system for real work.
Fourth, involve analysts early and often. Their feedback shaped every aspect of the system. Without their involvement, we would have built something that impressed us but that they would not have used.
Fifth, address organizational concerns directly. The change management challenge was as significant as the technical challenge. Technical success does not guarantee adoption.
The underlying lesson: production AI in financial services requires more robustness than proof-of-concept AI. The edge cases that are rare in testing are not rare enough in production when the stakes are high. Ground truth anchoring, citation accuracy, and error monitoring are not optional. They are the difference between a system analysts trust and a system they work around.