A regional bank’s investment research team spent 60% of their time gathering information and 40% doing analysis. Analysts had to search through regulatory filings, internal research memos, market data feeds, and analyst notes to build a picture of any investment opportunity. By the time they had gathered everything, they had little time left to think.
The bank wanted AI to shift this balance. Give analysts a single interface that could query across all their information sources, surface relevant context, and summarize key findings. The goal was to reduce research time by half.
Twelve months later, they had a system in production. The actual outcome was a 73% reduction in research time, measured across six months of production usage.
The Data Landscape
Financial services data is heterogeneous in the worst way. Understanding the data landscape upfront shaped every architectural decision.
SEC filings are structured documents with XBRL tags, but the actual investment analysis lives in the prose. The XBRL tags provide standardized financial metrics, while the narrative, the risk factors, and the management discussion live in prose sections that require natural language understanding. Filings update quarterly with earnings, annually with 10-Ks, and ad-hoc for material events. The cadence is predictable but the volume is large.
Research memos are pure prose. Investment thesis, risk analysis, peer comparisons. Written by analysts with varying styles, quality, and focus. Some are comprehensive documents running dozens of pages. Some are quick takes written in fifteen minutes. The quality variance is significant, and the knowledge layer needed to surface that variance without making judgments it could not support.
Market data is numerical and time-series. Pricing, volume, fundamentals, estimates. Updates continuously during trading hours. Requires specialized queries that text-based retrieval cannot handle. A question about revenue growth is not a semantic search problem. It is a time-series lookup problem that requires accessing structured financial data.
Analyst notes are a free-form mixture of observations, calculations, and informal commentary. Often the most valuable source for understanding why something matters, but least structured. An analyst’s note might say “mgmt guided down on margins, not surprised given commodity costs, but the tone was more cautious than last quarter.” Extracting that signal requires understanding tone, context, and implication.
Architecture Decisions
Separate Retrieval from Synthesis
The bank initially considered a single system that would take a query, retrieve relevant documents, and generate a summary in one pass. This approach is simpler to build but harder to debug and optimize. When the summary is wrong, you cannot tell whether the problem is in the retrieval or in the synthesis.
The architecture separates retrieval from synthesis. A query goes to the retrieval layer, which returns a ranked list of relevant context. The synthesis layer takes that context and generates a response. Between them, there is a visible boundary.
Analysts see what was retrieved and can adjust their query if the retrieved context is incomplete. This transparency builds trust. Analysts who can see that the system retrieved the right documents but generated a wrong summary know exactly where to direct feedback. When the system retrieves wrong documents, they know the synthesis layer is not to blame.
Separation also makes it possible to optimize retrieval and synthesis independently. The retrieval layer can be improved without changing the synthesis model. A new synthesis approach can be tested against the same retrieval layer.
This separation became important when the bank later switched synthesis models to a newer version. Because the retrieval interface was stable, the switch took days instead of months. The retrieval layer had not changed, so the bank knew any new issues were in the synthesis layer.
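The visible boundary between the two stages can be sketched as a thin orchestration function. The names and types here are hypothetical (the article does not describe the actual interface); the point is that returning the retrieved context alongside the summary is what makes the boundary inspectable:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RetrievedChunk:
    source_id: str   # e.g. a filing or memo identifier
    text: str
    score: float     # retriever's relevance score

def answer(query: str,
           retrieve: Callable[[str], list[RetrievedChunk]],
           synthesize: Callable[[str, list[RetrievedChunk]], str]) -> dict:
    """Run retrieval and synthesis as separate stages, keeping the boundary visible."""
    chunks = retrieve(query)
    summary = synthesize(query, chunks)
    # Returning both lets analysts inspect what was retrieved, so a wrong
    # summary can be attributed to the correct layer.
    return {"retrieved": chunks, "summary": summary}
```

Because `retrieve` and `synthesize` are injected, either side can be swapped or tested against the other independently, which is what made the later model switch cheap.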
Hybrid Retrieval Pipeline
No single retrieval mechanism handles all of these data types, so we built a hybrid system that routes queries to the appropriate mechanism.
Vector search handles semantic queries. An analyst asking “how has this company’s competitive position evolved” gets relevant prose from multiple sources even if the exact words do not appear. The competitive position analysis from this year’s research memo, relevant passages from the competitor’s filing, context from analyst notes about market share, these all come back because vector search understands that competitive position, market share, and competitive dynamics are related.
Knowledge graph handles relationship queries. Asking “what companies in this sector have similar capital structures” traverses structured relationships in the investment database. The knowledge graph knows that companies are tagged by sector, that capital structure is a measurable property, and that similarity can be computed. The answer comes back from graph traversal, not from semantic search.
Structured queries handle factual lookup. Asking “what was the revenue growth last quarter” queries the time-series database directly. The model translates the natural language question into a structured query, executes it, and returns the number with proper attribution. This is not a retrieval problem. It is a database lookup.
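The routing step that sends each query to vector search, the knowledge graph, or the structured store can be sketched as a minimal dispatcher. The article does not detail how routing actually worked, so the keyword lists below are purely illustrative; a production router would more plausibly be a learned classifier or an LLM call:

```python
def route(query: str) -> str:
    """Illustrative heuristic router over the three retrieval mechanisms."""
    q = query.lower()
    # Factual lookups of specific numbers go to the time-series store.
    if any(k in q for k in ("what was", "how much", "revenue", "eps")):
        return "structured"
    # Relationship questions traverse the knowledge graph.
    if any(k in q for k in ("similar", "related", "which companies", "sector")):
        return "graph"
    # Everything else defaults to semantic (vector) search over prose.
    return "vector"
```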
Results from all three mechanisms feed into a reranking stage that considers relevance, recency, and source authority. A recent research memo ranks higher than an old one for most queries. An official SEC filing ranks higher than an informal analyst note for factual questions.
The reranking stage was tuned based on analyst feedback. Initially, the system weighted recency too heavily. Analysts wanted older authoritative research to appear when the question called for historical context. After adjusting the weights based on feedback, retrieval quality improved measurably.
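The reranking described above, combining relevance, recency, and source authority with tunable weights, might look like the following sketch. The weight values, the exponential recency decay, and the authority scale are all assumptions for illustration, not the bank's actual formula:

```python
import math
from dataclasses import dataclass

@dataclass
class Candidate:
    relevance: float   # 0..1 score from the retriever
    age_days: float    # document age
    authority: float   # e.g. 1.0 SEC filing, 0.5 research memo, 0.3 analyst note

def rerank(cands, w_rel=0.6, w_rec=0.15, w_auth=0.25, half_life=365.0):
    """Rank candidates by a weighted blend of relevance, recency, and authority."""
    def score(c):
        recency = math.exp(-c.age_days / half_life)  # decays smoothly with age
        return w_rel * c.relevance + w_rec * recency + w_auth * c.authority
    return sorted(cands, key=score, reverse=True)
```

Exposing the weights as parameters is what makes the feedback loop in the paragraph above workable: lowering the recency weight lets older authoritative research surface for historical questions.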
Ground Truth Anchoring
Financial analysis must be accurate. A hallucinated fact in an investment memo could cause real financial harm. An analyst who acts on false information makes bad decisions.
We built ground truth anchoring into the system. Every factual claim in a generated summary links back to its source. The summary might say “revenue grew 12% year-over-year.” The citation shows that this claim came from the latest 10-Q filing, from the revenue section. Analysts can verify claims by clicking through to the underlying document.
This required building citation extraction into the synthesis stage and linking to the retrieval system. When the synthesis layer generates text, it tracks which retrieved context supported each claim. The citation extraction identifies the specific source and location that supports the claim.
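One small but essential piece of this machinery is a verification pass over the generated citations: every claim's cited source must actually be present in the retrieved context. A minimal sketch, with a hypothetical claim schema:

```python
def unsupported_claims(claims: list[dict], retrieved_ids: set[str]) -> list[dict]:
    """Return claims whose cited source was not in the retrieved context.

    Each claim is assumed to carry the text and the source_id the synthesis
    layer attributed it to; any claim citing an unknown source is flagged
    rather than shown to analysts as verified.
    """
    return [c for c in claims if c["source_id"] not in retrieved_ids]
```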
It added significant complexity. The synthesis model needed to be prompted to generate citations. The retrieval layer needed to return sources with enough context for citation to work. The presentation layer needed to render citations in a way that analysts found useful.
It also made the system trustworthy enough that analysts actually used it. Without ground truth anchoring, analysts spent as much time verifying AI outputs as they would have spent doing the research manually. With anchoring, they could verify outputs quickly and trust the ones that checked out.
The citation accuracy rate of 94% was acceptable for production. Most errors were in edge cases involving compound queries, where a single question combined multiple sub-questions, or ambiguous references, where a phrase could refer to multiple entities. We built workarounds for common edge cases, reducing the error rate over time.
Data Ingestion Pipeline
Different sources required different processing. This was where the most unexpected work lived.
SEC filings went through XBRL parsing and section extraction. The XBRL tags provide structure, but we needed to extract both the tagged data and the prose sections. The filing parser identified sections by type: risk factors, management discussion, financial statements. Each section went into the appropriate retrieval mechanism. Tagged data went to structured storage. Prose sections went to the vector store with metadata about the filing and section type.
The XBRL parsing proved more complex than expected. The SEC’s XBRL taxonomy evolves, and filings sometimes use non-standard tags. We built validation that flagged filings with unusual tags for manual review. This caught errors in both our parsing and in some filings themselves, where companies had tagged data incorrectly.
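The validation that routed unusually tagged filings to manual review reduces to a set difference against the known taxonomy. The tag set below is a tiny illustrative subset; a real implementation would load the full US-GAAP taxonomy for the relevant year:

```python
# Illustrative subset only; the real US-GAAP taxonomy has thousands of elements
# and evolves across annual releases.
KNOWN_TAGS = {"us-gaap:Revenues", "us-gaap:NetIncomeLoss", "us-gaap:Assets"}

def flag_unusual_tags(filing_tags: list[str], known: set[str] = KNOWN_TAGS) -> list[str]:
    """Return tags outside the known taxonomy; a nonempty result sends the
    filing to manual review instead of straight into the pipeline."""
    return sorted(set(filing_tags) - known)
```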
Research memos went through document chunking with preservation of document structure. A research memo has sections, subsections, and sometimes appendices. Chunking by fixed size would have split sections arbitrarily. We chunked by document structure, keeping paragraphs and sections intact. The chunk metadata included the document ID, section type, and position in document.
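Structure-preserving chunking of this kind can be sketched as follows. This is a simplified version, splitting on blank lines within pre-identified sections, with a hypothetical metadata schema; the actual chunker presumably handled subsections and appendices as well:

```python
def chunk_by_structure(doc_id: str, sections: list[tuple[str, str]]) -> list[dict]:
    """Chunk a memo by its own structure rather than by fixed size.

    sections: list of (section_type, section_text) pairs.
    Paragraphs (blank-line separated) are kept intact, and each chunk
    carries document ID, section type, and position metadata.
    """
    chunks = []
    for section_type, text in sections:
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, para in enumerate(paragraphs):
            chunks.append({
                "doc_id": doc_id,
                "section": section_type,
                "position": position,
                "text": para,
            })
    return chunks
```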
Market data loaded into a specialized time-series store with financial query capabilities. This was not a document problem at all. The data came from a market data vendor in a standard financial format. We built a connector that ingested the feed and stored it in a time-series database optimized for financial queries.
Analyst notes were the messiest source. They came in various formats, with varying levels of structure. Some were formal documents. Some were email threads. Some were instant message logs. We applied different extraction strategies based on format, then normalized the extracted content into a common schema.
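The per-format extraction with a common output schema amounts to a dispatch table. The extractors and schema below are hypothetical stand-ins; real email and chat-log parsing is considerably messier:

```python
from typing import Callable

# Hypothetical per-format extractors; each maps raw input to plain body text.
EXTRACTORS: dict[str, Callable[[str], str]] = {
    "email": lambda raw: raw.split("\n\n", 1)[-1],  # drop the header block
    "plain": lambda raw: raw.strip(),
}

def normalize_note(raw: str, fmt: str, extractors=EXTRACTORS) -> dict:
    """Dispatch on source format, then map into one common schema."""
    if fmt not in extractors:
        raise ValueError(f"no extractor for format: {fmt}")
    return {"source_format": fmt, "body": extractors[fmt](raw)}
```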
The knowledge graph was built incrementally. We started with a basic entity graph covering companies, executives, and sectors. As we learned what relationships analysts actually queried, we expanded. Queries about capital structure came early. Queries about supply chain relationships came later. The graph grew to match actual usage patterns.
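An incrementally grown entity graph needs very little machinery at the start; something like the sketch below, with relations added only as new query patterns demand them. The entity and relation names are illustrative:

```python
class EntityGraph:
    """Minimal incremental entity graph: relations are added only when
    actual analyst queries create a need for them."""

    def __init__(self):
        # (source_entity, relation) -> set of target entities
        self._edges: dict[tuple[str, str], set[str]] = {}

    def add(self, src: str, relation: str, dst: str) -> None:
        self._edges.setdefault((src, relation), set()).add(dst)

    def neighbors(self, src: str, relation: str) -> set[str]:
        return self._edges.get((src, relation), set())
```

Starting this small means the graph's shape is dictated by usage: a "contains" relation for sectors exists from day one, while something like supply-chain edges appears only once analysts start asking for them.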
Results and Measurements
Measured over six months of production usage, the outcomes exceeded the original goal.
Time savings were the primary metric. Average research time per analyst decreased from 4.2 hours to 1.1 hours, a 73% reduction against the 50% target. The gap between the achieved and targeted reductions came partly from efficiency gains beyond what was anticipated and partly from analysts finding additional uses for the system that further reduced manual research.
Coverage improvement was an unexpected benefit. Analysts reported accessing more information sources per research project. The system made it easier to check secondary sources because retrieval was fast. Previously, checking a secondary source meant another search session. Now it meant another query in the same interface. This led to more comprehensive research, even though that was not the original goal.
User adoption exceeded projections. Initial skepticism was high. Analysts had seen AI demos before and were unimpressed. They expected a toy that would waste their time. After six months, daily active users exceeded initial projections by 40%. The adoption came from the analysts who tried it, found it useful, and told their colleagues.
The adoption pattern was noteworthy. The first cohort of users was small and skeptical. They used the system because their manager asked them to. After a few weeks, they started telling colleagues. The second cohort was larger and less skeptical because they had heard positive reviews from peers. By month three, the system had a waitlist of analysts who wanted access.
Error rate was acceptable for production. Citation accuracy held at 94% across the six-month measurement period, with the same edge-case profile described earlier: compound queries and ambiguous references accounted for most errors, and the workarounds we built continued to reduce the error rate over time.
What We Got Wrong
The first major misstep was underestimating ingestion complexity. Initial estimates for data pipeline development were off by a factor of three. Financial data formats are idiosyncratic and poorly documented. The SEC filing format has evolved over decades and has quirks that are not obvious until you parse thousands of filings. The market data vendor changed their format twice during development without notice.
The lesson: budget for data engineering at least twice what you think it will cost. The technical complexity of parsing diverse financial data formats is systematically underestimated because it is invisible in demos and prototypes.
The second misstep was overestimating model reliability. Early testing showed impressive results. The synthesis model generated fluent, confident summaries that reviewers rated highly. Production data revealed edge cases that did not appear in testing. A query about a company going through bankruptcy proceedings produced confident nonsense. A query about a highly technical acquisition structure produced plausible but wrong analysis. Ongoing monitoring caught issues that initial testing missed, but it caught them later than it should have.
The lesson: test with production data before production launch. Testing with curated data produces unrepresentative results. Find ways to test with real data, even if that means running the system in shadow mode where it answers queries but the answers are not used.
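A shadow-mode harness is mostly plumbing: answer real production queries, record everything, serve nothing. A minimal sketch, with a hypothetical `system` callable standing in for the full pipeline:

```python
def shadow_mode(system, queries: list[str]) -> list[dict]:
    """Run the system against real queries, logging results without serving them.

    Failures are captured rather than raised, since surfacing the edge cases
    that crash or mislead the system is the whole point of the exercise.
    """
    log = []
    for q in queries:
        try:
            log.append({"query": q, "answer": system(q), "error": None})
        except Exception as exc:
            log.append({"query": q, "answer": None, "error": repr(exc)})
    return log
```

The resulting log can then be reviewed against what analysts actually produced for the same queries, which is how problems like the bankruptcy-query failures would have surfaced before launch.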
The third misstep was ignoring change management. Technical success did not guarantee adoption. Some analysts resisted because they felt the system was trying to replace them. They worried that the bank would use the system to eliminate analyst positions. Addressing this required making the system clearly helpful without threatening analyst roles. The system was positioned as giving analysts more time for the high-value analysis work, not as replacing the analysts who did that work.
This resistance came from a real concern. The bank’s leadership had made noises about AI reducing headcount. Analysts had heard these noises. They were protecting their jobs. Only after the bank’s leadership explicitly committed to no AI-driven layoffs did the resistance fade.
The lesson: organizational concerns are as significant as technical challenges. Technical systems exist in organizational contexts. Resistance that seems irrational often has rational origins in fears about job security or organizational trust. Address those fears directly, not just through technical demonstrations.
The Infrastructure Choices That Mattered
Several infrastructure decisions proved more important than expected.
The choice to separate retrieval from synthesis was the most important architectural decision. It made debugging possible. When the synthesis model generated a wrong answer, the retrieval layer could show exactly what context it had provided. This made it possible to fix both retrieval and synthesis problems efficiently.
The choice to build ground truth anchoring from the start was the second most important decision. It would have been easier to defer. Ground truth anchoring added significant development time. But without it, analysts would not have trusted the system enough to use it daily. The citation feature was not a feature. It was the foundation of trust.
The choice to build incremental knowledge graph population was the third most important decision. Trying to build a comprehensive knowledge graph upfront would have taken years and would have required understanding queries that nobody had asked yet. Instead, we started with basic entities and expanded based on actual query patterns. The knowledge graph that existed after twelve months was one we would have built differently if we had tried to design it upfront, but it was the knowledge graph that matched how the system was actually used.
Lessons for Similar Projects
The bank project offers several lessons for organizations attempting similar work.
First, budget for data engineering at least twice what you think. The work of getting data from source systems into AI-ready form is consistently underestimated because it is invisible in demos and prototypes.
Second, separate retrieval from synthesis. The ability to debug is foundational. Without it, you cannot improve the system systematically.
Third, build trust features from the start. Ground truth anchoring, citation links, source attribution. These are not nice-to-haves. They are what make analysts confident enough to use the system for real work.
Fourth, involve analysts early and often. Their feedback shaped every aspect of the system. Without their involvement, we would have built something that impressed us but that they would not have used.
Fifth, address organizational concerns directly. The change management challenge was as significant as the technical challenge. Technical success does not guarantee adoption.
The underlying lesson: production AI in financial services requires more robustness than proof-of-concept AI. The edge cases that are rare in testing are not rare enough in production when the stakes are high. Ground truth anchoring, citation accuracy, and error monitoring are not optional. They are the difference between a system analysts trust and a system they work around.