Four runners, one baton, four legs of a relay race. Runner A sprints the first leg, hands to Runner B, who sprints the second, hands to C, who hands to D, who crosses the finish line. None of them runs the whole track alone. The baton is the intermediate output. Each runner does their specific segment with context from what came before. The relay only works if each handoff is clean. If Runner B drops the baton, the race is over regardless of how well A, C, and D ran.
Prompt chaining applies the relay to language model tasks. Instead of one model call that does everything, you chain multiple model calls where each call handles one step and passes its output to the next. The output of step one becomes part of the input to step two. The baton format is the interface contract between stages. If the baton is ambiguous, the next runner stumbles.
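The structure can be sketched in a few lines. A minimal two-stage chain, with a hypothetical `call_model` stub standing in for a real LLM client so the example runs on its own:

```python
# Minimal two-stage chain. `call_model` is a hypothetical stand-in for a
# real LLM client; here it just echoes its prompt so the sketch is runnable.

def call_model(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return f"output-for({prompt})"

def classify(query: str) -> str:
    # Leg one: the only job is to produce a category label.
    return call_model(f"Classify this query into a category: {query}")

def synthesize(query: str, category: str) -> str:
    # Leg two: consumes the baton (category) plus the original query.
    return call_model(f"Answer this {category} query: {query}")

def chain(query: str) -> str:
    baton = classify(query)          # first handoff
    return synthesize(query, baton)  # second leg

print(chain("How do I reset my password?"))
```

The output of `classify` is the baton: it appears verbatim inside the prompt of `synthesize`, which is exactly the coupling the rest of this piece is about.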
Why Break the Race Into Legs
A single model call processing a complex task has to hold the whole problem in its context. It reasons, it retrieves, it formats, it filters: all at once. Breaking this into stages lets each stage focus. A classification step knows only that it needs to categorize the input. A retrieval step knows only that it needs documents relevant to a given category. A synthesis step knows only that it needs to write an answer from the retrieved documents. Each stage does one thing, which means each stage can be optimized for that one thing. The relay race is faster than one runner doing all four legs because specialization pays.
Breaking the task into stages also lets you validate between steps. If step one produces malformed output, you catch it before it propagates downstream. A classification step that returns an unrecognized category can be flagged and handled before the retrieval step tries to use it. A retrieval step that returns no results can trigger a fallback before the synthesis step produces a response without grounding. The chain breaks cleanly at the point of failure rather than producing a wrong answer with no traceable origin. In the relay race, if Runner A hands off a deformed baton, Runner B can refuse it. In a monolithic prompt, the model produces garbage and you do not know whether the problem was in the reasoning or in the grounding.
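Refusing a deformed baton can be as simple as a membership check between stages. A sketch, assuming a fixed category set; `classify_stub` is hypothetical and a real version would call a model:

```python
# Inter-stage validation: catch an unrecognized category at the handoff,
# before retrieval tries to use it. `classify_stub` is a hypothetical
# stand-in for a model-backed classifier.

VALID_CATEGORIES = {"billing", "technical", "account"}

def classify_stub(query: str) -> str:
    return "billing" if "invoice" in query else "unknown"

def validated_classify(query: str) -> str:
    category = classify_stub(query)
    if category not in VALID_CATEGORIES:
        # Fail here, at the point of failure, not three stages downstream.
        raise ValueError(f"unrecognized category: {category!r}")
    return category

print(validated_classify("Where is my invoice?"))
```

The chain now breaks loudly at the classification stage instead of producing a plausible-looking wrong answer at the end.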
Chaining also lets you use different models for different steps. A lightweight model might handle classification or routing. A heavier model handles the generation. A specialized model handles a domain-specific step. Each leg of the race goes to the runner best suited for that distance. A cheap fast model for classification earns its keep when the classification step does not need GPT-4-class reasoning. The heavier model is reserved for the step that actually needs it. The cost savings compound across high-volume workloads: cheap classification on thousands of queries, expensive synthesis only on the queries that passed classification.
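Per-stage model selection can live in a small routing table. The model names and the `call` helper below are illustrative placeholders, not real provider APIs:

```python
# Per-stage model selection (sketch). Model names and `call` are
# hypothetical; substitute your provider's client and model identifiers.

MODELS = {
    "classify": "small-fast-model",   # cheap leg, runs on every query
    "synthesize": "large-model",      # expensive leg, runs only when needed
}

def call(model: str, prompt: str) -> str:
    # Stand-in for a real API call.
    return f"{model}::{prompt}"

def answer(query: str) -> str:
    category = call(MODELS["classify"], f"classify: {query}")
    return call(MODELS["synthesize"], f"answer ({category}): {query}")

print(answer("reset password"))
```

Swapping either model is a one-line change to the table, which is the point: the stage boundary is also the cost boundary.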
Parallel development is another benefit. Each stage can be developed and tested independently. One team can work on the retrieval stage while another works on the synthesis stage, with the interface contract (the baton format) defined upfront. This separation reduces integration complexity and enables independent iteration. You can swap the retrieval model without changing synthesis. You can add a new routing stage without touching the existing stages.

The Costs of the Handoff
Every handoff costs time. The baton must be passed cleanly: if the output of step one is ambiguous, step two starts from a bad place. If step one loses critical context, step two cannot recover it. The chain is only as strong as its weakest link. A classification step that collapses multiple distinct categories into one will misroute downstream processing in ways that the downstream cannot detect. Runner B runs fast but with the wrong baton; Runner C cannot compensate.
The baton-passing problem is worse than it first appears. A model that is excellent at a task in isolation is not necessarily excellent when its input is the output of a previous model. The format, the level of detail, the implicit assumptions: all of these may not transfer cleanly between stages. Building a chain means designing the baton format explicitly, not assuming that natural language output from step one will be perfectly parseable by step two. The format is the API contract, and like all API contracts, it needs design-time attention.
A structured baton format reduces handoff ambiguity. Rather than passing natural language prose between stages, pass structured data: JSON with explicit fields, classification labels, confidence scores. This gives the next stage clear inputs it can parse reliably. The cost is losing the nuance that natural language carries, but for stage-to-stage handoff, reliability matters more than nuance. The baton format is not the place for eloquence; it is the place for precision.
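One way to make the baton explicit is a small typed structure serialized as JSON. The field names here are illustrative, not a standard:

```python
# A structured baton (sketch). Field names are illustrative.
from dataclasses import dataclass, asdict
import json

@dataclass
class Baton:
    original_query: str   # carried through so later stages keep full context
    category: str
    confidence: float

baton = Baton(original_query="Why was I charged twice?",
              category="billing", confidence=0.92)

# Serialize for the next stage; JSON parses unambiguously, prose does not.
wire = json.dumps(asdict(baton))
received = Baton(**json.loads(wire))
print(received.category)
```

Note the `original_query` field: carrying the original input in the baton, not just the previous stage's output, is what prevents context loss later in the chain.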
Debugging is harder in chains. An error in a chained system requires tracing the baton through each stage. Which step introduced the bad output? Was it the input to that step, or the step itself? You need observability across the full chain. Logging the input and output of each stage is not optional; it is how you debug failures. Without that visibility, you are guessing which runner dropped the baton. Instrument each handoff. Store each baton. The debugging cost of a chain is paid in observability infrastructure; do not cheap out here.
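Instrumenting each handoff can be a thin wrapper around every stage call. A sketch, with an in-memory list standing in for whatever log store you actually use:

```python
# Handoff instrumentation (sketch): store every baton, pass or fail.
# HANDOFF_LOG stands in for a real log store.
import time

HANDOFF_LOG: list = []

def run_stage(name: str, fn, payload):
    record = {"stage": name, "input": payload, "ts": time.time()}
    try:
        record["output"] = fn(payload)
        record["status"] = "success"
    except Exception as exc:
        record["output"] = None
        record["status"] = f"failure: {exc}"
        raise
    finally:
        HANDOFF_LOG.append(record)   # every baton is stored, even on failure
    return record["output"]

out = run_stage("classify", lambda q: q.upper(), "refund request")
print(out, len(HANDOFF_LOG))
```

When a downstream stage produces garbage, the log answers the first debugging question directly: what baton did it actually receive?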
Error propagation in chains is nontrivial. A failure in step three may manifest as a bad output that looks like a step three failure but originated in step one providing poor context. Without tracing the baton through each stage, you will fix step three while the real problem remains in step one. The symptom is in step three; the cause is in step one. This is the classic distributed systems debugging problem, and chains exhibit it in full.
Parallelism Lost
A single prompt can often do several things at once: extract multiple facts, answer multiple questions, process multiple documents. Chain that and you serialize what could have been parallel. If your task involves analyzing five documents, a single model call with all five documents can reason across them together. A chain that processes them one by one cannot capture cross-document relationships unless there is a stage explicitly designed to merge insights.
The chain trades throughput for focus and control. If your workload is high-volume and the steps are simple, the parallel call is faster. If your workload is complex and the steps genuinely need to be sequential, the chain is worth the latency. The decision should be driven by whether the steps have real dependencies. If they do, the chain is correct by definition. If they do not, consider whether the chain is adding latency for no reason. The relay race is slower than four runners going simultaneously; it is only worth it when the handoff is necessary.
A parallel approach that processes all documents in one call also preserves context across documents. A chain that processes documents sequentially loses the ability to compare, contrast, and synthesize across the full set until possibly a final stage. If cross-document synthesis is important, forcing a chain may degrade output quality. The synthesis stage sees only what the retrieval stage passed; if retrieval filtered out something important, synthesis cannot recover it.
Some chains can be partially parallelized. If step two and step three both take step one output as input and have no dependency on each other, they can run in parallel. This requires designing the chain with parallelism in mind, not just assuming sequential is the only structure. The dependency graph matters. Draw it before you code it.
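When two stages depend only on stage one, the fan-out is straightforward with a standard thread pool. The stage functions below are stand-ins for model calls:

```python
# Partial parallelization (sketch): two stages that both depend only on
# stage one's output run concurrently. Stage bodies are stand-ins.
from concurrent.futures import ThreadPoolExecutor

def stage_one(query: str) -> str:
    return query.strip().lower()

def extract_entities(text: str) -> list:    # depends only on stage one
    return text.split()

def score_sentiment(text: str) -> float:    # also depends only on stage one
    return 1.0 if "great" in text else 0.0

base = stage_one("  Great product, shipping was slow  ")

with ThreadPoolExecutor() as pool:
    entities_f = pool.submit(extract_entities, base)
    sentiment_f = pool.submit(score_sentiment, base)

print(entities_f.result(), sentiment_f.result())
```

The dependency graph, not the code's visual order, determines what can be submitted together.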
When the Chain Is the Right Design
The chain is correct when the steps are genuinely sequential dependencies. Contract review is a good example: extract the parties, extract the key clauses, extract the risks, summarize the risks. Each step depends on the output of the previous step. The legal team needs the clause extraction before the risk assessment. The summary needs the risk assessment. The order is real. You cannot assess risks before you know what clauses exist; you cannot summarize risks before you have assessed them.
The chain is also correct when you need intervention points. If human review belongs between the retrieval and the synthesis, the chain makes that explicit. A stage-based architecture lets you insert a human task at a defined point. A monolithic prompt does not. The chain is a workflow; the workflow may include human steps. The chain makes those steps visible and explicitly placed.
The chain is also correct when you need to route to different downstream paths based on intermediate output. A classification step that decides between three categories can route to category-specific synthesis prompts. This conditional routing is cleaner in a chain architecture than in a single prompt trying to handle all cases. The baton tells you which path to take; a single prompt must decide internally.
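Conditional routing reduces to a dispatch table keyed on the baton's category. The handlers and route names here are illustrative:

```python
# Conditional routing (sketch): the baton's category selects the
# downstream handler. Handlers and categories are illustrative.

def billing_answer(q: str) -> str:
    return f"[billing] {q}"

def technical_answer(q: str) -> str:
    return f"[technical] {q}"

def fallback_answer(q: str) -> str:
    return f"[general] {q}"

ROUTES = {"billing": billing_answer, "technical": technical_answer}

def route(category: str, query: str) -> str:
    handler = ROUTES.get(category, fallback_answer)  # explicit default path
    return handler(query)

print(route("billing", "Why was I charged twice?"))
```

The explicit default path matters: an unrecognized category goes somewhere deliberate rather than silently failing.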
The chain is wrong when steps are parallelizable and the throughput matters more than the validation between stages. For a high-volume FAQ system where every query is independent and speed matters, a single model call is probably the right choice. The relay is not always faster than one runner doing all the work. It is faster only when the handoff adds value.
Chain Design Principles
Good chain design starts with clear stage boundaries. Each stage should have a single, well-defined purpose. If a stage is doing classification and extraction simultaneously, consider splitting it. Single-purpose stages are easier to test, debug, and replace. A stage that does two things will be harder to test than two stages that each do one thing.
The baton format between stages deserves as much design attention as the prompts themselves. A structured format with explicit fields reduces ambiguity. Including the original input at each stage (not just the previous stage output) prevents context loss through the chain. If stage three needs context from stage one, and stage two did not pass it through, stage three is missing information. The baton should carry what each stage needs, not just what the previous stage produced.
Build each stage to fail visibly. If a stage cannot produce valid output, it should fail explicitly rather than passing garbage downstream. A chain that fails loudly at the problem stage is better than a chain that produces wrong output from a silent failure. The baton format should include a status field: success, failure, or ambiguous. Downstream stages should handle failure status explicitly.
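The status field can look like this in practice. The classification logic below is a stand-in; the three status values follow the scheme described above:

```python
# A baton that fails visibly (sketch). Status values follow the scheme
# above: success, failure, ambiguous. The stage logic is a stand-in.

def classify_with_status(query: str) -> dict:
    if not query.strip():
        return {"status": "failure", "category": None,
                "error": "empty query"}
    if "?" not in query:
        return {"status": "ambiguous", "category": None,
                "error": "could not determine intent"}
    return {"status": "success", "category": "question", "error": None}

def next_stage(baton: dict) -> str:
    if baton["status"] != "success":
        # Refuse the baton instead of running on garbage.
        return f"halted: {baton['error']}"
    return f"handling {baton['category']}"

print(next_stage(classify_with_status("Where is my order?")))
```

The downstream stage checks the status before doing any work, which is the Runner B refusing the deformed baton rather than sprinting with it.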
Stage ordering should reflect real dependencies, not just conceptual ones. If step B does not actually need step A's output to function, question whether the ordering is necessary at all; the two may be candidates for parallel execution. Dependency creates ordering; convention does not. Many chains are sequential because the designers followed the logical flow, not because the dependencies actually require it. Audit the dependency graph before committing to the sequence.
Use prompt chaining when the task genuinely decomposes into distinct stages with real dependencies between them, when you need validation between steps before proceeding, when different stages benefit from different models, when you need to inject intermediate results or external data between steps, when debugging which step failed matters for your use case, when human review or intervention needs a defined insertion point, and when the task order reflects a genuine workflow.
Use a single prompt when the task is simple enough for one call, when latency is critical and the steps could run in parallel, when the context required fits comfortably in the window, when adding chain complexity does not buy enough clarity or reliability, when cross-document or cross-query synthesis requires joint context, and when throughput is more important than stage-level observability.
The baton passes. Each runner trusts the previous one left the baton in the right hand. Build your chain accordingly, and instrument each handoff. The relay is only as strong as its weakest runner and its sloppiest handoff.