A language model that only generates text is not enough for most enterprise problems. The real value emerges when an AI system can look up your customer record, check inventory levels across warehouses, update a support ticket status, or trigger a workflow in your ERP system. These actions require the model to interact with external systems that hold authoritative data and can perform state changes. Without that capability, you have a sophisticated text generator. With it, you have a system that can actually accomplish work.
The gap between a model that reasons and a model that acts is where most of the engineering complexity lives in production AI systems. Reasoning is abstract. Action requires interfacing with systems that have their own schemas, error modes, authentication requirements, and operational rhythms. A model that can explain your return policy is useful. A model that can look up a specific customer’s return eligibility, check what items they have returned in the past, and initiate a return workflow is transformative.
But bridging that gap is not as simple as connecting the model to an API. Enterprise systems are messy. They have inconsistent interfaces. They change without warning. They fail in ways that models do not handle well. And they create security boundaries that models do not understand. Building a production tool calling system requires solving problems that the model documentation does not address.
Tool calling is the mechanism that makes this possible. It bridges the gap between the model’s reasoning capabilities and the operational systems that manage real business state. But the terminology in this space is messy, with vendors layering marketing language over a relatively small set of underlying concepts. The approaches differ in ways that have significant implications for enterprise architects who are trying to design reliable, maintainable systems.
This is not a fundamentally new problem in software engineering. Systems have always needed to bridge the gap between abstract reasoning and concrete state changes. What is new is that the model can now decide when and how to make those state changes autonomously, which shifts where you must place your validation logic, authorization checks, and error handling. In traditional software, the programmer decides when to call an API. With tool calling, the model decides, which means you must validate the model’s decisions at runtime rather than design time.
What the Terms Actually Mean
The AI industry has produced a confusing array of terminology for what is essentially the same underlying capability. Understanding what each term actually refers to requires stripping away the vendor marketing and examining the mechanisms underneath.
Function calling is OpenAI’s terminology for this capability. The pattern works like this: you provide the model with a schema that describes one or more functions, including their names, parameter structures, and expected behaviors. During a conversation, the model decides when to call a function and returns structured arguments that specify which function to call and what values to pass. Your application code receives this structured request, executes the actual function, and returns the result to the model, which incorporates it into its response.
The critical distinction is that the model does not execute code. It produces structured data that your application interprets and acts upon. When the model returns a structured request to check inventory for SKU 12345, your application is responsible for actually querying the inventory system. The model recommends; your code acts. This separation is fundamental to how tool calling works and to the security model you must build around it.
The function calling interface is defined by a schema you provide. The schema describes the function name, its parameters, and their types. When the model decides to call the function, it returns a JSON object with the parameter values. Your code then calls the actual function and returns the result. The model never sees your code. It never executes database queries or API calls. It only produces structured data that your application acts on.
This matters for security because the attack surface is different from traditional code execution. A SQL injection attack exploits a code execution path. A function calling attack exploits the model’s ability to generate structured requests. If a user can manipulate the model into generating function calls it should not make, they can potentially access data or trigger actions that the model’s function calling interface exposes. Your security model must account for this.
Tool use is Anthropic’s equivalent terminology. The semantic behavior is identical: the model returns a tool use request with a function name and input arguments, your code handles execution, and the output returns to the model. The difference is primarily in the output format and some nuanced aspects of how multi-step tool use chains are handled. For most architectural purposes, the distinction is negligible.
Anthropic’s tool use format includes a name field and an input field, similar to OpenAI’s function calling format. The conceptual model is the same: the model reasons about when to call a tool and produces structured arguments, your code executes the tool and returns results. The differences are in the specifics of how the structured data is formatted and how the model is instructed about available tools.
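The format difference is small enough to normalize away in the routing layer. The sketch below shows the two request shapes side by side; the tool name `check_return_eligibility` is hypothetical, but the field names follow each vendor's documented formats: OpenAI delivers arguments as a JSON-encoded string, Anthropic delivers input as an already-parsed object.

```python
import json

openai_style = {
    # OpenAI-style function call: arguments arrive as a JSON-encoded string.
    "name": "check_return_eligibility",
    "arguments": '{"customer_id": "C-1001"}',
}

anthropic_style = {
    # Anthropic-style tool use: input arrives as an already-parsed object.
    "name": "check_return_eligibility",
    "input": {"customer_id": "C-1001"},
}

def normalize(call: dict) -> dict:
    """Normalize either shape to {name, args} for a shared routing layer."""
    if "arguments" in call:  # OpenAI-style: decode the JSON string
        return {"name": call["name"], "args": json.loads(call["arguments"])}
    return {"name": call["name"], "args": call["input"]}  # Anthropic-style
```

Normalizing at the edge like this keeps authorization, validation, and logging code provider-agnostic.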
Model Context Protocol (MCP) represents a newer approach that attempts to standardize how AI systems discover and interact with tools. Rather than hardcoding function schemas into every conversation, MCP allows models to query a manifest of available tools and their interfaces at runtime. This creates a more dynamic model where the model can browse available capabilities, understand what tools exist without having them explicitly enumerated in the prompt, and potentially use tools that were not anticipated at design time.
MCP shifts the architecture from static schema definition to dynamic discovery. In the traditional approach, you enumerate all tools and their schemas at the start of the conversation. The model decides which to call from that known set. In MCP, the model queries a registry to discover what tools are available, learns their interfaces, and can use tools you did not explicitly tell it about at the start of the conversation.
This has implications for both capability and governance. The model can adapt to new tools without requiring you to update the conversation context. But you lose the ability to audit exactly what tools the model knows about at any given moment. The registry becomes a critical component that you must secure and monitor. An insecure registry is a pathway for unauthorized tool access.
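The shift from static enumeration to runtime discovery can be illustrated with a toy registry. This is not the MCP wire protocol (which is JSON-RPC based); it only sketches the architectural difference: tools registered after the session starts are still discoverable, which is exactly why the registry needs its own access control and audit trail.

```python
class ToolRegistry:
    """Stand-in for an MCP-style server that lists available tools at runtime."""

    def __init__(self):
        self._tools: dict[str, dict] = {}

    def register(self, name: str, schema: dict) -> None:
        self._tools[name] = schema

    def list_tools(self) -> list[dict]:
        # The client queries this at runtime instead of receiving a fixed
        # schema list baked into the prompt.
        return [{"name": n, **s} for n, s in self._tools.items()]

registry = ToolRegistry()
registry.register("check_inventory", {"description": "On-hand quantity by SKU"})

# Later, a new tool appears with no change to the conversation setup:
registry.register("get_lead_time", {"description": "Supplier lead time by part"})

print([t["name"] for t in registry.list_tools()])
```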
The Architecture of the Routing Layer
The model deciding to call a tool is the easy part. The routing layer that handles tool call requests is where the actual engineering complexity lives. You need to validate tool call requests, enforce permissions, handle timeouts, manage retries, log interactions for audit purposes, and do all of this with latency low enough that the overall interaction still feels responsive to users.
Consider what actually happens when a tool call request arrives at your routing layer. The request contains a function name, arguments, a trace ID for correlation, and a user context. Your routing layer must verify that the user making the request has permission to perform that action. This is not trivial: the model is acting on behalf of a user, but the tool call request arrives as a structured payload, not as an authenticated API call with the user’s credentials attached.
The user context in an AI session is different from the user context in a traditional API call. In a traditional API, the user authenticates directly and their credentials travel with the request. In an AI session, the user authenticates to start the session, but subsequent tool calls are generated by the model based on the conversation. The tool call request does not carry the user’s API credentials. It carries a conversation ID and the model’s generated arguments.
Your routing layer must bridge this gap. It must take the conversation ID, look up the user context, determine what permissions that user has, and validate the tool call against those permissions. This is additional work that traditional API gateways do not need to do because the authentication is built into the request.
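That session-resolution step can be sketched as follows. All names here are hypothetical (the session store, the role names, the tool names); the shape is what matters: resolve the conversation ID to an authenticated user context first, then authorize, and fail closed whenever either lookup misses.

```python
# Populated when the user authenticates and the session starts.
SESSIONS = {"conv-8f2a": {"user_id": "u-77", "roles": {"support_agent"}}}

# Minimum role required to invoke each tool.
TOOL_ROLE = {"lookup_customer": "support_agent", "delete_record": "admin"}

def authorize(conversation_id: str, tool_name: str) -> bool:
    """Resolve conversation -> user context, then check the role for the tool."""
    session = SESSIONS.get(conversation_id)
    if session is None:
        return False  # unknown or expired session: reject
    required = TOOL_ROLE.get(tool_name)
    if required is None:
        return False  # unregistered tool: fail closed, not open
    return required in session["roles"]
```

A traditional API gateway gets the first half of this for free from the bearer token on the request; here the routing layer must reconstruct it on every tool call.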
You must also verify that the arguments are within expected bounds. If the model recommends deleting a customer record, you need to verify that the record ID provided actually belongs to the customer whose data is being deleted. The model may have hallucinated a record ID or may have been manipulated by a user into generating a delete request for the wrong record. Input validation is your defense.
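An ownership check for the delete example might look like the sketch below. The record store and ID formats are invented for illustration; the point is that a model-supplied ID is untrusted input and must be validated against authoritative state before the action executes.

```python
# Hypothetical authoritative mapping of record IDs to their owners.
RECORD_OWNER = {"rec-100": "u-77", "rec-200": "u-88"}

def validate_delete_args(args: dict, acting_user: str) -> tuple[bool, str]:
    """Reject deletes whose record ID is missing, unknown, or not owned."""
    record_id = args.get("record_id")
    if not isinstance(record_id, str):
        return False, "record_id missing or not a string"
    owner = RECORD_OWNER.get(record_id)
    if owner is None:
        return False, f"no such record: {record_id}"  # possibly a hallucinated ID
    if owner != acting_user:
        return False, "record does not belong to the requesting user"
    return True, "ok"
```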
You must handle the case where the external service is slow, down entirely, or returning unexpected data formats. A tool call that times out is not a failure you can ignore. The model has made a decision based on the assumption that the tool would respond. When it does not, the model needs to handle that gracefully, which means your routing layer must provide error responses that the model can reason about. “Connection timed out after 30 seconds” is a more useful error for the model than a raw exception string.
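One way to produce model-readable errors is to wrap every tool execution with a deadline and always return structured JSON, success or failure. This is a sketch under the assumption that tools run as plain Python callables; a real system would likely use async I/O or a job queue instead of a thread pool.

```python
import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(fn, args: dict, timeout_s: float) -> str:
    """Run one tool call with a deadline; always return structured JSON."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, **args)
    try:
        result = future.result(timeout=timeout_s)
        return json.dumps({"ok": True, "result": result})
    except FutureTimeout:
        # Phrased so the model can reason about it, not a raw stack trace.
        return json.dumps({"ok": False,
                           "error": f"Connection timed out after {timeout_s} seconds"})
    except Exception as exc:
        return json.dumps({"ok": False, "error": f"tool failed: {exc}"})
    finally:
        pool.shutdown(wait=False)  # do not block waiting on a hung call
```

Because the error is structured, the model can decide to retry, apologize, or fall back to another tool instead of surfacing an opaque failure.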
You must log the interaction for audit purposes in ways that satisfy regulatory requirements. When a model deletes a customer record, you need to log who requested it, what conversation led to it, what the model’s reasoning was, and what the outcome was. This is harder than traditional audit logging because the “who” is indirect: a user had a conversation that led to a model decision that led to an action.
This is ordinary API gateway work, but it sits in a novel position: between an AI system that is making increasingly autonomous decisions and backend systems that expect those decisions to be validated. The model might recommend an action. The routing layer must authorize it. The authorization logic cannot be an afterthought.
The routing layer also needs to handle the unexpected in systematic ways. What happens when the model asks to call a function that does not exist in your registry? Your system should have a defined response rather than silently failing. What happens when arguments are malformed or contain values that would cause the external service to error? You need input validation that catches this before the call executes. What happens when a tool call succeeds but the result is too large to fit back in the context window? You need truncation strategies that preserve the most relevant information.
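The oversized-result case in particular benefits from an explicit strategy. A minimal one, sketched below, keeps the head of the payload and appends a marker the model can see, so the model knows it is working with partial data rather than silently losing the tail. The character budget is an illustrative stand-in for a real token budget.

```python
def truncate_for_context(result: str, max_chars: int = 4000) -> str:
    """Trim oversized tool output, leaving an explicit marker for the model."""
    if len(result) <= max_chars:
        return result
    kept = result[:max_chars]
    omitted = len(result) - max_chars
    return kept + f"\n[TRUNCATED: {omitted} characters omitted; refine the query]"
```

More sophisticated strategies summarize or rank before truncating, but even this naive version beats letting the context window clip the result invisibly.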
Failure Modes Worth Anticipating
Production tool calling systems fail in ways that are predictable once you know to look for them. Understanding these failure modes before you deploy saves significant debugging time later.
Illicit tool chaining occurs when a model with access to multiple tools chains them together in ways you did not anticipate or test. A user asks for your Q4 revenue and the model calls your financial API to get revenue figures, then your email system to check who received the financial reports, then your CRM to cross-reference which of those contacts are decision-makers for budget approvals. None of these three calls were intended to work together, but individually each was authorized. The model has effectively created an unauthorized data aggregation pipeline.
The risk here is cross-system data combination. Each tool call is authorized individually. But the model is now combining data from three systems that were not designed to be queried together. The financial API is authorized to return revenue figures. The email system is authorized to show who received reports. The CRM is authorized to show contact information. Individually, these are fine. Together, they might reveal information that violates data governance policies.
Consider implementing call chaining validation that verifies each step in a sequence was authorized independently before permitting the sequence to execute. One automotive parts supplier discovered this failure mode in production. Their AI assistant had access to the inventory system, the supplier database, and the pricing API. A user asked about lead times for a component, and the model decided to check supplier capacity by querying the supplier database, cross-reference that with current inventory levels in the inventory system, and then look up the discount tier for that supplier volume based on order history. Three separate authorized calls, one unauthorized data combination. The routing layer had not been designed to detect cross-system data aggregation because each individual call looked legitimate.
The fix required adding dependency tracking to the routing layer. When multiple tool calls in a session access data from different systems that should not be combined based on data governance rules, the routing layer surfaces this to a human for review. This added latency but prevented the unauthorized aggregation.
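A simplified version of that dependency tracking can be expressed as a policy check over the systems a session has touched. The system names and the forbidden pairings below are invented for illustration; in practice the rules would come from your data governance catalog.

```python
# Pairs of systems whose data must not be combined within one session.
FORBIDDEN_PAIRS = {frozenset({"financial_api", "crm"}),
                   frozenset({"supplier_db", "pricing_api"})}

def systems_violating_policy(calls: list[dict]) -> set[frozenset]:
    """Return every forbidden pairing fully covered by this session's calls."""
    touched = {c["system"] for c in calls}
    return {pair for pair in FORBIDDEN_PAIRS if pair <= touched}

# Each call below is individually authorized; the combination is not.
session_calls = [
    {"system": "inventory", "tool": "check_stock"},
    {"system": "supplier_db", "tool": "supplier_capacity"},
    {"system": "pricing_api", "tool": "discount_tier"},
]
violations = systems_violating_policy(session_calls)
# Non-empty result: route this session to human review before executing further.
```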
This pattern of illicit tool chaining is particularly dangerous because it exploits the model’s ability to reason across domains. The model is not trying to evade controls. It is trying to answer the user’s question as thoroughly as possible. But the user’s question might not have been appropriate, or the model’s reasoning about how to answer it might have combined data in ways that violate policies that the user was unaware of.
Schema drift is an operational risk that accumulates over time. Tool interfaces change. Your CRM vendor updates an API endpoint. Your ERP vendor deprecates a function you rely on. Now your function schemas are outdated and the model is sending requests that will fail at runtime. The model does not know the schema changed. It continues using the pattern that worked before.
Version your schemas systematically and monitor for breakages. Set up automated testing that validates your function schemas against actual API responses, not just against documentation that may be stale. One approach that works: maintain a shadow integration environment where your tool calling system runs against real API endpoints and alerts when responses no longer match expected schemas. This catches schema drift before it affects production users.
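The core of that shadow validation is a comparison between a live response and the schema you believe is current. The sketch below does minimal type checking against a hypothetical expected shape; a production system would more likely validate against a full JSON Schema document.

```python
# Hypothetical expected response shape for an inventory endpoint.
EXPECTED = {"sku": str, "on_hand": int}

def detect_drift(response: dict, expected: dict) -> list[str]:
    """Report fields that are missing or have changed type since last sync."""
    problems = []
    for field, typ in expected.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], typ):
            problems.append(f"wrong type for {field}: {type(response[field]).__name__}")
    return problems
```

Run against real endpoints on a schedule, a non-empty result becomes an alert: the vendor renamed or retyped a field, and your function schemas need updating before production traffic notices.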
Schema drift is particularly insidious because it does not produce errors that are visible to users. The model makes a function call. The function call fails. The model handles the failure gracefully, either by retrying or by apologizing and offering an alternative. The user might not notice. But the tool call is not succeeding, which means the model’s response is not based on the actual data it requested.
Permission confusion is a security design problem that you must resolve explicitly. The model requests a tool on behalf of a user. Does the tool execute with the user’s permissions or with the system’s permissions? This question has real security implications and you need an answer before you deploy, not after.
In most systems, the tool executes with the permissions of the context that initiated the AI session, which means you are trusting your AI session authentication to be appropriate for all possible tool calls. For high-privilege operations like deleting records or accessing sensitive data, this is usually wrong. Design your routing layer to apply the principle of least privilege: the tool should execute with the minimum permissions required for that specific operation, which often means explicitly checking the user’s permissions for each tool call rather than relying on session-level authentication.
One financial services firm we advised handled this by implementing a permission matrix that mapped each tool to the minimum permissions required to execute it. When a tool call request arrived, the routing layer checked whether the user’s role had the required permissions for that specific tool. If not, the call was rejected and logged. This added latency but provided a clear security boundary that satisfied their regulatory requirements.
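The matrix itself reduces to a set-inclusion check: each tool declares the minimal permission set it needs, and a call proceeds only if the user's permissions cover that set. Permission strings and tool names below are illustrative.

```python
# Minimal permissions each tool requires (hypothetical names).
TOOL_PERMISSIONS = {
    "get_balance": {"accounts:read"},
    "delete_record": {"records:write", "records:delete"},
}

def allowed(user_perms: set[str], tool: str) -> bool:
    """Least privilege: the user's permissions must cover the tool's set."""
    required = TOOL_PERMISSIONS.get(tool)
    if required is None:
        return False  # unregistered tool: reject and log, never fail open
    return required <= user_perms
```

Checking per call rather than per session means a user whose permissions change mid-conversation is re-evaluated on the very next tool invocation.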
Silent failures are perhaps the most dangerous failure mode because they are invisible. A tool call succeeds but returns an unexpected format. The model handles it gracefully or hallucinates an answer that fits the unexpected response. You never know the tool failed. The user gets a confident but incorrect answer and assumes the AI system knows what it is talking about.
Instrument your routing layer to detect when tool responses are ignored or contradicted by the model. If a tool returns “no records found” and the model says “I found three matching records,” that is a silent failure you need to know about. Log tool call inputs and outputs and correlate them with downstream quality issues. Build dashboards that show tool call success rates, error rates by tool, and cases where model interpretations of tool responses diverge from what the tool actually returned.
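A crude but useful detector for the "no records found" contradiction can be phrase-based, as sketched below. The phrase list and result shape are assumptions for illustration; a production check might instead ask a second model to judge whether the answer is entailed by the tool output.

```python
def contradicts_empty_result(tool_result: dict, model_answer: str) -> bool:
    """Flag answers that claim findings when the tool returned nothing."""
    empty = tool_result.get("count") == 0 or tool_result.get("records") == []
    claims_found = any(phrase in model_answer.lower()
                       for phrase in ("i found", "matching records", "here are the"))
    return empty and claims_found
```

Even a heuristic this blunt, wired into your logging pipeline, turns an invisible failure class into a measurable one.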
What Each Approach Costs
Function calling simplifies the API surface in some ways. You define schemas, you get structured outputs, you route execution. The model stays focused on language understanding and generation. For teams building on OpenAI or compatible providers, this is a well-trodden path with mature tooling and predictable behavior. The mental model is straightforward and debugging is relatively simple because the interface is explicit.
The costs appear in maintenance burden over time. Every tool change requires schema updates and potential retuning of function call quality. If your CRM vendor updates an API endpoint, you must update your schema, redeploy, and hope the model’s function call quality does not degrade. If you have 50 tools, this maintenance accumulates into a significant ongoing effort. The schema is a contract between your application and the model, and like all contracts, it needs governance.
Schema versioning becomes critical in production. When you change a function’s parameters, old requests might still arrive as the model continues using the old schema until context refreshes. Handle this by supporting backward compatibility in your routing layer or by explicitly invalidating tool definitions when schemas change. A tool call that was valid last week might use parameters that no longer exist this week. Your system should handle this gracefully rather than failing silently.
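One lightweight form of that backward compatibility is a parameter migration table in the routing layer: requests written against the old schema are mapped onto the current one instead of being rejected. The renamed parameter below is hypothetical.

```python
# Deprecated parameter names mapped to their current equivalents
# (hypothetical example: v1 "account_no" became v2 "account_id").
PARAM_MIGRATIONS = {"account_no": "account_id"}

def migrate_args(args: dict) -> dict:
    """Rewrite arguments emitted against an old schema to the current one."""
    return {PARAM_MIGRATIONS.get(key, key): value for key, value in args.items()}
```

This buys time for in-flight conversations to refresh their tool definitions without failing user requests in the interim.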
Security boundaries must be explicit and designed into the routing layer from the start. The model is making a decision to invoke external code, so you need to validate those decisions carefully at runtime. A model that recommends deleting a customer record is not the same as a model that deletes it, but the gap between recommendation and action is exactly where your security controls must live. Do not assume the model’s function calling behavior is inherently safe just because it has been reliable in testing.
MCP addresses the discovery problem but introduces new operational complexity. Now you need a managed tool registry, versioning for tool schemas, and a mechanism for the model to learn what tools exist without you explicitly enumerating them. That registry itself becomes a system you must operate and keep accurate. When a new tool is added to your organization, it must be registered before the model can discover it. When a tool’s interface changes, the registry must be updated. The dynamic discovery model trades explicit control for flexibility, and the flexibility has an operational cost that you must budget for.
When Tool Calling Works Well
Tool calling delivers the most value when you have well-defined, stable interfaces to external systems. Customer records, product catalogs, inventory systems, scheduling tools: these have schemas that do not change frequently and clear inputs and outputs. The model can learn to call them reliably because the patterns are consistent and predictable.
A financial services firm we advised used function calling to let their AI assistant look up account balances, transaction history, and product eligibility. These were stable, well-documented APIs with consistent response formats. The assistant could handle queries like “what is the balance on account 1234” or “show me all transactions over $500 this month” with high reliability. The function schemas did not change more than once or twice a year, and when changes did occur, they were coordinated with the AI team so schemas could be updated proactively.
The stability of the underlying APIs was a prerequisite for the reliability of the function calling. The AI team worked with the API team to ensure that API changes were communicated in advance and that schema updates could be tested before deployment. This coordination is essential for production systems. When API teams treat the AI integration as an afterthought, schema drift accumulates and tool calling reliability suffers.
Tool calling struggles when interfaces are messy, when the same question can be answered by multiple tools with different confidence levels, or when the model needs to combine imprecise information from several sources. A customer service bot that tries to handle policy questions by calling knowledge base retrieval, manual procedure lookup, and contextual memory all at once often produces inconsistent answers because each source has different freshness and authority. In those cases, tool calling adds latency without adding reliability. The model might get different answers from each tool and have to decide which to trust, introducing another layer of uncertainty into the response.
Decision Rules
Use function calling when your tools have stable, well-documented interfaces that do not change frequently; when you need structured, predictable outputs that your application can process reliably; when security and auditing of tool usage matter and you can enforce authorization at the routing layer; and when you are building on OpenAI or a provider with mature function calling support. The maturity of tooling matters. OpenAI’s function calling has been production-tested at scale. Newer providers may have equivalent documentation but less production track record.
Do not use it when your external systems have unstable or frequently changing APIs; when you cannot invest in building a robust routing layer with authorization, validation, and logging; when your tool use cases involve sensitive data requiring access controls more complex than the routing layer can enforce; or when you expect to chain many tools together in ways that create cross-system data governance problems.
Consider MCP when you have many tools across many systems and enumeration is becoming a maintenance burden; when tool interfaces change frequently and you want the model to discover changes dynamically; when you want the model to reason about available capabilities without upfront enumeration; and when you are willing to manage the additional complexity of a tool registry as a production system. The registry is not free. It requires its own maintenance, monitoring, and governance.
The underlying principle: tool calling is infrastructure. It creates coupling between your model and your systems, and that coupling has maintenance costs that accumulate over time. Every tool call is a dependency. Dependencies require testing, versioning, and monitoring. Make sure the value justifies the integration effort before you commit, and design your routing layer as a first-class component with the same engineering rigor you would apply to any other critical system.
The model deciding to call a tool is not the hard part. The hard part is everything that happens around that decision: authorization, validation, error handling, logging, and the operational monitoring that tells you whether the tool calls are working as intended. Invest in the routing layer proportionally to how consequential the tool calls are. A system that only reads public data needs less rigor than a system that can modify customer records.
Design your routing layer with the assumption that the model will eventually make a mistake. It might call the wrong function. It might call a function with invalid arguments. It might call a function that has been superseded. Your routing layer should handle these cases gracefully, log them for debugging, and provide the model with error information it can reason about and recover from.