Chris Clark, Co-Founder and COO of Open Router, sits at one of the most distinctive vantage points in AI: running the world’s largest AI gateway, processing over a trillion tokens per day across 70+ providers. Here’s what he’s actually seeing in the data.
If you want to know whether AI agents are really in production or just hype, stop reading blog posts and start looking at tool call rates.
Open Router processes trillions of tokens per week across hundreds of models, dozens of clouds, and every geography. They see the actual usage — not what companies say they’re doing, but what they’re actually doing at scale.
The data tells a clear story.
Tool Call Rates Went From <5% to >25% in 12 Months
Here’s the single most important chart the Open Router team has published.
When you look at Anthropic API calls routed through Open Router — which Chris argues is broadly representative of global usage trends — the percentage of requests that ended with the model requesting a tool call went from sub-5% to well north of 25% in about a year.
That’s not tinkering. That’s production.
To understand why this matters, you need to understand what tool calls actually are. An LLM can only do one thing: take text in, output text. That’s it. It can’t call Stripe. It can’t query your database. It can’t browse the web. What it can do is output structured text that says “please call this function for me” — and then a software harness executes that call, passes the result back, and the loop continues.
An agent is just that loop running repeatedly. LLM → tool call request → software executes → result fed back to LLM → repeat.
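In code, that loop is only a handful of lines. Here’s a minimal sketch against the OpenAI-compatible chat completions API that gateways like Open Router expose; the model slug and the get_order_status tool are illustrative assumptions, not anything specific to Open Router.

```python
import json
from openai import OpenAI

# Any OpenAI-compatible gateway works here; Open Router's base URL is shown.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def get_order_status(order_id: str) -> str:
    # Stand-in for a real system call (database query, Stripe lookup, etc.).
    return json.dumps({"order_id": order_id, "status": "shipped"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

messages = [{"role": "user", "content": "Where is order 4217?"}]

while True:
    resp = client.chat.completions.create(
        model="anthropic/claude-sonnet-4",  # hypothetical slug; use whatever the gateway lists
        tools=tools,
        messages=messages,
    )
    msg = resp.choices[0].message
    messages.append(msg)

    if not msg.tool_calls:
        # No tool call requested: the model produced its final answer, so the loop ends.
        print(msg.content)
        break

    for call in msg.tool_calls:
        # The model only *asked* for the call; the harness executes it
        # and feeds the result back as a "tool" message.
        args = json.loads(call.function.arguments)
        result = get_order_status(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

Every pass through that while loop is one more request hitting the gateway, which is exactly what the tool call rate measures.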
When you see tool call rates exploding, you’re seeing agents going from experimental to operational at companies all over the world.
For some agent-specialized models like Minimax M2, tool call rates are running at 80%+. That’s a model built to do almost nothing but run in agentic loops.
The July Inflection Point
The quantitative data tells part of the story. The qualitative signal is equally striking.
Around July 2024, Open Router’s sales and BD team noticed something: customers started asking about SLAs.
Not features. Not pricing. SLAs.
In March, nobody cared. They were signing commits, doing deals — the usual. Then in July, companies started asking: “What’s your uptime? How should we think about failover? How do we reason about your uptime relative to the model’s uptime?”
That question only matters if the agent going down has real consequences. Nobody negotiates SLAs for a prototype.
The shift was stark enough that the team noticed it without looking for it. Companies had moved their agents from “we’re testing this” to “if this goes down, we have a problem.”
Reasoning Tokens Went From Zero to 50% of Output in 12 Months
It’s easy to forget: there were no mainstream reasoning models 13 months ago. OpenAI’s o3 launched in early 2025; it was expensive and slow, but it was the first real entrant. Before that, the category didn’t exist.
Today, 50% of the output tokens Open Router sees are internal reasoning tokens from models — the chain-of-thought happening inside the model before it produces its actual answer.
That’s an extraordinary adoption curve for a capability that literally didn’t exist 12 months prior.
How Production Agents Are Actually Built
Based on what Open Router sees at scale, here’s the emerging architecture for production agents:
Frontier reasoning models for planning and judgment. The best models from major labs — Claude Sonnet, GPT-4o, Gemini — handle the hard thinking. Where do we go? What’s the strategy? What’s the right decision given these constraints?
Smaller, specialized open-weight models for tool execution. Once the plan exists, companies are increasingly routing tool calls to smaller, faster, cheaper models — particularly Chinese open-weight models like the Qwen family — that aren’t as smart in the general sense but are extremely accurate at structured tool use.
The pattern: frontier model plans, smaller model executes. As companies get comfortable with a use case, they “downclutch” — keeping the frontier model for judgment, using open-weight models to drive cost down on the high-frequency execution steps.
This is why Chinese open-weight models have taken surprising market share. They show up disproportionately in the agentic flows run by U.S. firms.
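A rough sketch of what that planner/executor split can look like in practice; the model slugs, helper functions, and prompt are assumptions for illustration, not a prescribed Open Router pattern.

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

# Hypothetical slugs: a frontier model for judgment, a cheaper open-weight
# model for high-frequency structured tool use.
PLANNER = "anthropic/claude-sonnet-4"
EXECUTOR = "qwen/qwen-2.5-72b-instruct"

def plan(task: str) -> str:
    """Frontier model decides the strategy: where to go, what to do, in what order."""
    resp = client.chat.completions.create(
        model=PLANNER,
        messages=[{
            "role": "user",
            "content": f"Break this task into a short numbered list of tool-call steps:\n{task}",
        }],
    )
    return resp.choices[0].message.content

def execute_step(step: str, tools: list[dict]):
    """Cheaper model handles each step; it only needs to be accurate at tool calling."""
    return client.chat.completions.create(
        model=EXECUTOR,
        tools=tools,
        messages=[{"role": "user", "content": step}],
    )
```

The "downclutch" is then just moving individual steps from PLANNER to EXECUTOR as confidence in the use case grows.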
The Inference Quality Problem Nobody Talks About
Here’s something counterintuitive that the Open Router data reveals: the same model, hosted by different providers, can perform meaningfully differently.
They ran GPT-OSS-120B against the GPQA benchmark across multiple major cloud providers, all hosting the same model weights. Performance varied, sometimes significantly, across providers.
More interesting: the same model hosted by different clouds chose to use tools at different rates. Exact same model, exact same weights — but depending on how the inference stack is implemented, the model might call a tool 60% of the time on one provider and 40% on another.
That’s not a flaw in the model. It’s a property of the inference environment, and it’s real at scale.
Open Router responded by building “exacto endpoints” — routing pools that only send agentic workloads to providers they’ve benchmarked as accurate for tool calling specifically. If 15 clouds support a given model, maybe 5 of them are consistently good at tool use. They route agent traffic only to those 5.
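In practice, pinning agent traffic to vetted hosts looks something like the sketch below, using Open Router’s provider-preference request options; the provider names are placeholders, and the exact fields may differ on other gateways.

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="qwen/qwen-2.5-72b-instruct",  # hypothetical slug
    messages=[{"role": "user", "content": "Fetch the latest invoice for account 884."}],
    extra_body={
        "provider": {
            # Only the hosts you've benchmarked as accurate at tool calling;
            # don't silently fall back to an arbitrary provider.
            "order": ["provider-a", "provider-b"],
            "allow_fallbacks": False,
        }
    },
)
```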
This matters enormously for anyone building production agents. Debugging “why does my agent fail 5% of the time” and not knowing whether it’s the prompt, the model, or the inference provider is a nightmare. The infrastructure layer is not neutral.
The #1 Mistake Founders Make Building Agents
Chris’s answer: locking yourself into a single provider. The fix is to build for optionality from day one.
The model landscape is moving so fast that anyone who tells you they know what the right model for their use case will be in 12 months is lying to themselves. New frontier models drop every few months. A model that was the right call for your use case in Q1 might be eclipsed by something 3x better in Q3.
The failure mode he sees repeatedly: a company builds an agent tightly coupled to a specific provider’s APIs. Then a better model releases on a different cloud, or they need failover, or they want to experiment with open-weight models for cost reduction — and they realize they’ve painted themselves into a corner.
His advice for starting from scratch: use a gateway like Open Router, pick the best frontier model available so that model quality is never the reason you fail, and then iterate from there. Once it’s working, optimize cost by routing subsets of the workload to cheaper models. But don’t let infrastructure constraints become a false negative on whether your use case is actually solvable.
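A minimal sketch of what building for optionality can mean in code, assuming hypothetical model slugs and a simple client-side failover loop; the structure matters more than the specific models.

```python
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

# Best available frontier model first, so model quality is never the reason
# the use case fails; cheaper candidates get added once the workload is understood.
MODEL_PREFERENCE = [
    "anthropic/claude-sonnet-4",   # hypothetical slugs
    "openai/gpt-4o",
    "qwen/qwen-2.5-72b-instruct",
]

def complete_with_failover(messages: list[dict], **kwargs):
    """Try models in order; swapping or reordering them is a config change, not a rewrite."""
    last_error = None
    for model in MODEL_PREFERENCE:
        try:
            return client.chat.completions.create(model=model, messages=messages, **kwargs)
        except Exception as err:  # provider outage, rate limit, model deprecation
            last_error = err
    raise last_error
```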
The Enterprise Bottleneck: Data Policy
For enterprises, the question isn’t whether agents work. It’s where the data goes.
When you’re running sensitive enterprise data through inference providers, questions like “where are their GPUs located?”, “do they own the hardware or is it leased across multiple data centers?”, and “where does decryption actually happen?” start to matter a lot. These aren’t paranoid questions — they’re reasonable ones for any company running sensitive data through external AI infrastructure.
The companies that crack this — provable data residency, clear audit trails, controlled inference environments — are going to win the enterprise segment. The technology problem is mostly solved. The governance and compliance layer is where the friction lives.
The bottom line from a trillion tokens a day: agents are in production, tool call rates are accelerating, and the companies that figure out multi-model orchestration and inference reliability are the ones that will have durable cost and performance advantages over the ones that don’t.
The data is in. The question is whether you’re building accordingly.
