Early Indicators of Agent Use Cases: What Anthropic's Data Shows

Despite all the talk about agents, where is real adoption happening? One place to look is the frontier model APIs that agents call in their loops. The main models behind those APIs come from Anthropic (Claude), OpenAI (GPT), and Google (Gemini).
Anthropic published a study this week analyzing the tool calls flowing through their API: programmatic actions by agents, not chat sessions where all intent is human. The full research covers a lot more ground: autonomy trends, how users shift their oversight strategy over time, agent-initiated clarification patterns. It’s data from a single provider, and the full context behind each action isn’t always known. But despite those limitations, it hints at where agent adoption is actually happening. I’m zooming in on the domain distribution angle here.
Software engineering accounts for nearly 50% of all agent tool calls. No surprise: developers are closest to the technology, and coding has the clearest validation loops. If specs and tests are crafted well, you run the code and it works or it doesn’t. As a sidenote, if you want to see this taken to the extreme, look at the Ralph loop: a coding agent running in a continuous cycle until all tests pass. Geoff Huntley’s technique went viral in late 2025 after people started shipping entire repos overnight with it, and Anthropic released an official plugin for it. It works precisely because code has a binary validation signal that other domains don’t have.
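For a feel of the mechanics, here is a minimal sketch of a Ralph-style loop. It is not Huntley’s actual script or Anthropic’s plugin: the `coding-agent` command, the prompt file, and the pytest invocation are stand-ins for whatever a real setup would use.

```python
# Minimal sketch of a Ralph-style loop: keep invoking a coding agent and
# rerunning the test suite until it goes green, with a hard iteration cap.
# "coding-agent" and PROMPT.md are placeholders, not real tools.
import subprocess

MAX_ITERATIONS = 50  # hard stop so the loop can't run forever


def tests_pass() -> bool:
    """The binary validation signal: exit code 0 means the suite is green."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0


for i in range(MAX_ITERATIONS):
    if tests_pass():
        print(f"All tests green after {i} agent runs.")
        break
    # Hand the same standing prompt to the agent on every pass; it sees the
    # current repo state plus the failing tests and commits its changes.
    subprocess.run(["coding-agent", "--prompt-file", "PROMPT.md"], check=False)
else:
    print("Iteration cap reached without a green suite; human review needed.")
```

The point of the sketch is the shape of the loop: the only thing steering it is the test suite, which is exactly why the pattern doesn’t transfer to domains without that binary signal.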
But the data also shows other domains starting to appear: business intelligence, customer service, sales, finance, e-commerce, and healthcare. None accounts for more than a few percentage points yet.
Balancing Risk and Autonomy
Unlike coding, where the agent’s output ideally goes through layers of review before reaching the real world, these domains involve direct interaction with people, systems, and decisions.
Anthropic plotted every cluster of agent actions on two axes: how autonomous the agent was, and how risky the action. The upper-right quadrant, high autonomy combined with high risk, is “sparsely populated but not empty.” Examples already appearing: patient medical records retrieval, reactive chemical handling, fire emergency response, cryptocurrency trading, production deployments. Whether these clusters reflect production use or evaluations, they point to where adoption is heading.
This is exciting. Agents moving beyond coding into domains where they can genuinely help people is exactly the promise. But the dynamics are different. In coding, a bad output ideally gets caught before it reaches anyone. In these domains, the agent’s action is the output: a customer email sent, a financial recommendation made, a medical record accessed, a sales commitment given. As teams get comfortable with agent autonomy in coding, they’ll bring those expectations to other domains. The verification infrastructure that makes coding forgiving doesn’t transfer automatically.
Accountability and Control Need to Catch Up
The Potential is clearly being explored. But the next question is whether Accountability and Control are ready for domains where mistakes aren’t reversible.
Their “Looking ahead” section is worth highlighting here, because it shows the gap between where agents are and where control and accountability need to be:
- Invest in post-deployment monitoring. Pre-deployment evaluations test what agents are capable of in controlled settings, but many findings “cannot be observed through pre-deployment testing alone.”
- Train models to recognize their own uncertainty. Agent-initiated stops are “an important kind of oversight in deployed systems” that complement external safeguards (a sketch of the pattern follows this list).
- Design for effective oversight, not prescriptive mandates. Requiring human approval for every action creates “friction without necessarily producing safety benefits.”
- It’s too early to mandate specific interaction patterns. The focus should be on “whether humans are in a position to effectively monitor and intervene, rather than on requiring particular forms of involvement.”
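To make the agent-initiated stop concrete, here is a minimal sketch. The confidence floor, the `Step` fields, and the escalation callback are assumptions for illustration, not mechanisms described in the report.

```python
# Illustrative sketch of an agent-initiated stop: the agent halts itself and
# hands control to a human when it is unsure or the action is tagged high-risk.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Step:
    action: str        # e.g. "draft_reply" or "update_record"
    confidence: float  # the model's self-reported confidence, 0.0 to 1.0
    high_risk: bool    # tagged by a separate policy layer, not by the model


CONFIDENCE_FLOOR = 0.7  # below this, the agent stops itself


def run_with_oversight(steps: Iterable[Step], escalate: Callable[[Step], bool]) -> None:
    """Execute steps, but pause for a human whenever the agent is unsure of
    itself or the action is high-risk. `escalate` stands in for whatever
    approval channel the deployment actually uses."""
    for step in steps:
        if step.confidence < CONFIDENCE_FLOOR or step.high_risk:
            if not escalate(step):
                print(f"Stopped before '{step.action}'; waiting on a human.")
                return
        print(f"Executing {step.action}")  # placeholder for the real tool call


# A deployment would plug in its real escalation channel here: a ticket,
# a Slack ping, an approval UI. Denying approval simply halts the run.
run_with_oversight(
    [Step("draft_reply", 0.95, False), Step("send_reply", 0.55, True)],
    escalate=lambda step: False,
)
```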
Oversight should be treated as infrastructure, not process.
As I explored in the trust inversion, agents need to be restricted to what they’re allowed to do, not just blocked from what they shouldn’t. That means things like identity, fine-grained authorization, bounded delegation across hops, and auditable decision trails. It starts with something fundamental: knowing when an agent acted, what it decided, and who’s accountable for the output.
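As a rough sketch of what that could look like in code (the scope names, identity fields, and log format are invented for the example): each agent carries its own identity and an explicit allow-list, delegation can only narrow that list across hops, and every decision is written to an audit record.

```python
# Hedged sketch of allow-list authorization with an audit trail. The scope
# names, AgentIdentity fields, and log format are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str                    # distinct from the human who delegated
    delegated_by: str                # who is accountable for the output
    allowed_actions: frozenset[str]  # explicit allow-list, not a deny-list


def delegate(parent: AgentIdentity, child_id: str, requested: set[str]) -> AgentIdentity:
    """Bounded delegation: a sub-agent receives only a subset of the parent's
    allow-list, so scope narrows across hops and never widens."""
    return AgentIdentity(
        agent_id=child_id,
        delegated_by=parent.agent_id,
        allowed_actions=parent.allowed_actions & frozenset(requested),
    )


audit_log: list[dict] = []


def perform(identity: AgentIdentity, action: str, detail: str) -> bool:
    """Allow the action only if it is explicitly permitted, and record who
    acted, what was decided, and when -- whatever the outcome."""
    allowed = action in identity.allowed_actions
    audit_log.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "agent": identity.agent_id,
        "accountable": identity.delegated_by,
        "action": action,
        "detail": detail,
        "allowed": allowed,
    })
    return allowed


# Usage: a sub-agent inherits only the scopes it needs; anything outside
# its allow-list is denied and still ends up in the audit trail.
root = AgentIdentity("support-agent", "alice@example.com",
                     frozenset({"read_ticket", "draft_reply", "send_reply"}))
drafter = delegate(root, "draft-subagent", {"read_ticket", "draft_reply"})
perform(drafter, "send_reply", "customer ticket")  # denied and logged
```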
Even Coding Isn’t a Walk in the Park
And here’s the thing: even the domain with the most verification infrastructure isn’t as safe as the data suggests.
Many codebases carry significant technical debt: code that was never designed to be easily verified, with minimal automated checks. There, a bad change gets caught by a customer, not by a test.
Even where safeguards exist, the complacency pattern is real: after twenty correct outputs, who reviews the twenty-first carefully? And in a team, code committed under a developer’s account looks the same whether the human or the agent wrote it. If something breaks three months later, who understood the decision?
If the best-equipped domain is already stretching its safety net, the rest can't afford to start without one.
That’s why an actionable framework that brings in all three perspectives (Potential, Accountability, and Control) matters.
I work on the trust infrastructure that makes agents viable beyond their current comfort zone: protocol explainers, a framework, and hands-on consulting.