Skip to main content
MindStudio
Pricing
Blog About
My Workspace

Claude Sonnet 5 Token Efficiency Problem: Why It Can Cost More Than Opus 4.8

Claude Sonnet 5 uses 30% more tokens than previous models. Learn why this happens and how to manage costs in agentic AI workflows.

MindStudio Team RSS
Claude Sonnet 5 Token Efficiency Problem: Why It Can Cost More Than Opus 4.8

The Hidden Cost Trap in Claude’s Newest Models

If you’re running AI agents with Claude, you’ve probably made the same assumption most teams do: use Claude Sonnet 5 to save money, escalate to Claude Opus when you need more power. It’s a reasonable approach — Sonnet is priced at a fraction of Opus. But that math breaks down fast once you look at actual token consumption in real workflows.

Claude Sonnet 5 uses approximately 30% more tokens than earlier Claude models for equivalent tasks. In a single short conversation, that’s barely noticeable. But in agentic workflows with multiple steps, tool calls, and reasoning chains, that verbosity compounds. The result: Sonnet 5 jobs that cost more than equivalent Opus runs.

This article explains exactly why this happens, shows you the math, and walks through practical ways to keep costs under control without sacrificing capability.


Why Newer Claude Models Generate More Tokens

This isn’t a bug. It’s largely intentional — newer Claude models are trained to reason more carefully, produce more thorough responses, and narrate their logic more explicitly. That’s part of what makes them better at complex tasks.

But “better reasoning” often means “more words.”

Increased Output Verbosity

Claude Sonnet 5 tends to produce longer, more detailed responses than its predecessors. Where Sonnet 3.5 might answer a structured prompt in 200 tokens, Sonnet 5 might use 280–300. The quality of the answer is often better, but the token count is reliably higher.

This isn’t random. Anthropic’s training methodology for newer models prioritizes completeness and nuance. The model is less likely to cut corners or produce terse output that leaves the user to fill in gaps. That’s valuable in customer-facing applications. In batch processing or automated pipelines, it can silently inflate costs.

Extended Thinking Tokens

Claude Sonnet 5 and certain Opus models support extended thinking — where the model works through a problem internally before producing a final response. These thinking tokens are billed separately from your regular input/output, and they don’t show up in the visible response.

Extended thinking can add thousands of tokens to a single call. If you’re not monitoring it, you may not even realize it’s happening. A task that looks like it consumed 400 output tokens may have generated 3,000+ thinking tokens underneath.

System Prompt and Context Expansion

Newer models often require more detailed system prompts to behave predictably. Teams upgrading from older Claude versions frequently discover they need to add more explicit instructions — which means more input tokens per call, before the model has even started working.


The Agentic Workflow Multiplier

Single API calls are easy to reason about. Agentic workflows are where token costs become genuinely hard to predict.

In a multi-step agent, every node in the workflow is a separate call. Each call has its own input context — and that context often includes the accumulated output from previous steps. Here’s what that looks like in practice:

A 5-step research-and-report agent:

  1. Step 1: Fetch sources (500 tokens in, 300 out)
  2. Step 2: Summarize sources — input includes step 1 output (800 tokens in, 400 out)
  3. Step 3: Extract key claims — input includes steps 1–2 (1,200 tokens in, 300 out)
  4. Step 4: Draft section — input includes prior steps (1,500 tokens in, 600 out)
  5. Step 5: Final review — input includes full draft (2,100 tokens in, 200 out)

In this simplified example, you’ve made 5 calls but consumed over 6,000 input tokens and 1,800 output tokens — and that’s before you account for extended thinking, retry logic, or tool call results being appended to context.

If each step generates 30% more output than it would have with an older model, that excess text gets passed into every subsequent step. The verbosity doesn’t just add tokens locally — it amplifies downstream.

Tool Call Overhead

Modern Claude agents use tool calling extensively — web search, code execution, database queries, file reads. Each tool call returns a result that gets appended to the conversation context. If the model is verbose in how it invokes tools or interprets results, the token overhead from tool use alone can be substantial.


The Math: When Sonnet 5 Costs More Than Opus

Let’s use approximate current pricing to show how this plays out. As of mid-2025, Claude pricing tiers look roughly like:

ModelInput (per million tokens)Output (per million tokens)
Claude Sonnet 5~$3~$15
Claude Opus 4.x~$15~$75

On a per-token basis, Opus costs 5x more than Sonnet. So Sonnet would need to generate 5x the tokens to reach the same cost as Opus for equivalent output.

That sounds like an enormous buffer. And for simple, single-turn tasks, it is. But consider an agentic workflow:

Scenario: A 10-step automation agent running 500 times per month.

  • With Opus: Each job consumes 2,000 input tokens, 800 output tokens. Monthly cost: ~$31.50
  • With Sonnet 5 (no extended thinking): Each job consumes 2,600 input tokens, 1,040 output tokens (30% more). Monthly cost: ~$11.82

Remy doesn't write the code. It manages the agents who do.

R
Remy
Product Manager Agent
Leading
Design
Engineer
QA
Deploy

Remy runs the project. The specialists do the work. You work with the PM, not the implementers.

Sonnet still wins here. But add extended thinking:

  • With Sonnet 5 + extended thinking: Each step triggers 1,500 thinking tokens. At 10 steps per job, that’s 15,000 additional tokens per run. Monthly cost: ~$34.32

Now Sonnet 5 costs more per job than Opus — without any additional capability benefit, and without the team even being aware that extended thinking is on.

This isn’t a contrived edge case. Teams running complex agents with extended thinking enabled by default routinely discover this pattern when they first look closely at their billing.


What’s Driving Extended Thinking Costs

Extended thinking is enabled either by default in some integrations or configured explicitly. The problem is that it often doesn’t advertise itself clearly in cost dashboards — thinking tokens show up under a different category, and many platforms aggregate costs in ways that obscure the breakdown.

When Extended Thinking Actually Helps

Extended thinking genuinely improves performance on:

  • Complex multi-step reasoning tasks
  • Tasks with ambiguous constraints or competing priorities
  • Code generation where correctness matters more than speed
  • Problems that require working through several hypotheses

When It’s Just Burning Tokens

For straightforward tasks — classification, formatting, extraction, summarization — extended thinking rarely changes the output quality meaningfully. Leaving it enabled on these tasks adds cost without benefit.

The fix is simple in principle: disable extended thinking for tasks that don’t need it, and set a token budget for tasks that do. But you need observability to know which is which.


Strategies to Control Claude Sonnet 5 Token Costs

Audit Token Usage Before Optimizing

Before adjusting anything, measure what’s actually happening. Log input tokens, output tokens, and thinking tokens separately for each workflow step. Many teams are surprised to find that 60–70% of their token spend comes from 2–3 specific workflow nodes.

Once you know where the cost is concentrated, you can target fixes precisely.

Write Tighter System Prompts

Verbose system prompts compound across every call in a workflow. A 500-token system prompt passed into 10 workflow steps costs 5,000 input tokens before any task content is processed.

Review your system prompts for:

  • Redundant instructions that say the same thing multiple ways
  • Defensive language added “just in case”
  • Examples that could be shortened or removed
  • Role descriptions longer than they need to be

Cutting your system prompt by 30% can reduce input costs meaningfully across a high-volume workflow.

Control Output Length Explicitly

Claude responds well to direct length constraints. Instructions like “Respond in no more than 150 words” or “Return only the JSON object, no explanation” are effective and consistent.

Don’t rely on the model’s default judgment about appropriate response length, especially for structured tasks. Tell it exactly what format and length you expect.

Use Context Pruning in Long Workflows

In multi-step agents, don’t pass the full output of every previous step into every subsequent step. Pass only what the next step actually needs.

In 60 minutes, you'll know Hermes
The free Hermes Agent crash courseReserve your spot

If step 3 needs a structured summary from step 2, extract that summary before passing it forward — don’t pass the full step 2 response. This requires more deliberate workflow design, but it keeps context windows lean.

Match the Model to the Task

Not every step in a workflow needs Sonnet 5. A routing step that classifies user intent into three categories can use a smaller, faster model. Sonnet 5 is worth its cost on tasks that require nuanced reasoning. It’s overkill for deterministic pattern matching.

Routing different workflow steps to different models based on task complexity is one of the most effective cost optimizations available. The challenge is having a platform that makes this easy.


How MindStudio Helps You Manage Model Costs

This is exactly the kind of problem that gets painful to manage manually. When you’re running multiple agents across different workflows, tracking token usage at the step level — and actually doing something about it — requires infrastructure that most teams don’t build themselves.

MindStudio gives you access to 200+ AI models in a single no-code environment, which matters here for a specific reason: you can swap models between workflow steps without rewriting anything. If you’ve identified that your routing node is burning tokens on Sonnet 5 when it could use a faster, cheaper model, you change one dropdown.

For extended thinking specifically, MindStudio exposes model configuration at the step level. You can enable it for the reasoning-heavy steps and leave it off for extraction and formatting tasks — without coordinating multiple API integrations or managing separate model clients.

The platform also makes it straightforward to build agentic workflows with controlled context passing between steps. Rather than letting context balloon unchecked across a 10-step agent, you can explicitly define what data moves forward at each handoff.

If you’re running automation at volume and Claude costs are a concern, MindStudio’s model-routing capabilities let you optimize without rebuilding your agent from scratch every time pricing or performance characteristics shift.

You can try it free at mindstudio.ai.


FAQ

Why does Claude Sonnet 5 cost more than Opus in some cases?

Claude Sonnet 5 has a lower per-token price than Opus, but it generates significantly more tokens for comparable tasks — roughly 30% more in typical workflows. In agentic workflows where verbosity compounds across multiple steps, and especially when extended thinking is enabled, total token consumption can push Sonnet 5’s actual cost above Opus. The per-token price doesn’t tell the full story; total tokens consumed does.

What are extended thinking tokens and how do they affect cost?

Extended thinking is a feature where Claude works through a problem internally before producing a final response. These reasoning steps consume tokens that are billed separately — they’re not part of the visible response but they do appear on your bill. For complex tasks, extended thinking can add thousands of tokens per call. If you have it enabled by default on simple tasks, it adds cost without any quality benefit.

How do I find out how many tokens my Claude agents are actually using?

Get set up on Hermes in 1 hour
The free Hermes Agent crash courseReserve your spot

The most reliable approach is to log token usage at the API response level, breaking out input tokens, output tokens, and (if using extended thinking) thinking tokens separately. Many AI platforms provide aggregate token counts that obscure where cost is actually concentrated. Step-level logging lets you identify which specific workflow nodes are the cost drivers.

Is it always better to use a cheaper model like Sonnet over Opus?

Not automatically. For complex reasoning tasks — multi-step analysis, nuanced judgment calls, code generation where correctness matters — Opus often produces better outputs that require fewer retries and downstream corrections. The total cost of using a cheaper model that fails more often can exceed the cost of using Opus correctly the first time. The right comparison is cost per successful outcome, not cost per token.

How can I reduce token usage without downgrading my model?

Several techniques help: write concise system prompts and audit them for redundancy, add explicit output length constraints in your instructions, prune context between workflow steps so you’re only passing forward what the next step needs, disable extended thinking on tasks that don’t require deep reasoning, and use structured output formats (JSON, CSV) that contain less natural-language filler than prose responses.

Does prompt caching help with these costs?

Yes, significantly. Anthropic supports prompt caching for repeated content like system prompts and long documents. If you’re making many calls with the same system prompt, enabling caching can reduce the effective input token cost substantially. Claude charges a lower rate for cache read tokens versus fresh input tokens. This doesn’t fix output verbosity, but it addresses the input side of the equation, especially in workflows where system prompts are long.


Key Takeaways

  • Claude Sonnet 5 generates roughly 30% more tokens than earlier models on equivalent tasks — its lower per-token price doesn’t automatically make it cheaper in practice
  • Extended thinking tokens are billed separately and often hidden in aggregate cost summaries; they’re a major source of unexpected spend
  • In agentic workflows, verbosity compounds — excess output from each step inflates the input context for every subsequent step
  • The breakeven point where Sonnet 5 costs more than Opus depends on workflow length, extended thinking usage, and how tightly context is managed
  • Effective cost optimization requires step-level token logging, explicit output constraints, targeted model routing, and deliberate context pruning
  • Platforms like MindStudio that support per-step model configuration make it practical to route the right model to the right task without rebuilding your agent infrastructure

The general lesson: don’t assume a model’s tier determines your cost. Measure actual token consumption per workflow, find your cost-concentration points, and optimize there — that’s where the real savings are.

Related Articles

What Is the Anthropic Advisor Strategy? How to Cut AI Agent Costs Without Sacrificing Quality

The Anthropic Advisor Strategy uses Opus as an expert adviser and Haiku or Sonnet as executors, reducing costs by 12% while improving performance on hard tasks.

Claude Optimization Automation

Claude Sonnet 5 vs Opus 4.8: Which Model Should You Use for Agentic Work?

Claude Sonnet 5 is cheaper but uses more tokens than Opus 4.8. Here's how to choose the right model for your agentic workflows and budget.

Claude LLMs & Models Comparisons

Confidence-Scheduled Verification: How DeepSpark Cuts Wasted GPU Compute in AI Agents

DeepSpark's confidence-scheduled verifier skips low-probability tokens under load, saving GPU resources and speeding up production AI agent inference.

LLMs & Models Automation Optimization

What Is DeepSpark? DeepSeek's Speculative Decoding Method That Makes Every LLM Faster

DeepSpark is DeepSeek's open-source speculative decoding system delivering 50–400% faster inference without retraining. Here's how it works.

LLMs & Models Automation AI Concepts

AI Agent Token Budget Management: How Claude Code Prevents Runaway API Costs

Claude Code enforces hard token limits, compaction thresholds, and pre-execution budget checks. Here's how to implement the same pattern in your own agents.

Claude Multi-Agent Optimization

Anthropic's Harness Detection Bug: 3 Things That Triggered Unexpected Claude Code Charges

A git commit mentioning 'hermes.md' triggered a $200.98 overage on a plan showing 86% unused. Here's exactly what caused it and how Anthropic responded.

Claude Security & Compliance Optimization

Presented by MindStudio

No spam. Unsubscribe anytime.