Model routing: why choosing Haiku, Sonnet, or Opus matters more than your prompt

Most teams that start with AI in production make one decision early: they pick one model and send everything there. It is understandable. It is simple to implement. And it is the direct path to high API bills and unnecessary latency.

The problem is not that Claude Opus is too expensive for everything. The problem is that Claude Haiku is too cheap not to use for simple tasks. And the line between "simple" and "complex" is exactly what most teams have not drawn.

This article is about drawing that line.

What it costs to route incorrectly

Start with numbers. Anthropic's public pricing and model documentation (as of April 2026) show roughly this pattern [1], [2]:

Claude Haiku: Cheapest. Fastest. Best for classification, simple extractions, formatting, routing decisions, and anything that does not require deep reasoning.

Claude Sonnet: Middle tier. Balances capability and cost. Good for most reasoning tasks, code explanations, context-aware summaries, and user-facing responses.

Claude Opus: Most expensive. Most capable. Reserved for tasks requiring multi-step reasoning, complex strategy analysis, advanced code generation, and nuanced interpretation of ambiguous input.

The price difference between Haiku and Opus is typically 15-20x per token. For the same number of input and output tokens, one Opus call on a task Haiku would handle correctly therefore costs as much as 15-20 Haiku calls. In a system with hundreds of daily AI calls, that is the difference between a manageable API bill and one that surprises the CFO.
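To make the 15-20x ratio concrete, here is a minimal TypeScript sketch of per-call and daily cost. The prices in it are placeholders picked only to land inside that range. They are not Anthropic's list prices, so substitute current figures from the pricing page before drawing conclusions.

```typescript
// Prices per million tokens (USD). PLACEHOLDER values, not Anthropic's list prices:
// chosen only so the Opus/Haiku ratio lands in the 15-20x range discussed above.
interface Price {
  inputPerMTok: number
  outputPerMTok: number
}

const placeholderPrices: Record<"haiku" | "opus", Price> = {
  haiku: { inputPerMTok: 1, outputPerMTok: 5 },
  opus: { inputPerMTok: 18, outputPerMTok: 90 },
}

// Cost of one call, given its token counts and the model's price.
function callCost(inputTokens: number, outputTokens: number, price: Price): number {
  return (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000
}

// Daily cost of sending `callsPerDay` identical calls to one model.
function dailyCost(callsPerDay: number, inputTokens: number, outputTokens: number, price: Price): number {
  return callsPerDay * callCost(inputTokens, outputTokens, price)
}

// Example: a 400-token classification prompt with a 20-token answer, 500 calls a day.
const haikuDaily = dailyCost(500, 400, 20, placeholderPrices.haiku)
const opusDaily = dailyCost(500, 400, 20, placeholderPrices.opus)
console.log(haikuDaily.toFixed(2), opusDaily.toFixed(2), (opusDaily / haikuDaily).toFixed(1))
```

Because both models see the same token counts here, the daily ratio equals the price ratio (18x with these placeholders).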

In practice, correct model routing accounts for the bulk of AI cost savings in production, around 80%, rather than prompt optimisation, caching, or batching. Per engineering hour invested, it is the highest-ROI action available.

Then there is latency. Haiku typically responds in under a second. Opus can take 5-10 seconds on complex tasks. For user-facing features, that difference is noticeable.

Haiku: classification and simple extractions

Haiku is the most underestimated model in most teams' setup. It is built for tasks that are clearly defined, with structured input and predictable output.

Concrete use cases Haiku handles well:

Sentiment classification: "Is this customer review positive, negative, or neutral?" A question with three answers. Haiku handles it in under 500 tokens and under a second.

Entity extraction: "Extract the company name, contact person, and invoice amount from this email." Structured output from unstructured text. Haiku is fast and accurate when input patterns are relatively consistent.

Routing decisions: "Should this request go to technical support, sales, or billing?" Classification with a limited set of categories. Used as a gateway model, Haiku can reduce Opus usage by 60-80% in systems that need to distinguish between request types.

Formatting and transformation: JSON-to-markdown, markdown-to-HTML, structuring output from another model, normalising date formats.

Simple validation: "Is this a valid tax ID? Is this address formatted correctly?"

Haiku is not well-suited for tasks that require holding many conflicting pieces of information in working memory, understanding subtle context, or producing long, nuanced prose. It hallucinates more on ambiguous inputs.
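When Haiku is used as a gateway classifier, it helps to treat its answer as untrusted: accept it only if it is exactly one of the allowed labels, and otherwise escalate to a larger model. A minimal sketch, assuming the gateway prompt asks Haiku to reply with a single label from the support/sales/billing example; the label strings, the normalisation, and the escalation rule are illustrative choices, not a prescribed design.

```typescript
// Closed set of routing labels the gateway prompt asks Haiku to choose from.
const ROUTES = ["technical_support", "sales", "billing"] as const
type Route = (typeof ROUTES)[number]

// Result of the gateway step: either a trusted label or a request to escalate.
type GatewayResult =
  | { kind: "routed"; route: Route }
  | { kind: "escalate"; reason: string }

// Normalise Haiku's raw reply and accept it only if it is exactly one allowed label.
// Anything else (extra prose, unknown labels, empty output) is escalated to a
// larger model instead of being guessed at.
function parseGatewayReply(raw: string): GatewayResult {
  const normalised = raw.trim().toLowerCase().replace(/[\s-]+/g, "_")
  const match = ROUTES.find((route) => route === normalised)
  if (match !== undefined) {
    return { kind: "routed", route: match }
  }
  return { kind: "escalate", reason: `unrecognised label: ${JSON.stringify(raw)}` }
}

console.log(parseGatewayReply(" Billing\n"))
console.log(parseGatewayReply("Technical support"))
console.log(parseGatewayReply("Probably sales, or maybe billing"))
```

The escalation branch is where Haiku's weakness on ambiguous input is contained: a request it cannot label cleanly goes to Sonnet rather than being forced into a category.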

Sonnet: mainstream reasoning

Sonnet is the workhorse. It handles the majority of user-facing AI calls in a typical enterprise system.

Code explanation and review: "What does this function do, and are there potential bugs?" Sonnet handles this effectively up to 2-3 class dependencies deep.

Customer query responses: Support automation with access to company-specific context. Sonnet can maintain thread continuity in a multi-turn conversation and produce responses that sound professional without overthinking simple content.

Context-aware summaries: Summarising documents, meeting notes, reports. Sonnet performs well when output length is defined and independent evaluation of conflicting facts is not required.

Strategic analysis, first pass: First drafts of analyses that are subsequently quality-checked. Sonnet as the draft model, Opus (or humans) as the reviewer.

Content generation: Product descriptions, marketing copy, email drafts. Sonnet is good at following tone-of-voice guidelines and producing professional output quickly.
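The draft-and-review split described under "Strategic analysis, first pass" can be expressed as a two-stage pipeline. A minimal sketch, assuming a hypothetical `CallModel` function that wraps whatever client you use; the stub below stands in for real API calls so the control flow can be run on its own.

```typescript
// Hypothetical signature for a model call: wrap your real SDK client behind it.
type Tier = "sonnet" | "opus"
type CallModel = (tier: Tier, prompt: string) => Promise<string>

interface ReviewedDraft {
  draft: string
  review: string
  approved: boolean
}

// Sonnet writes the first draft; Opus reviews it against the brief.
// Assumed convention: the reviewer starts its answer with APPROVED or REVISE.
async function draftThenReview(brief: string, callModel: CallModel): Promise<ReviewedDraft> {
  const draft = await callModel("sonnet", `Write a first-draft analysis.\n\nBrief:\n${brief}`)
  const review = await callModel(
    "opus",
    `Review the draft against the brief. Start with APPROVED or REVISE, then explain.\n\n` +
      `Brief:\n${brief}\n\nDraft:\n${draft}`,
  )
  return { draft, review, approved: review.trim().startsWith("APPROVED") }
}

// Stub in place of real API calls, so the control flow can be run offline.
const stub: CallModel = async (tier) =>
  tier === "sonnet" ? "DRAFT: strategy summary" : "REVISE: the risks section is missing"

draftThenReview("Assess our Q3 strategy update.", stub).then((result) => {
  console.log(result.approved ? "approved" : "needs revision", "|", result.review)
})
```

Keeping the model call behind an injected function also lets the pipeline be tested without network access. The APPROVED/REVISE convention is just one simple way to get a machine-readable verdict from the reviewer; a human reviewer can take Opus's place in the second stage.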

The critical limitation: Sonnet can hit its ceiling on ambiguous, politically complex, or knowledge-intensive tasks. A task that requires interpreting contrasting positions across fifteen pages of input and evaluating their implications is Opus territory.

Opus: complex analysis and nuanced interpretation

Opus is the model you use when it costs something to be wrong, and average output is not good enough.

Strategy parsing: Analysis of a 60-page PowerPoint deck containing a company's strategy and extraction of Playing-to-Win elements, strategic themes, and initiatives — with correct handling of conflicting signals and management-speak.

Complex code generation: Generating architecturally correct code that respects existing patterns and dependencies from a specification with incomplete information.

Legal and compliance analysis: Interpreting regulatory text in a specific context, identifying edge cases, and generating recommendations assessed against specific criteria.

Complex problem diagnosis: AI-assisted debugging in multi-layer systems with conflicting signals — Opus maintains the reasoning thread better and produces more reliable root-cause analyses.

Nuanced content evaluation: Evaluating whether a piece of content meets many simultaneous criteria (tone, factual accuracy, brand compliance, legal constraints) and producing structured feedback.

Do not use Opus as your default model. Use it as a specialist.

The routeModel() pattern in practice

The technical implementation is simple: a routing function that selects a model based on a task complexity parameter.

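```typescript
type Complexity = "simple" | "standard" | "complex"

// Placeholder model IDs: substitute current IDs from Anthropic's model overview.
const models = {
  haiku: "claude-haiku-<version>", // Haiku
  fast: "claude-sonnet-<version>", // Sonnet
  deep: "claude-opus-<version>",   // Opus
} as const

function routeModel(complexity: Complexity): string {
  switch (complexity) {
    case "simple":   return models.haiku
    case "standard": return models.fast   // Sonnet
    case "complex":  return models.deep   // Opus
  }
}
```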

The critical step is defining complexity against specific criteria, not intuition. A pragmatic approach:

Simple: Output is classification, extraction, or transformation. Input is structured or semi-structured. Haiku error rate < 2% in evaluation.

Standard: Output requires reasoning or prose generation with context. Input may be unstructured. Task is well-defined but non-trivial.

Complex: Output requires multi-step reasoning, interpretation of conflicting information, or high nuance. Sonnet error rate > 5% in evaluation. The use case can tolerate higher latency.
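These criteria can be encoded so that the complexity label comes from recorded evaluation results rather than from whoever writes the call site. A minimal sketch: the `TaskProfile` fields are hypothetical, measured error rates are used as the deciding signal, and the 2% and 5% cut-offs are the ones stated above, to be replaced by your own measured thresholds.

```typescript
type Complexity = "simple" | "standard" | "complex"

// Hypothetical description of a task, filled in from an evaluation run.
interface TaskProfile {
  outputKind: "classification" | "extraction" | "transformation" | "reasoning" | "prose"
  structuredInput: boolean // structured or semi-structured input
  haikuErrorRate?: number  // measured on the golden dataset, 0..1
  sonnetErrorRate?: number // measured on the golden dataset, 0..1
  latencyTolerant: boolean // can the use case absorb Opus-level latency?
}

const HAIKU_MAX_ERROR = 0.02  // "Haiku error rate < 2% in evaluation"
const SONNET_MAX_ERROR = 0.05 // "Sonnet error rate > 5% in evaluation" => complex

function classifyComplexity(task: TaskProfile): Complexity {
  const mechanical =
    task.outputKind === "classification" ||
    task.outputKind === "extraction" ||
    task.outputKind === "transformation"
  if (mechanical && task.structuredInput &&
      task.haikuErrorRate !== undefined && task.haikuErrorRate < HAIKU_MAX_ERROR) {
    return "simple"
  }
  // Only escalate to Opus when Sonnet has measurably failed and latency allows it.
  if (task.sonnetErrorRate !== undefined && task.sonnetErrorRate > SONNET_MAX_ERROR &&
      task.latencyTolerant) {
    return "complex"
  }
  return "standard"
}
```

Tasks without measurements fall through to "standard", so nothing is routed down to Haiku or up to Opus without evaluation data behind the decision.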

The most important principle: routing decisions should be based on evaluation, not intuition. Run 50-100 cases through Haiku and Sonnet, measure output quality, and set the threshold where quality degradation becomes noticeable.

The evaluation methodology: how to set the threshold

Many teams skip evaluation and set routing thresholds based on intuition. That is expensive. The correct method is to build a golden dataset.

A golden dataset is a collection of representative inputs with human-validated correct outputs. For a classification problem, that means 100-200 manually labelled examples. For a reasoning problem, it means 30-50 examples with domain expert scoring.

The process:

Collect inputs from production: real user requests, anonymised. Build a set of 100 examples.

Run all 100 through Haiku and Sonnet independently. Log output and latency.

Ask a team member to score output quality on a 1-5 scale for each example, blind to which model produced it.

Calculate the average score and error rate for both models. Identify the task types or input characteristics for which Haiku's error rate exceeds the acceptance threshold (typically 3-5%).

Set the routing threshold based on the measurement, and document it in a code comment next to the routing logic.

This takes half a day for a simple classification problem. It saves months of overspending and guesswork.
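The scoring and threshold steps above come down to a small amount of arithmetic over the blind scores. A minimal sketch, assuming that a score of 2 or lower on the 1-5 scale counts as an error (the process does not define "error", so pick and document your own cut-off) and using 5% as the acceptance threshold from the range above. The sample scores are illustrative, not real data.

```typescript
// One blind-scored example: the model that produced the output and its 1-5 score.
interface ScoredOutput {
  model: "haiku" | "sonnet"
  score: number // 1-5, assigned without knowing which model produced the output
}

interface ModelSummary {
  n: number
  meanScore: number
  errorRate: number // share of outputs scored at or below ERROR_CUTOFF
}

const ERROR_CUTOFF = 2            // assumption: a score of 1 or 2 counts as an error
const ACCEPTANCE_THRESHOLD = 0.05 // upper end of the "typically 3-5%" range

function summarise(scores: ScoredOutput[], model: ScoredOutput["model"]): ModelSummary {
  const own = scores.filter((s) => s.model === model)
  if (own.length === 0) return { n: 0, meanScore: NaN, errorRate: NaN }
  const total = own.reduce((sum, s) => sum + s.score, 0)
  const errors = own.filter((s) => s.score <= ERROR_CUTOFF).length
  return { n: own.length, meanScore: total / own.length, errorRate: errors / own.length }
}

// Route this task type to Haiku only if its measured error rate is within the threshold.
function haikuAcceptable(scores: ScoredOutput[]): boolean {
  const haiku = summarise(scores, "haiku")
  return haiku.n > 0 && haiku.errorRate <= ACCEPTANCE_THRESHOLD
}

// Illustrative scores only; a real run would use around 100 examples per model.
const sample: ScoredOutput[] = [
  ...Array.from({ length: 18 }, (): ScoredOutput => ({ model: "haiku", score: 4 })),
  ...Array.from({ length: 2 }, (): ScoredOutput => ({ model: "haiku", score: 1 })),
  ...Array.from({ length: 20 }, (): ScoredOutput => ({ model: "sonnet", score: 5 })),
]

console.log(summarise(sample, "haiku"), summarise(sample, "sonnet"), haikuAcceptable(sample))
```

With these illustrative scores Haiku averages 3.7 but has a 10% error rate, so it fails the 5% threshold even though its average looks respectable. That is why the threshold is set on the error rate rather than the mean.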

What causes routing to fail

Four mistakes are more common than others:

Static routing: All calls to one feature use the same model because "it was simplest to implement." Over time this means either paying for more model than simple tasks need (expensive) or underpowering complex ones (bad output).

Routing based on feature type, not input characteristics: "Strategy analyses always use Opus" is a poor signal. A simple strategic update does not need Opus. A deep parsing of an ambiguous document does. Routing should respond to the concrete input, not the feature category.

No evaluation: Teams implement routing without measuring whether Haiku actually handles the tasks they route to it. Haiku hallucinates on ambiguous inputs and underperforms on context-sensitive tasks. Evaluate.

Using streaming as an excuse to avoid routing: Streaming outputs are harder to quality-assess, and that gets used as an argument for always using Opus. That argument does not hold. Streaming works the same way with Haiku as with Opus: tokens arrive incrementally through the same API. What differs is speed, and Haiku is faster there too, including on time-to-first-token.
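Routing on input characteristics rather than feature category might look like the sketch below. The signals (page count, number of sources, a conflicting-sources flag) and every threshold are hypothetical placeholders; the point is that two calls from the same "strategy analysis" feature can land on different models. The sketch only chooses between Sonnet and Opus for an analysis feature, and real thresholds should come from your own evaluation.

```typescript
type Complexity = "simple" | "standard" | "complex"

// Hypothetical per-request signals, computed cheaply before the model call.
interface InputSignals {
  pages: number               // length of the input document(s)
  sourceCount: number         // separate sources that must be reconciled
  conflictingSources: boolean // e.g. flagged by a cheap pre-check
}

// Placeholder thresholds: derive real values from evaluation, not from this example.
const LONG_INPUT_PAGES = 15
const MANY_SOURCES = 3

// For an analysis feature: decide per request between Sonnet ("standard") and Opus ("complex").
function complexityForInput(input: InputSignals): Complexity {
  const hard =
    input.conflictingSources ||
    input.pages >= LONG_INPUT_PAGES ||
    input.sourceCount >= MANY_SOURCES
  return hard ? "complex" : "standard"
}

// Two requests from the same "strategy analysis" feature.
const shortUpdate: InputSignals = { pages: 2, sourceCount: 1, conflictingSources: false }
const ambiguousDeck: InputSignals = { pages: 60, sourceCount: 1, conflictingSources: true }

console.log(complexityForInput(shortUpdate), complexityForInput(ambiguousDeck))
```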

Connection to prompt caching and observability

Model routing works in synergy with two other cost optimisations: prompt caching and AI observability.

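Prompt caching reduces the price of calls that have already been routed to the right model. If a Sonnet call has a long, static system prompt that is cached, every subsequent cache hit bills that cached portion at 10% of the normal input-token price, i.e. 90% less (the call that first writes the prompt to the cache costs somewhat more than a normal one). Because the caching discount applies on top of whichever model's price you are paying, combining correct routing (Haiku instead of Sonnet) with caching (90% off cached input tokens) is multiplicative, not additive.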
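The multiplicative effect is easiest to see with numbers. A minimal sketch, assuming a steady state where the static system prompt is already in the cache (so the one-off cache-write surcharge is ignored), cache reads billed at 10% of the base input price, and placeholder base input prices in which Haiku costs one third of Sonnet per token. Check current prices before reusing the figures.

```typescript
// Placeholder base input prices per million tokens (check current pricing).
const INPUT_PRICE_PER_MTOK = { haiku: 1, sonnet: 3 } as const
// Cache reads are billed at a fraction of the base input price (10% = "90% less").
const CACHE_READ_MULTIPLIER = 0.1

// Input cost of one call whose first `cachedTokens` tokens hit the cache.
// Steady-state assumption: the cache write has already been paid for and is ignored.
function inputCost(
  model: keyof typeof INPUT_PRICE_PER_MTOK,
  cachedTokens: number,
  freshTokens: number,
): number {
  const price = INPUT_PRICE_PER_MTOK[model] / 1_000_000
  return cachedTokens * price * CACHE_READ_MULTIPLIER + freshTokens * price
}

// A 10,000-token static system prompt plus a 500-token user message.
const baseline = inputCost("sonnet", 0, 10_500) // Sonnet, no caching
const routed = inputCost("haiku", 0, 10_500)    // routing only
const cached = inputCost("sonnet", 10_000, 500) // caching only
const both = inputCost("haiku", 10_000, 500)    // routing and caching

console.log({ baseline, routed, cached, both })
```

With these inputs, routing alone cuts input cost to a third of the baseline, caching alone to a seventh, and both together to a twenty-first: 1/3 × 1/7 = 1/21.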

AI observability gives you the data you need to evaluate and improve routing over time. Without traces, you do not know whether Haiku is actually performing acceptably on the calls you have sent there. With traces, you can see error rate per model per feature and adjust the threshold based on actual data.
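With traces in place, the error rate per model per feature is a simple aggregation. A minimal sketch, assuming each trace record carries a hypothetical `failed` flag set by whatever check you use (a validator, a user correction, a sampled human review); the record shape is illustrative, not the schema of any specific observability tool.

```typescript
// Hypothetical trace record: one row per model call.
interface Trace {
  feature: string
  model: "haiku" | "sonnet" | "opus"
  failed: boolean // set by a validator, a user correction, or sampled review
}

interface Stats {
  calls: number
  failures: number
  errorRate: number
}

// Group traces by "feature/model" and compute the error rate for each group.
function errorRates(traces: Trace[]): Map<string, Stats> {
  const groups = new Map<string, Stats>()
  for (const t of traces) {
    const key = `${t.feature}/${t.model}`
    const s = groups.get(key) ?? { calls: 0, failures: 0, errorRate: 0 }
    s.calls += 1
    if (t.failed) s.failures += 1
    s.errorRate = s.failures / s.calls
    groups.set(key, s)
  }
  return groups
}

const traces: Trace[] = [
  { feature: "ticket-routing", model: "haiku", failed: false },
  { feature: "ticket-routing", model: "haiku", failed: true },
  { feature: "ticket-routing", model: "haiku", failed: false },
  { feature: "ticket-routing", model: "haiku", failed: false },
  { feature: "summaries", model: "sonnet", failed: false },
]

errorRates(traces).forEach((stats, key) => console.log(key, stats))
```

Comparing these per-group rates against the acceptance threshold from the evaluation step shows which routes need to move up a tier, or can move down one.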

The three optimisations should be implemented in sequence: routing first, then caching, then observability, but they work best as a system.

What to do tomorrow

Model routing is the highest-ROI action in most AI systems in production. Three steps to get started:

Week 1: List all AI calls in the product. Categorise them as simple, standard, or complex against the criteria above. Identify calls currently using Opus that could plausibly be handled by Sonnet or Haiku.

Week 2: Implement the routeModel() function. Run 50-100 cases from Opus calls through Sonnet and measure output quality. Set the threshold based on the measurement.

Week 3: Evaluate Sonnet calls that might be routable to Haiku. Measure. Decide.

Start with routing. Optimise prompts afterwards.

References

[1] Anthropic, "Claude Model Overview and Pricing", available at docs.anthropic.com/en/docs/models-overview (accessed 2026-04-23).

[2] Anthropic, "Build with Claude — Model Selection Guide", available at docs.anthropic.com/en/docs/about-claude (accessed 2026-04-23).


