AI Coding Agent Accuracy: Opus 4.7 vs 4.8
Discover the efficiency gains of AI coding agents Opus 4.7 vs 4.8. Learn how 4.8 improves cost and performance in real tasks. Upgrade smartly!

Rob Willoughby

You are deciding whether to roll your default agent model from Opus 4.7 to 4.8. The release notes promise improvements, the leaderboard moves a fraction of a point, so you shrug, schedule the upgrade for a quiet Friday, and move on.
We ran both versions through the same skills evaluation, roughly 850 scenarios each solved twice, and on the headline metric they finished level. Underneath the tie, though, 4.8 reached the same answers in about four fewer turns per task and at about 5% lower cost. The upgrade that looks like a non-event on the scoreboard turns out to be a real efficiency gain in the place that actually bills you: the agent loop.
AI agent evaluation measures how an agent behaves on real tasks rather than only scoring its final answer, tracking cost, turns, and reliability across paired runs. The reason to bother is that two models can post the same score while spending very different amounts of work to reach it.
Two versions, one eval harness
Both models ran the identical setup. Every scenario is solved twice, once with no help and once with the relevant skill installed, so we can isolate what the skill contributes from what the base model already knows. We score three things: instruction following (did the agent do what the skill tells it to do), task completion (did it reach the goal), and an overall blend weighted toward instruction following. We also flag integrity issues, like an agent peeking at the grading rubric instead of solving the task.
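To make the paired setup concrete, here is a minimal sketch of how one scenario's two runs and the blended score could be represented. The class names, the `INSTRUCTION_WEIGHT` value, and the example scores are all assumptions for illustration; the article only says the blend is weighted toward instruction following and does not give the exact weight.

```python
from dataclasses import dataclass

# Hypothetical weight: the blend is described as leaning toward instruction
# following, but the exact value is not published here.
INSTRUCTION_WEIGHT = 0.6


@dataclass
class RunScore:
    instruction_following: float  # 0-100, did the agent follow the skill
    task_completion: float        # 0-100, did it reach the goal

    def overall(self, w: float = INSTRUCTION_WEIGHT) -> float:
        """Weighted blend of the two component scores."""
        return w * self.instruction_following + (1 - w) * self.task_completion


@dataclass
class PairedScenario:
    name: str
    baseline: RunScore    # solved with no skill installed
    with_skill: RunScore  # solved with the relevant skill installed

    def skill_uplift(self) -> float:
        """What the skill adds on top of what the base model already knows."""
        return self.with_skill.overall() - self.baseline.overall()


# Illustrative numbers only, not taken from the evaluation data.
example = PairedScenario(
    name="example-task",
    baseline=RunScore(instruction_following=60.0, task_completion=90.0),
    with_skill=RunScore(instruction_following=90.0, task_completion=100.0),
)
print(round(example.skill_uplift(), 1))
```

Keeping both runs on the same record is the design choice that matters: the uplift is always computed within a scenario, never across two different task sets.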
Opus 4.7 is the incumbent. In our runs it is a strong agent that leans heavily on skills to reach its ceiling, and it explores a lot of paths to get there.
Opus 4.8 is the point release. It posts the same ceiling with a skill installed, but it starts from a higher floor without one, and it gets to the answer with noticeably less wandering.
Where AI coding agent accuracy stops being the story
Here is the head-to-head on the shared scenario set, all with the relevant skill installed unless noted.
| Dimension | Opus 4.7 | Opus 4.8 |
|---|---|---|
| Overall score | 91.9 | 92.1 |
| Baseline score, no skill | 71.4 | 74.1 |
| Task completion | 97.1 | 97.4 |
| Instruction following | 88.1 | 88.1 |
| Turns per task | 19.2 | 15.0 |
| Output tokens per task | 7,820 | 9,763 |
| Cost per task, API pricing | baseline | about 5% lower |
| Integrity flags raised | 10.2% | 7.9% |
The overall accuracy gap is 0.2 points. If you stopped reading the row labeled "overall score," you would conclude nothing changed. Three other rows complicate that picture.
The first is the baseline. Without any skill, 4.8 scores 74.1 against 4.7's 71.4, a gain of about 2.7 points, and its no-skill instruction following climbed from the high 50s into the low 60s. The ceiling is shared because the skill pulls both versions up to roughly the same place. The floor is where 4.8 actually improved, and that has a practical consequence: 4.8 depends on the skill slightly less to do good work. One plausible reading is that some of the knowledge previously present only in skills has been trained into the model weights.
The second is turns. 4.8 finishes the average task in 15.0 turns versus 19.2 for 4.7, a 21% reduction. In an agent loop, a turn is a full round trip of context, reasoning, and tool use. Cutting four turns off the average task lowers latency, reduces the chances for an agent to talk itself into a wrong path, and, as we will see, lowers cost.
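The turn arithmetic is worth doing explicitly, because the percentage depends on which model you treat as the reference. A small sketch using the rounded table values (the helper name `relative_change` is ours):

```python
def relative_change(old: float, new: float) -> float:
    """Signed change from old to new, expressed as a fraction of old."""
    return (new - old) / old


turns_47, turns_48 = 19.2, 15.0  # turns per task, from the table above

saved = turns_47 - turns_48                       # absolute turns saved per task
reduction = -relative_change(turns_47, turns_48)  # 4.8 measured against 4.7
extra = relative_change(turns_48, turns_47)       # 4.7 measured against 4.8

print(f"{saved:.1f} turns saved per task")
print(f"4.8 uses {reduction:.1%} fewer turns than 4.7")
print(f"4.7 uses {extra:.1%} more turns than 4.8")
```

On the rounded table values the reduction comes out just under 22%, close to the 21% quoted above, which presumably reflects unrounded per-run data. The same gap read in the other direction is 28% more turns for 4.7, so it matters which way round a comparison is phrased.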
The third is integrity. The eval flags runs where the agent took a shortcut, like reading the grading rubric or reaching outside its workspace. Those flags dropped from 10.2% of shared runs to 7.9%, so 4.8 is modestly more disciplined about how it reaches an answer. This is consistent with Anthropic’s claims about 4.8 being more honest.
Reading the cost: turns, not tokens
Look again at two rows that seem to contradict each other. 4.8 produces more output per task, 9,763 tokens against 7,820, yet it costs about 5% less.
This is because output volume does not dominate agentic cost. The dominant term is the context replayed on every turn. Each turn re-sends the accumulated conversation and tool results, and in long agent runs that cached input swamps the fresh output the model writes. Fewer turns means fewer replays, so 4.8 can be more verbose inside each turn and still come out ahead, because it takes four fewer turns to converge.
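The replay effect can be sketched with a toy cost model. Everything below is an assumption made for illustration: the prices are placeholders rather than real API pricing, the 20,000-token starting context and 1,500 tokens of tool results per turn are invented, and the model charges all replayed context at the cached rate. Only the turn counts (rounded to whole turns) and output token totals come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Dollars per million tokens. Placeholder values, not real pricing."""
    fresh_input: float = 5.0
    cached_input: float = 0.5
    output: float = 25.0


def run_cost(turns: int, output_tokens: float, base_context: float,
             tool_tokens_per_turn: float, prices: Prices = Prices()) -> float:
    """Estimate the cost of one agent run under a simple context-replay model."""
    per_million = 1_000_000
    out_per_turn = output_tokens / turns
    context = base_context
    cost = 0.0
    for _ in range(turns):
        # Replay everything accumulated so far at the cached-input rate.
        cost += context * prices.cached_input / per_million
        # Pay for the fresh output written on this turn.
        cost += out_per_turn * prices.output / per_million
        # Pay for the new tool results entering the context.
        cost += tool_tokens_per_turn * prices.fresh_input / per_million
        # Both become part of the context replayed on later turns.
        context += out_per_turn + tool_tokens_per_turn
    return cost


# Turns and output tokens from the table; the other inputs are invented.
cost_47 = run_cost(turns=19, output_tokens=7820,
                   base_context=20_000, tool_tokens_per_turn=1_500)
cost_48 = run_cost(turns=15, output_tokens=9763,
                   base_context=20_000, tool_tokens_per_turn=1_500)
print(f"4.7-like run: ${cost_47:.3f}   4.8-like run: ${cost_48:.3f}")
```

Under these assumptions the 15-turn run comes out cheaper even though it writes more output, because the replayed context grows with every turn and is paid for again on each one. The size of the gap depends entirely on the invented inputs, so treat the result as a direction rather than a reproduction of the 5% figure.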
Model cards publish only the per-token rate, which sets the price of a unit of work. Turn count sets how many units the model decides to spend. A point release that holds accuracy flat while spending 21% fewer turns is working on that second term, which is the one that scales with your usage.
The same dynamic shows up in how each version absorbs a skill. Adding the relevant skill is not free: it pulls in instructions and reference material the agent has to process, and the question is how efficiently the model turns that overhead into a result.
| Effect of installing the skill | Opus 4.7 | Opus 4.8 |
|---|---|---|
| Overall score gain | +20.5 | +18.0 |
| Cost increase | +38% | +12% |
| Turn increase | +41% | +14% |
On 4.7, switching on a skill added 41% more turns and 38% more cost to cash in a 20.5-point gain. On 4.8, the same class of skill buys an 18.0-point gain for 14% more turns and 12% more cost. 4.8 treats a skill more like a shortcut and less like an invitation to explore. If you run agent skills at scale, that lower skill tax compounds across every task you ship.
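One way to put a single number on the skill tax is score points gained per percentage point of added cost. This ratio is our own illustrative metric rather than something the eval reports; the inputs are the table values above.

```python
# Values copied from the "Effect of installing the skill" table.
skill_effect = {
    "Opus 4.7": {"score_gain": 20.5, "cost_increase_pct": 38, "turn_increase_pct": 41},
    "Opus 4.8": {"score_gain": 18.0, "cost_increase_pct": 12, "turn_increase_pct": 14},
}


def points_per_cost_pct(row: dict) -> float:
    """Score points gained for each percentage point of extra cost."""
    return row["score_gain"] / row["cost_increase_pct"]


efficiency = {model: points_per_cost_pct(row) for model, row in skill_effect.items()}
for model, value in efficiency.items():
    print(f"{model}: {value:.2f} points per 1% of added cost")
```

By this measure 4.8 gets roughly 1.5 points per percent of added cost against roughly 0.54 for 4.7, close to three times the return on the same kind of overhead.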
The one place 4.8 regressed
A fair comparison reports where the new version loses ground. Per scenario, the record is close to a wash: using a two-point threshold, 4.8 scored higher on 23% of shared tasks, tied on 61%, and scored lower on 17% (the figures are rounded, so they sum to slightly over 100%). The interesting part is that the losses cluster.
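The win, tie, and loss tally can be sketched as a simple threshold on the per-scenario score difference. The deltas below are invented for illustration, and treating a difference of exactly two points as a tie is our assumption, since the text does not say how the boundary is handled.

```python
from collections import Counter


def classify(delta: float, threshold: float = 2.0) -> str:
    """Label one scenario by the score change from 4.7 to 4.8.

    Assumption: a change of exactly `threshold` points counts as a tie.
    """
    if delta > threshold:
        return "win"
    if delta < -threshold:
        return "loss"
    return "tie"


# Invented per-scenario deltas (4.8 score minus 4.7 score).
deltas = [5.0, 1.5, -0.5, -3.0, 2.0, 0.0, -7.5, 3.2]

tally = Counter(classify(d) for d in deltas)
shares = {label: count / len(deltas) for label, count in tally.items()}
print(tally)
print(shares)
```

A threshold like this keeps run-to-run noise from being counted as a win or a loss, at the price of hiding small but consistent shifts, which is why the clustering described next is worth checking separately.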
4.8 regressed on web research and scraping skill families. Firecrawl tasks dropped 3.3 points on average across 72 scenarios. LangChain dropped 2.9 points across 48. Smaller families like Tavily and Apify fell further, 10.4 and 7.6 points, though on fewer tasks. Meanwhile 4.8 improved on infrastructure, auth, and code tooling: Cloudflare gained 4.5 points across 38 scenarios, Auth0 gained 4.3 across 18, and Mastra gained 10.1 across 10.
The aggregate hid this completely, because the gains and losses nearly cancel. Only a per-domain breakdown surfaces it. That is the whole argument for paired skill evals over a single leaderboard number: the headline can be a tie while two coherent shifts run in opposite directions underneath it.
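The cancellation is easy to check from the figures above. This sketch pools the five families whose scenario counts are given; Tavily and Apify are left out because their counts are not stated, so the pooled number covers only a subset of the eval.

```python
# (family, mean score change from 4.7 to 4.8, number of scenarios)
families = [
    ("Firecrawl", -3.3, 72),
    ("LangChain", -2.9, 48),
    ("Cloudflare", 4.5, 38),
    ("Auth0", 4.3, 18),
    ("Mastra", 10.1, 10),
]

total_scenarios = sum(n for _, _, n in families)
# Scenario-weighted mean change across the listed families.
pooled_delta = sum(delta * n for _, delta, n in families) / total_scenarios

for name, delta, n in sorted(families, key=lambda row: row[1]):
    print(f"{name:<11} {delta:+5.1f} points over {n} scenarios")
print(f"Pooled: {pooled_delta:+.2f} points over {total_scenarios} scenarios")
```

Across these 186 scenarios the pooled change is about -0.15 points, which rounds to nothing, while the individual families move by 3 to 10 points in either direction.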
When to roll forward to 4.8
Roll forward to 4.8 if your agents run long, multi-turn tasks where turn count, latency, and cost matter, which is most production agent work. You get the same accuracy ceiling, a higher floor before skills, a 21% turn reduction, a cheaper skill tax, and fewer integrity flags. If your workloads lean on infrastructure, auth, or general code tooling, 4.8 is flat to clearly better.
Test before you roll forward if your agents live in the scrape, crawl, and summarize world. The web research regression is small in absolute terms but consistent across the families we measured. Run your own A/B on your top scraping workflows first.
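For a small in-house A/B, a paired bootstrap over per-task score differences is one reasonable way to judge whether a regression is more than noise. This is a sketch of a generic technique, not the method used in our eval; the function name, the resample count, and the scores are all illustrative.

```python
import random
from statistics import mean


def paired_bootstrap(old_scores, new_scores, n_resamples=2000, seed=0):
    """Mean paired difference (new minus old) with a 95% percentile interval."""
    deltas = [new - old for old, new in zip(old_scores, new_scores)]
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    resampled_means = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_resamples)
    )
    lo = resampled_means[int(0.025 * n_resamples)]
    hi = resampled_means[int(0.975 * n_resamples) - 1]
    return mean(deltas), lo, hi


# Invented scores for the same eight scraping tasks on each model.
scores_47 = [90, 85, 88, 92, 80, 95, 87, 91]
scores_48 = [86, 84, 85, 93, 76, 90, 88, 87]

observed, lo, hi = paired_bootstrap(scores_47, scores_48)
print(f"mean change {observed:+.2f}, 95% interval [{lo:+.2f}, {hi:+.2f}]")
```

If the whole interval sits below zero on your own workflows, the regression is probably real for your stack; if it straddles zero, collect more tasks before deciding. Pairing by task matters here for the same reason it does in the eval: it removes task difficulty from the comparison.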
The takeaway: measure behavior, not the changelog
A skeptic has two reasonable objections. The first: a flat score is just no improvement, so why care? Because two models can tie on accuracy while one needs about 28% more turns (19.2 versus 15.0) and about 5% more budget to get there. The second: these costs come from our eval harness, so the absolute numbers will not match your bill. That is true, but the relative differences in turns, tokens, and cost reflect model behavior, which we would expect to carry over to other workloads.
Measure each release on behavior, on your own tasks, with skills installed and with them stripped out, and look at the per-domain breakdown before you trust the average.
Want to see how your own stack behaves across a model upgrade? Browse the Tessl Registry to find the skills your agents depend on, then run the same paired evaluations we used here to measure what actually changed.
Rob Willoughby
Member of Technical Staff, AI Research Lead at Tessl