Anthropic released Claude Opus 4.7 on April 16, 2026, replacing Opus 4.6 as its most capable generally available model. The upgrade delivers a 12-percentage-point jump on CursorBench (70% vs. 58%), a 44-point leap in XBOW visual acuity (54.5% to 98.5%), and a new self-verification capability that Vercel calls "new behavior we haven't seen from earlier Claude models" (Anthropic, 2026). Here's what actually matters for developers.
What's New in Opus 4.7?
Opus 4.7 is Anthropic's flagship generally available model, sitting just below the restricted Mythos Preview tier. According to Anthropic's official announcement, it resolves 3x more tasks on Rakuten-SWE-Bench than its predecessor and introduces self-verification of outputs as a first-class capability (Anthropic, 2026). The model ID is claude-opus-4-7, with a knowledge cutoff of January 2026.
The headline story isn't a single feature. It's a collection of practical improvements that compound during long coding sessions. Better instruction following, fewer tool-call errors, and persistent file-system memory across sessions all target the same problem: keeping agentic workflows on track without constant human intervention.
What separates this release from typical "bigger model" announcements is the emphasis on reliability metrics over raw capability scores. Anthropic led with partner benchmarks from Cursor, Windsurf, and Notion rather than academic leaderboards. That signals a shift in how frontier labs define "better."

How Does It Compare to Opus 4.6?
Think of Opus 4.7 as Opus 4.6 with the rough edges sanded down. The core architecture stays similar, but instruction adherence, vision processing, and multi-step task completion all improve substantially. Notion reports that Opus 4.7 is the "first model to pass our implicit-need tests," meaning it infers what the user actually wants rather than just executing literal instructions (Anthropic, 2026).
The updated tokenizer produces 1.0-1.35x as many tokens per request as its predecessor. That's worth noting for cost estimation, though Anthropic kept per-token pricing identical to Opus 4.6.
Key Features and Improvements
Opus 4.7 introduces six features that directly affect developer workflows. The self-verification capability alone reduced tool errors by two-thirds in Notion's multi-step benchmark (Anthropic, 2026). Here's the full list.
Self-Verification of Outputs
The model now checks its own work before returning results. Vercel's Joe Haddad called this "new behavior we haven't seen from earlier Claude models." In practice, this means fewer obviously wrong code generations and more consistent output quality across long sessions.
Why does this matter? Because the biggest cost of AI-generated code isn't the API bill. It's debugging incorrect output. Self-verification targets that exact problem.
Vision Processing Overhaul
Vision capabilities jump dramatically. The model now supports a 2,576px long edge (roughly 3.75 megapixels), and XBOW visual-acuity scores rose from 54.5% to 98.5% (Anthropic, 2026). That's not an incremental improvement. It's a category shift.
For developers working with screenshots, design mockups, or document images, this means Opus 4.7 can actually read what's on screen. Previous models struggled with small text, dense UI layouts, and low-contrast elements.
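For screenshot-heavy workflows, a vision request pairs an image block with a text question. Below is a minimal sketch of building that message, assuming the Claude Messages API's base64 image content-block shape; the `image_message` helper name is ours, not an SDK function:

```python
import base64

def image_message(image_bytes: bytes, media_type: str, question: str) -> dict:
    """Build one user message pairing a screenshot with a question,
    using the base64 image content-block shape the Messages API expects."""
    return {
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": media_type,  # e.g. "image/png"
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                },
            },
            {"type": "text", "text": question},
        ],
    }
```

Pass the result in the `messages` list of a normal API call; dense UI text that older models misread is exactly where the acuity jump should show up.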
Adaptive Thinking and xhigh Effort
A new xhigh effort level lets developers trade latency for accuracy on hard problems. Hex's benchmark found that "low-effort Opus 4.7 approximately equals medium-effort Opus 4.6" (Anthropic, 2026). That's a meaningful efficiency gain for production pipelines where you can't afford max-effort latency on every call.
Adaptive thinking lets the model allocate reasoning depth dynamically. Simple questions get fast answers. Complex multi-step problems get deeper analysis.
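One practical pattern is routing calls by estimated difficulty so xhigh is reserved for genuinely hard problems. A sketch of such a router follows; the level names come from the announcement, but the thresholds and the exact API parameter that would accept them are our assumptions:

```python
def pick_effort(estimated_steps: int) -> str:
    """Map a rough task-complexity estimate to an effort level.
    Level names follow the Opus 4.7 announcement; the thresholds and
    the API parameter that accepts them are illustrative assumptions."""
    if estimated_steps <= 2:
        return "low"     # routine calls: Hex found this ~= Opus 4.6 medium
    if estimated_steps <= 6:
        return "medium"
    if estimated_steps <= 12:
        return "high"
    return "xhigh"       # pay the extra latency only on hard problems
```

Per Hex's finding, the "low" tier should cover traffic that previously needed Opus 4.6 at medium effort, which is where the efficiency gain accrues.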
File-System Memory Across Sessions
Opus 4.7 maintains better awareness of project file structures across sessions. For agentic coding tools like Cursor and Windsurf, this reduces the "context re-loading" problem where the model forgets project layout between interactions.
Expanded Context and Output
The model supports a 1M token context window (approximately 555,000 words) and 128K max output tokens, extending to 300K via the Batch API (Anthropic, 2026). That's enough to ingest entire medium-sized codebases in a single prompt.
Cyber Safeguards
Anthropic added a Cyber Verification Program with new safeguards. This is less relevant for most developers but signals Anthropic's continued focus on responsible deployment as model capabilities increase.
Benchmark Results
Opus 4.7 shows consistent improvements across every partner benchmark Anthropic published. On CursorBench, it scores 70% compared to 58% for Opus 4.6, a 12-percentage-point improvement (Anthropic, 2026). But the most striking result is the Rakuten-SWE-Bench score: 3x more tasks resolved than the previous model.
| Benchmark | Evaluator | Opus 4.7 | Opus 4.6 | Improvement |
|---|---|---|---|---|
| CursorBench | Cursor | 70% | 58% | +12pp |
| 93-task coding benchmark | Windsurf | N/A | N/A | +13% resolution rate |
| Visual Acuity | XBOW | 98.5% | 54.5% | +44pp |
| Multi-step workflows | Notion | N/A | N/A | +14%, 1/3 tool errors |
| Rakuten-SWE-Bench | Rakuten | N/A | N/A | 3x more tasks resolved |
| BigLaw Bench | Harvey | 90.9% | N/A | Substantive accuracy at high effort |
| OfficeQA Pro | Databricks | N/A | N/A | 21% fewer errors |
| Code Review Recall | CodeRabbit | N/A | N/A | +10% recall |
| Efficiency benchmark | Hex | N/A | N/A | Low effort = Opus 4.6 medium |
| Droids task success | Factory | N/A | N/A | 10-15% lift |
| App-building benchmark | Bolt | N/A | N/A | Up to 10% better |
What Stands Out?
Three results deserve attention. First, the XBOW visual-acuity jump from 54.5% to 98.5% is the largest single improvement. Second, Windsurf reports that Opus 4.7 solved four tasks that neither Opus 4.6 nor Sonnet 4.6 could complete (Anthropic, 2026). Those aren't marginal gains. They represent new capabilities.
Third, Notion's benchmark combines accuracy improvements (+14%) with efficiency gains (fewer tokens used and one-third the tool errors). That's the rare benchmark where the model got both smarter and cheaper to run.

Real-World Developer Impact
Partner testimonials tell a more practical story than benchmarks alone. Cognition, the company behind the Devin coding agent, reports that Opus 4.7 "works coherently for hours, pushes through hard problems rather than giving up" (Anthropic, 2026). That addresses the single biggest frustration with agentic coding: models that abandon complex tasks mid-execution.
Agentic Coding Gets More Reliable
The combination of self-verification, better instruction following, and file-system memory targets a specific failure mode. Previous models would drift off-task during long sessions, introduce subtle bugs they didn't catch, or lose track of project structure. Opus 4.7 reportedly handles these scenarios better.
Factory reports a 10-15% lift for their Droids product. Bolt sees up to 10% improvement. These aren't revolutionary numbers in isolation, but compounded across hundreds of daily interactions, they add up.
Taken together, the partner benchmarks report gains of roughly 15-20% on most metrics, though the figures mix percentages with percentage points and aren't directly comparable. Vision is the clear outlier at 44 percentage points.
Design and Creative Work
Lovable's CEO called Opus 4.7 the "best model in the world for building dashboards" (Anthropic, 2026). Combined with the higher "creative/design taste" that Anthropic claims, this positions Opus 4.7 as a stronger option for frontend and design-adjacent work.
Code Review and Quality
CodeRabbit reports a 10% improvement in code review recall (Anthropic, 2026). That means the model catches more issues during automated code review. For teams using AI-assisted review pipelines, that's a direct quality improvement without any workflow changes.
Pricing and Availability
Pricing stays at $5 per million input tokens and $25 per million output tokens, identical to Opus 4.6 (Anthropic, 2026). However, the updated tokenizer produces 1.0-1.35x as many tokens for the same text. In the worst case, that's a 35% effective price increase for some workloads.
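That worst case is easy to quantify. Here is a small calculator, assuming the inflation factor applies uniformly to input and output tokens, which is a pessimistic simplification on our part:

```python
INPUT_USD_PER_TOKEN = 5.00 / 1_000_000    # $5 per million input tokens
OUTPUT_USD_PER_TOKEN = 25.00 / 1_000_000  # $25 per million output tokens

def opus_47_cost_usd(input_tokens_46: int, output_tokens_46: int,
                     inflation: float = 1.35) -> float:
    """Cost of a workload measured in Opus 4.6 token counts when rerun
    on Opus 4.7, assuming the tokenizer inflation factor (1.0-1.35x)
    applies uniformly to input and output -- a worst-case simplification."""
    return (input_tokens_46 * INPUT_USD_PER_TOKEN
            + output_tokens_46 * OUTPUT_USD_PER_TOKEN) * inflation
```

At the 1.35x ceiling, a workload that cost $30 in Opus 4.6 tokens (1M in, 1M out) lands at $40.50.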
Where Can You Access It?
Opus 4.7 is available through four channels:
- Claude API (direct from Anthropic)
- Amazon Bedrock
- Google Cloud Vertex AI
- Microsoft Foundry
The model ID is claude-opus-4-7. If you're currently using claude-opus-4-6 in your API calls, you'll need to update the model parameter. Anthropic hasn't announced a deprecation date for Opus 4.6 yet.
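The switch itself is a one-line change to the model parameter. A minimal sketch, treating the request as a plain dict of Messages API parameters; the `migrate_request` helper is ours, not an SDK function:

```python
OPUS_46 = "claude-opus-4-6"
OPUS_47 = "claude-opus-4-7"

def migrate_request(params: dict) -> dict:
    """Return a copy of a Messages API request dict pointed at Opus 4.7.
    Only the model ID changes; per-token pricing is unchanged."""
    updated = dict(params)
    if updated.get("model") == OPUS_46:
        updated["model"] = OPUS_47
    return updated
```

Because Opus 4.6 has no announced deprecation date, keeping the old ID behind a config flag is a reasonable rollback path during testing.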
Cost Considerations
The unchanged per-token pricing is good news, but watch the tokenizer change. Run a few representative prompts through both models and compare token counts before switching production workloads. If your inputs are heavy on code or structured data, the 1.35x token increase could be significant.
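That comparison can be automated. The sketch below averages the new/old token-count ratio over sample prompts; `count_tokens` is a stand-in you would back with the API's token-counting endpoint:

```python
def tokenizer_inflation(count_tokens, prompts,
                        old="claude-opus-4-6", new="claude-opus-4-7"):
    """Average ratio of new-model to old-model token counts over a
    sample of representative prompts. count_tokens(model, text) -> int
    is a stand-in to back with the API's token-counting endpoint."""
    ratios = [count_tokens(new, p) / count_tokens(old, p) for p in prompts]
    return sum(ratios) / len(ratios)
```

Run it against prompts drawn from your real traffic; code-heavy and structured-data inputs are the ones most likely to sit near the 1.35x end of the range.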
That said, Hex's finding that low-effort Opus 4.7 matches medium-effort Opus 4.6 suggests you might offset tokenizer costs by using lower effort levels.
What to Watch For
Opus 4.7 looks strong on paper, but a few things warrant caution. The tokenizer change is the most immediate concern. Any cost projection based on Opus 4.6 token counts needs recalculation.
Migration Risks
The improved instruction following is a double-edged sword. If your prompts relied on Opus 4.6's specific interpretation quirks, the upgrade might change outputs in unexpected ways. Test thoroughly before switching production systems.
The Mythos Question
Opus 4.7 sits below the restricted Mythos Preview model. That means Anthropic has a more capable model it hasn't released broadly. How long before Mythos goes GA? And will its release make Opus 4.7 the "budget" option? These are open questions worth tracking.
The fact that Anthropic chose partner benchmarks over academic ones for this launch suggests a deliberate repositioning. They're competing on "useful in production" rather than "highest score on MMLU." That's a healthier framing for developers evaluating models.
Context Window Reality
The 1M token context window is impressive, but effective retrieval at that scale is still inconsistent across all frontier models. Don't assume 1M tokens means 1M tokens of equally reliable context. Test retrieval accuracy at different positions in your specific use case.
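A needle-in-a-haystack probe is a cheap way to run that test. The sketch below builds prompts with a known fact placed at varying depths; the builder, the filler text, and the "deploy key" needle are all ours, and the loop that queries the model and scores retrieval is omitted:

```python
def place_needle(filler: str, needle: str, position: float,
                 total_chars: int) -> str:
    """Embed a known fact at a relative depth (0.0 = start, 1.0 = end)
    inside filler text, so retrieval can be scored by context position."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    cut = int(position * len(body))
    return body[:cut] + "\n" + needle + "\n" + body[cut:]

# Probe several depths; in practice, send each prompt plus a question
# about the needle and score whether the answer recovers it.
depths = [0.0, 0.25, 0.5, 0.75, 1.0]
prompts = [place_needle("lorem ipsum ", "The deploy key is ZK-114.", d, 50_000)
           for d in depths]
```

Scale `total_chars` toward your real context sizes; a model that retrieves reliably at 50K characters may still degrade at hundreds of thousands of tokens.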
Frequently Asked Questions
Is Claude Opus 4.7 worth upgrading from Opus 4.6?
For agentic coding workflows, yes. The 12-percentage-point CursorBench improvement and 3x Rakuten-SWE-Bench gains represent meaningful capability jumps (Anthropic, 2026). For simple prompt-response use cases, the improvement may not justify migration effort. Test with your actual workloads before committing.
Does Opus 4.7 cost more than Opus 4.6?
Per-token pricing is identical at $5/$25 per million tokens. However, the updated tokenizer generates 1.0-1.35x as many tokens for the same input, which could increase effective costs by up to 35% for some workloads (Anthropic, 2026). Lower effort levels may offset this.
How does Opus 4.7 compare to competing models?
Anthropic focused partner benchmarks on comparisons with Opus 4.6 rather than competitors. The 98.5% XBOW visual-acuity score and 70% CursorBench result are strong, but direct comparisons to GPT-4.5 or Gemini Ultra require independent third-party testing that isn't available yet.
What is the self-verification feature?
Self-verification is Opus 4.7's ability to check its own outputs before returning them. Vercel's Joe Haddad described it as "new behavior we haven't seen from earlier Claude models" (Anthropic, 2026). In practice, it reduces incorrect code generations and inconsistent outputs during long agentic sessions.
See also: Claude Design by Anthropic (powered by Opus 4.7) and Claude Mythos (Anthropic's restricted frontier model).
Claude Opus 4.7 is available now through the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. Model ID: claude-opus-4-7.
