
This model is kind of a disaster.
AI Summary
Opus 4.7 from Anthropic has been released for public use, and while it shows some promising improvements, it also presents significant frustrations and inconsistencies. The model is not Anthropic's most powerful, but it is their best public release. Initial impressions were positive, but prolonged use revealed a concerning regression in performance.
Opus 4.7 is presented as a notable improvement over Opus 4.6, particularly in advanced software engineering tasks. Users are reportedly able to confidently hand off difficult coding work that previously required close supervision. The model is said to handle complex, long-running tasks with rigor and consistency, paying precise attention to instructions and devising ways to verify its own outputs. It also boasts substantially better vision, processing images at higher resolutions, and showing more creativity and taste in professional tasks like creating interfaces, slides, and documents. While less broadly capable than the Claude Mythos preview, it generally outperforms Opus 4.6 across a range of benchmarks.
However, the phrase "a range of benchmarks" deserves scrutiny: Opus 4.7 actually performs worse than Opus 4.6 on several benchmarks, including the Agentic Search benchmark. This aligns with personal experiences of the model making "weird and questionable search decisions." On benchmark charts, Opus 4.7 claims the fewest "best scores" of any model listed, and its top performances often fall in categories where Mythos data is unavailable. It does show better agentic coding performance on SWE-bench Pro and SWE-bench Verified, although the reliability of those benchmarks has been questioned due to data contamination. The model performed well on the MCP Atlas benchmark but slightly worse on cybersecurity vulnerability reproductions.
A key aspect of Opus 4.7's release is the implementation of new cyber safeguards. Anthropic stated their intention to test these safeguards on less capable models before a broader release of Mythos-class models. Opus 4.7 is the first such model, with its cyber capabilities intentionally reduced during training. These safeguards automatically detect and block requests deemed prohibited or high-risk for cybersecurity uses. This aggressive filtering led to frustrating experiences, such as the model misinterpreting system reminders as "prompt injection" or "malware" when asked to analyze a personal website.
To use Opus 4.7 for legitimate cybersecurity purposes like vulnerability research or penetration testing, security professionals must join a "cyber verification program" by filling out a form to gain permission for certain types of code-related queries. This restrictive approach is seen as "silly" and counterproductive.
Opus 4.7 is available across all Claude products and APIs, including Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, with pricing unchanged from Opus 4.6.
One of the most touted improvements is instruction following. The model is said to be "substantially better at following instructions," leading to a literal interpretation of prompts. This means prompts written for older models, which interpreted instructions loosely, might now produce unexpected results, requiring users to "retune their prompts and harnesses accordingly."
Improved multimodal support is another highlight, with Opus 4.7 accepting higher resolution images (up to 2576 pixels on the long edge, or 4 megapixels), three times more than previous Claude models. This enhances its utility for tasks requiring fine visual detail, such as computer use agents reading dense screenshots, data extraction from complex diagrams, and pixel-perfect references. While Anthropic is catching up in this area, Google is still considered the leader in image recognition.
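If you want to probe the higher-resolution vision yourself, the request shape is the same as for earlier Claude models; only the image limits change. Below is a minimal sketch using the Anthropic Python SDK. The model ID `claude-opus-4-7` and the screenshot filename are assumptions for illustration; substitute whatever model slug Anthropic actually publishes.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load a dense screenshot; per the release notes, the long edge can now be
# up to 2576 px (~4 megapixels) before the API downscales it.
with open("dashboard_screenshot.png", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-opus-4-7",  # assumed slug; check Anthropic's published model list
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_b64,
                    },
                },
                {
                    "type": "text",
                    "text": "Extract every metric and its value from this dashboard as JSON.",
                },
            ],
        }
    ],
)

print(response.content[0].text)
```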
The model also shows state-of-the-art performance in finance agent evaluation and economically valuable knowledge work across finance, legal, and other domains. It is reported to be better at using file-system-based memory, remembering notes across long, multi-session work, and requiring less upfront context for new tasks. It also exhibits slightly less "misaligned" behavior than Opus 4.6, though it is not as well-aligned as the Mythos preview.
New features in Claude Code include an "X high" effort level, similar to OpenAI's models, and a new "ultra review" slash command for dedicated review sessions that flag bugs and design issues. These reviews are expected to be costly.
Despite these claimed improvements, practical use reveals significant problems. The aggressive security filters can prematurely halt legitimate tasks, such as solving cryptographic puzzles, labeling them as "flagged" and forcing a switch to a "very, very dumb model" like Sonnet 4. This indicates that while the system prompt adjustments aim for safety, they often result in the model becoming "dumber."
A critical issue is Opus 4.7's inability to perform basic web searches or stay updated with the latest information. When asked to modernize a codebase, it failed to identify the latest versions of frameworks like Next.js, sticking to older versions (e.g., Next.js 15 instead of 16) because its training data was outdated and it didn't search for current information. This highlights a fundamental flaw: while it follows instructions literally, it doesn't perform necessary reconnaissance or verification, leading to "dumb mistakes."
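For context, the "reconnaissance" step the model skipped is trivial: the current version of any npm package is one registry request away. A sketch of that check (my illustration, not something the review prescribes):

```python
# Ask the npm registry what the current Next.js release actually is,
# rather than trusting stale training data.
import json
import urllib.request

with urllib.request.urlopen("https://registry.npmjs.org/next/latest") as resp:
    latest = json.load(resp)["version"]

print(f"Latest published next version: {latest}")  # e.g. a 16.x release, not the assumed 15.x
```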
The model's performance in code generation is inconsistent. It might produce a concise, well-structured plan, but then fail to execute it correctly, often due to a lack of understanding of the underlying harness or internal tools. For instance, it repeatedly failed to update a `package.json` file because it didn't grasp the required "read first" permission step within the Claude Code harness.
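For readers unfamiliar with the harness behavior being described: Claude Code's editing tools refuse to modify a file the model has not read in the current session. The toy sketch below is my own illustration of that gate, not Anthropic's implementation.

```python
# Illustrative only: a toy model of the "read first" gate described above,
# not Claude Code's actual code.

class ReadBeforeEditGate:
    def __init__(self):
        self._read_files: set[str] = set()

    def read(self, path: str) -> str:
        # Reading a file registers it as "seen", which unlocks editing.
        self._read_files.add(path)
        with open(path) as f:
            return f.read()

    def edit(self, path: str, old: str, new: str) -> None:
        # Edits to files the model never read in this session are rejected.
        if path not in self._read_files:
            raise PermissionError(
                f"Edit rejected: {path} must be read before it can be modified."
            )
        content = self.read(path)
        with open(path, "w") as f:
            f.write(content.replace(old, new, 1))


gate = ReadBeforeEditGate()
# gate.edit("package.json", '"next": "15', '"next": "16')  # raises PermissionError
# gate.read("package.json")                                # now the edit would succeed
```

The failure mode described above is the model repeatedly issuing the edit without the preceding read, hitting the equivalent of that `PermissionError` each time.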
A significant "hot take" is that the observed regressions might not be in the model's core intelligence but rather in the "shitty and poorly maintained" Claude Code harness. Constant additions of "slop," system prompt modifications, broken tools, and new rules are seen as degrading the environment in which the model operates. This is exacerbated by Anthropic employees reportedly using an entirely different internal stack of tools and harnesses, leading to a disconnect between their internal experience of the model's capabilities and the degraded experience of public users.
This sentiment is echoed by other users, with reports of Claude models "regressing day after day," failing to perform web searches, and refusing to do paid work. In contrast, OpenAI models are generally perceived as more consistent and stable, with any regressions being transparently communicated and fixed by their engineering teams.
Ultimately, Opus 4.7 is described as "the weirdest model either lab has released in a while." It can tackle complex bugs requiring extensive file changes but simultaneously make simple, "boneheaded" errors. The consistency observed in older models like Opus 4.5 is gone, replaced by a much wider range of output quality. While it can achieve "crazy great things," it also produces "slop."
Despite the frustrations, for users already subscribed to Anthropic's services, there's little harm in switching from Opus 4.6 to 4.7 to test it, as individual experiences may vary. However, the overall impression is one of a model that, while having potential, is hampered by inconsistencies and a degraded user experience due to underlying software issues.