
Did Claude really get dumber again?
AI Summary
This video discusses a perceived regression in Claude's performance, particularly with Claude Opus 4.7 and its coding capabilities. The presenter and others, including the AI director from AMD, have observed that Claude models, especially Opus 4.6 and 4.7, seem to be performing worse than previous versions. The decline is not attributed solely to the models themselves becoming "dumber" in a traditional sense, but rather to a complex interplay of factors affecting user experience.
The issues manifest in various ways: task refusals where the model outright refuses a request or the API blocks it; "dumber solutions" where the model provides incorrect code or fails to follow task intent; and "getting lost" where the model loses track of the user's request or misinterprets past instructions. Quantified evidence from Margin Labs shows a consistent dip in model performance benchmarks from March onwards, indicating a meaningful decline.
The video explores potential causes across several layers of the Claude ecosystem.
**The Harness:** The "harness" is a crucial intermediary layer that shapes system prompts, defines available tools, and manages model interactions. The presenter argues that the harness, particularly within Claude Code, is poorly engineered. Examples include the model needing to explicitly "read" a file before editing it, even if it already knows the contents, leading to unnecessary API calls and token waste. This inefficient engineering is estimated to have cost millions in unnecessary inference. A benchmark comparing Opus in Claude Code versus Cursor showed Opus performing 15% worse in Claude Code, highlighting the harness's negative impact. The speaker suggests that even minor changes to the system prompt within the harness can significantly degrade model performance.
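The read-before-edit behavior described above can be illustrated with a toy model. This is a minimal sketch, not Claude Code's actual implementation: the `ToyHarness` class, its method names, and its error strings are all hypothetical and exist only to show how such a guard turns one tool call into three when the model already knows the file's contents.

```python
class ToyHarness:
    """Hypothetical harness that rejects edits to files not yet read this session."""

    def __init__(self, files):
        self.files = dict(files)   # path -> file contents
        self.read_paths = set()    # paths the model has explicitly read
        self.tool_calls = 0        # every call costs a round trip and tokens

    def read(self, path):
        self.tool_calls += 1
        self.read_paths.add(path)
        return self.files[path]

    def edit(self, path, old, new):
        self.tool_calls += 1
        # The guard: refuse the edit unless a read was recorded first,
        # regardless of whether the model already knows the contents.
        if path not in self.read_paths:
            return "error: file must be read before editing"
        if old not in self.files[path]:
            return "error: text to replace not found"
        self.files[path] = self.files[path].replace(old, new, 1)
        return "ok"


harness = ToyHarness({"a.py": "x = 1\n"})
first_try = harness.edit("a.py", "x = 1", "x = 2")   # rejected by the guard
harness.read("a.py")                                  # forced extra call
second_try = harness.edit("a.py", "x = 1", "x = 2")  # now accepted
print(first_try, second_try, harness.tool_calls)
```

In this toy setup a correct edit that could have taken one call takes three, which is the kind of overhead the presenter attributes to the harness rather than to the model.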
**The API and Routing:** Changes at the API level can also contribute to regressions. Routing errors have been identified in which requests are sent to servers configured for a different model version (such as the 1 million token context window model). Anthropic itself has acknowledged that the 1 million token context version of the model performs worse. The presenter theorizes that Anthropic may be routing traffic away from NVIDIA GPUs toward AWS Trainium and Google TPUs, and that enabling the 1 million token context by default for all users is a way to achieve this, even though this version is less capable. Because disabling the default requires manual configuration, most users are likely getting a dumber model without realizing it.
**Tokenization:** Tokenization, the process of converting text into tokens for the model to process, has also changed. Opus 4.7 uses an updated tokenizer that, while improving text processing, maps the same input to more tokens (between 1x and 1.47x as many). This bloats the context, increasing token usage and potentially causing "context rot," where the model gets bogged down by irrelevant information and acts dumber, much like a human trying to find a bug in a much larger file. Shipping a tokenizer change in a minor version update is considered unusual.
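To make the arithmetic concrete, here is a minimal sketch of how a higher tokens-per-input ratio eats into a fixed context budget. The 1.47 factor is the upper end of the range quoted above; the 200,000-token window and the function name `effective_capacity` are assumptions chosen for illustration, not figures from the video.

```python
def effective_capacity(window_tokens: int, inflation: float) -> int:
    """Return how much text, measured in old-tokenizer tokens, fits in the
    window when the same text now costs `inflation` times as many tokens."""
    if inflation <= 0:
        raise ValueError("inflation must be positive")
    return int(window_tokens / inflation)


WINDOW = 200_000  # hypothetical context window size, in tokens

for factor in (1.0, 1.47):
    cap = effective_capacity(WINDOW, factor)
    print(f"{factor:.2f}x inflation: {cap} old-equivalent tokens fit "
          f"({cap / WINDOW:.0%} of the window)")
```

At the worst-case factor, the same window holds only about 68% as much source text as before, so a session reaches the crowded end of its context sooner even though nothing about the user's input has changed.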
**Compute Hardware and Model Serving:** Anthropic uses a mix of hardware for inference, including NVIDIA GPUs, AWS Trainium, and Google TPUs. The presenter suggests that requests might be routed to different hardware for each step of a multi-turn interaction, leading to inconsistent behavior and potential errors. Ensuring equivalent performance across such diverse compute platforms is a significant challenge.
**The Model Itself and User Expectations:** While the presenter believes much of the regression stems from external factors, the models themselves are also updated. The shift from Opus 4.6 to 4.7 is noted as a period where many users observed regressions. However, the presenter also acknowledges that user expectations have risen. Tasks that once seemed impressive now appear mundane, making failures at a higher baseline feel like regressions.
**Redacted Thinking and Compute Reduction:** A significant concern is Anthropic's decision to redact the model's "thinking" process, which was previously visible in API responses. This change, implemented to prevent "distillation" (where other models learn from Claude's thought process), means users no longer see the intermediate steps the model takes. The presenter hypothesizes that this is also a move to reduce compute costs, since less thinking requires less processing. This redaction, which has gone from 1.5% visible to 100% redacted since March, correlates with measurable behavioral impacts. These include increased "stop violations" (preventing laziness), more user frustration, a shift from research-first to edit-first behavior, and a significant increase in API requests and tokens consumed for demonstrably worse results. The AMD AI director's report highlights a 73% decline in "thinking depth" and a drastic change in the read-to-edit ratio, indicating a less thorough, more action-oriented, and potentially less accurate model.
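A read-to-edit ratio of the kind mentioned above could be computed from a session's tool-call log roughly as follows. This is a sketch under stated assumptions: the summary does not give the report's exact definition, so the log format (a flat list of tool names) and the tool names themselves (`read`, `grep`, `edit`, `write`) are hypothetical.

```python
from collections import Counter


def read_to_edit_ratio(tool_calls,
                       read_tools=("read", "grep"),
                       edit_tools=("edit", "write")):
    """Ratio of research-style calls to modification calls in one session.

    Returns None when the session contains no edits, since the ratio
    is undefined in that case.
    """
    counts = Counter(tool_calls)
    reads = sum(counts[name] for name in read_tools)
    edits = sum(counts[name] for name in edit_tools)
    if edits == 0:
        return None
    return reads / edits


# Hypothetical sessions: one researches before editing, one edits almost immediately.
research_first = ["read", "grep", "read", "read", "edit"]
edit_first = ["read", "edit", "edit", "write"]

print(read_to_edit_ratio(research_first))
print(read_to_edit_ratio(edit_first))
```

A falling ratio across sessions would match the shift from research-first to edit-first behavior that the report describes, though the metric alone does not say why the shift occurred.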
**OpenAI Comparison:** In contrast to Anthropic, the presenter notes that OpenAI models do not appear to suffer from the same consistent, long-term regressions. While temporary issues can occur with new model releases, they are typically resolved quickly, suggesting a more stable engineering approach.
In conclusion, the video argues that Claude's perceived performance degradation is a multifaceted problem stemming from poor engineering in the harness, inefficient API configurations, problematic tokenization, diverse compute environments, and strategic decisions like thinking redaction aimed at reducing compute costs. The presenter believes Anthropic's engineering culture and execution are at the root of these issues, leading users to experience a less capable and more frustrating AI. The suggestion is to consider alternatives if reliable performance is crucial.