
I don’t really like GPT-5.5…
The speaker discusses OpenAI's latest model, GPT 5.5, expressing a mix of excitement and disappointment, noting it's not their favorite release despite its power. The model comes with a significant price increase: $5 per million input tokens and $30 per million output tokens, twice the price of GPT 5.4 and 20% higher than Opus 4.7. While the speaker acknowledges its improved token efficiency, the price hike is still substantial.
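To make the quoted rates concrete, here is a minimal sketch of the per-request cost arithmetic at $5/M input and $30/M output; the token counts in the example are hypothetical, chosen purely for illustration.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = 5.0, out_rate: float = 30.0) -> float:
    """Cost in dollars for one request at per-million-token rates.

    Default rates are the GPT 5.5 prices quoted above ($5/M in, $30/M out).
    """
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Hypothetical request: 20k tokens in, 5k tokens out.
print(round(request_cost(20_000, 5_000), 4))  # -> 0.25
```

Output tokens dominate the bill at these rates, which is why the token-efficiency numbers discussed below matter so much.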
OpenAI describes GPT 5.5 as their "smartest and most intuitive model yet," emphasizing enhanced safeguards to prevent misuse while facilitating beneficial work. They conducted extensive evaluations, including internal and external red teaming and targeted testing for advanced cybersecurity and biology capabilities, gathering feedback from nearly 200 early-access partners. The model is larger, which would normally hurt speed, but optimization work, aided by a partnership with Nvidia and the use of its latest GB200 NVL72 systems, keeps it responsive. API support for GPT 5.5 is not yet available, though it is expected soon, and workarounds exist for early testing.
Benchmarking results show GPT 5.5 outperforming previous models. On Terminal-Bench it scored 82.7% versus GPT 5.4's 75.1%, and its internal SWE-bench score was 73%, up from 68.5%. On GDPval it achieved 84.9% wins or ties, slightly better than GPT 5.4's 83.0%, although the speaker found this benchmark somewhat misleading because the gain came from more ties rather than outright wins. It also performed well on OSWorld-Verified, beating Opus 4.7 by 7%, and on Toolathon. On BrowseComp, the Pro model of 5.5 achieved 79.3%, surpassing Opus, but the non-Pro version scored 84.4%, lower than Google's Gemini 3.1 Pro. The speaker suggests that the Pro model's BrowseComp performance comes from better tool calls and coherency over long runs rather than visual recognition, where Google models still excel. GPT 5.5 also showed strong performance on FrontierMath and CyberGym.
A key insight from the Artificial Analysis Intelligence Index benchmark is the model's token efficiency. GPT 5.5 at the xhigh reasoning level used 75 million tokens, roughly half of GPT 5.4's usage and significantly less than Claude Opus 4.6/4.7. The high and medium settings of GPT 5.5 were even more efficient, using 45 million and 22 million tokens, respectively, while landing at similar intelligence levels to previous high-performing models. This efficiency, despite the higher per-token cost, makes GPT 5.5 at "medium" roughly the same price as GPT 5.4 at xhigh. OpenAI strongly recommends using the lower reasoning levels (low and medium) unless absolutely necessary, a departure from previous recommendations.
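A back-of-the-envelope sketch of what those token totals imply for run cost, assuming the quoted $30/M output rate for GPT 5.5 and an inferred $15/M for GPT 5.4 (from "twice the price"); the GPT 5.4 token total is read backwards from "roughly half," so every number here is an approximation from the summary, not official pricing.

```python
def run_cost(tokens_millions: float, rate_per_million: float) -> float:
    """Rough cost of a full benchmark run: millions of tokens x price per million."""
    return tokens_millions * rate_per_million

print(run_cost(75, 30))   # GPT 5.5 xhigh  -> 2250: half the tokens...
print(run_cost(150, 15))  # GPT 5.4 xhigh  -> 2250: ...at twice the price nets out even
print(run_cost(45, 30))   # GPT 5.5 high   -> 1350
print(run_cost(22, 30))   # GPT 5.5 medium -> 660
```

Under these assumptions, the savings only materialize at the lower reasoning levels, which is consistent with OpenAI's recommendation to default to low or medium.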
Regarding frontend capabilities, the speaker noted an improvement in 3D understanding, despite the models' tendency to include "cards" in generated designs. The speaker demonstrated this by having GPT 5.5 modernize an old 2D game called "Fish Slop." While GPT 5.5 produced a significantly better and more visually appealing version than previous models, it initially struggled to fully grasp the intent of making it a truly 3D game, instead just replacing assets with 3D ones in a 2D experience. This highlighted a recurring issue for the speaker: the model tends to fulfill requests "just barely" rather than fully honoring the underlying intent.
Other testers and companies have largely praised GPT 5.5. Michael Truell from Cursor noted its improved persistence and stronger coding performance, handling complex, long-running tasks more reliably. Lovable highlighted its ability to tackle complex tasks like auth flows and real-time syncing with fewer iterations. Cognition stated it sets a new bar for development, catching bugs and fixing production issues end-to-end. These positive reviews underscore the model's impressive capabilities in writing code and solving complex problems.
However, the speaker's main dissatisfaction stems from the model's "laziness." It often feels like it's trying to complete a task minimally without fully addressing the intent. A significant problem is its inability to discard incorrect information once it enters the context window. If the model goes down the wrong path or finds incorrect information during research, it will persistently fall back on that data, even if explicitly told to stop. This often necessitates starting a new thread, leading to a frustrating experience with long-running conversations. While the model is impressive in what it can achieve within its 400k token window, the need to frequently restart threads due to context issues is a major drawback.
The Pro version of GPT 5.5, accessible through the ChatGPT site during testing, proved exceptionally powerful. The speaker used it to solve three unsolved DEF CON puzzles that had stumped experts for 5-10 years. While one of these was also solvable by GPT 5.4 Pro, the new model's overall capability in tackling complex ciphers and challenges was remarkable. One example was a two-part cipher puzzle that GPT 5.5 Pro solved in 163 minutes; when given an accidental hint (a link to a gist), it solved the same puzzle in under five minutes by following the GitHub link to the relevant information.
In conclusion, the speaker emphasizes that while GPT 5.5 is the "smartest model ever made" and produces the best AI-generated code, users need to adapt their approach. This involves providing more direct and detailed prompts upfront, conducting preliminary research in separate threads to supply the model with correct resources, and frequently starting new threads to avoid issues with persistent incorrect context. The model's difficulty in managing long-running threads and its tendency to "barely" fulfill intent make it a different experience from previous models. The speaker suggests it should have been named GPT 6 due to these significant operational differences, requiring users to rethink their traditional methods to leverage its full potential.
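The adaptation the speaker describes, research in a separate thread followed by a fresh thread seeded only with verified findings, can be sketched as a simple two-phase pattern. This is a minimal illustration, not the speaker's code: `ask` is a placeholder you would wire to whatever chat client you use (one call standing in for one fresh thread), and the prompt wording is an assumption.

```python
from typing import Callable

def two_phase_prompt(ask: Callable[[str], str], research_q: str, task: str) -> str:
    """Two-phase pattern: a disposable research thread, then a clean task thread.

    `ask` is a placeholder for a chat-client call; each call here stands in
    for a brand-new conversation thread.
    """
    # Phase 1: throwaway research thread. Its raw context is never reused,
    # so wrong turns here cannot poison the main task's context window.
    findings = ask(f"Research only; report concise, verified notes on: {research_q}")
    # Phase 2: fresh thread seeded with the distilled findings plus a direct,
    # detailed task prompt (the upfront detail the speaker recommends).
    return ask(f"Context (verified notes):\n{findings}\n\nTask: {task}")
```

The point of the pattern is that incorrect material discovered during research stays quarantined in the first thread instead of lingering in the context of the thread that does the real work.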