
Beyond Bigger Models: Recursion As The Next Scaling Law In AI
AI Summary
This episode of Decoded features YC visiting partner Francois Chaubard discussing the trend of recursion in AI research, specifically how it can enhance model reasoning at inference time as an alternative to simply increasing model size. Two 2025 papers, the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model (TRM), are highlighted for demonstrating this approach.
The discussion begins by contrasting Recurrent Neural Networks (RNNs) with Large Language Models (LLMs). RNNs, once thought to be crucial for Artificial General Intelligence (AGI), work by recursively calling a model on its own hidden state. Their major limitation is training via "backpropagation through time": rolling the model out over many steps, especially with long contexts, produces noisy gradients that tend to vanish or explode, and requires retaining activations at every step, posing significant memory challenges.
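To make the memory problem concrete, here is a minimal PyTorch sketch (an illustration, not code from the episode) of why unrolling an RNN over T steps forces autograd to retain all T activations:

```python
import torch
import torch.nn as nn

# Unrolling an RNN for T steps means autograd must keep activations for all
# T steps until backward(), and gradients are products of T Jacobians --
# the source of vanishing/exploding gradients.
rnn_cell = nn.RNNCell(input_size=32, hidden_size=64)
T = 500                             # long context: memory grows linearly with T
x = torch.randn(T, 8, 32)           # (time, batch, features)
h = torch.zeros(8, 64)

for t in range(T):
    h = rnn_cell(x[t], h)           # each call adds a node to the autograd graph

loss = h.pow(2).mean()
loss.backward()                     # backprop through all T steps at once
```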
LLMs, on the other hand, are trained with a single feed-forward pass. The transformer blocks process all positions in parallel under a causal mask, which lets every time step be trained in one go and sidesteps the sequential dependency, and hence the vanishing-gradient problem, that plagues RNN training. However, LLMs sacrifice latent reasoning and compression along the time dimension: where an RNN compresses history into a hidden state, an LLM must retain the entire input context at each decoding step.
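A minimal sketch of the mechanism, assuming standard scaled dot-product attention (the shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

# A causal mask lets attention at position t see only positions <= t, so all
# positions are trained in one parallel forward pass -- no recurrent state.
B, T, D = 8, 128, 64
q = torch.randn(B, T, D)
k = torch.randn(B, T, D)
v = torch.randn(B, T, D)

scores = q @ k.transpose(-2, -1) / D**0.5        # (B, T, T): all pairs at once
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal, float("-inf"))
out = F.softmax(scores, dim=-1) @ v              # (B, T, D), in one shot
```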
A critical limitation of LLMs, according to the discussion, lies in their reasoning abilities. While LLMs are adept at next-token prediction, they struggle with tasks that require many sequential steps. Sorting an unsorted list, for instance, requires Ω(n log n) comparisons in the worst case, so a transformer with a fixed number of layers simply runs out of depth once the list is long enough. The argument is framed in terms of automata theory: a feed-forward model with fixed depth performs only a bounded amount of sequential computation per forward pass, whereas a Turing machine can loop for as long as the problem demands. LLMs can be made effectively Turing complete at test time through chain-of-thought prompting or tool use, but the speaker argues these are hacks: chain-of-thought leans on reasoning patterns memorized from training data, and tool use is limited to known functions rather than discovering new algorithms from first principles.
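To make the intuition concrete, consider a toy Python analogy (an illustration, not from the episode): treat one pass of adjacent compare-and-swap as a "layer". A fixed stack of layers fails once elements must travel farther than the depth allows, while looping the same layer to a fixed point sorts any input; the extra compute comes from iteration, not parameters.

```python
# One "layer" = a single pass of adjacent compare-and-swap. A fixed stack of
# d layers can only sort lists where no element must move more than d
# positions toward the front; recursing the SAME layer to a fixed point
# sorts anything -- depth adapts to the problem instead of the parameter count.
def layer(xs):
    xs = list(xs)
    for i in range(len(xs) - 1):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

def fixed_depth(xs, d=3):            # analogue of a d-layer feed-forward net
    for _ in range(d):
        xs = layer(xs)
    return xs

def recurse_to_fixed_point(xs):      # analogue of recursion at inference time
    while True:
        nxt = layer(xs)
        if nxt == xs:
            return nxt
        xs = nxt

data = [9, 3, 7, 1, 8, 2, 6, 0]
print(fixed_depth(data))             # still unsorted: ran out of "layers"
print(recurse_to_fixed_point(data))  # sorted: iteration supplies the compute
```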
The conversation then delves into the HRM paper. The model is presented as being in the lineage of RNNs, incorporating the bio-inspired idea that different parts of the brain operate at different frequencies. HRM features a hierarchical structure with a low-level and a high-level module plus an outer refinement step, creating three levels of recursion. The key innovation in HRM's training is its departure from standard backpropagation through all recursion steps, using a technique inspired by Deep Equilibrium Models (DEQ). Rather than resetting hidden states to zero for each batch, HRM performs multiple forward passes on the same input, letting the hidden states (z_L and z_H, referred to as the "carry") evolve; the process is framed as constructing mini-batches from different memory states rather than different inputs. A "stop grad" operation truncates backpropagation so it does not unroll all the way back through the recursion, mitigating the backprop-through-time issues. This truncated backpropagation, reaching only a single step of the higher-level module (hnet), turns out to be surprisingly sufficient.
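A hedged PyTorch-style sketch of this training scheme follows; the module names (ToyBlock, l_net, h_net), loop counts, and placeholder loss are illustrative stand-ins, not the paper's actual code:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Illustrative stand-in for HRM's recurrent transformer modules."""
    def __init__(self, d, n_inputs):
        super().__init__()
        self.lin = nn.Linear(n_inputs * d, d)
    def forward(self, *states):
        return torch.tanh(self.lin(torch.cat(states, dim=-1)))

def hrm_step(l_net, h_net, x_emb, z_l, z_h, n_low=4, n_high=2):
    with torch.no_grad():                      # "stop grad": recursion runs gradient-free
        for _ in range(n_high):
            for _ in range(n_low):
                z_l = l_net(z_l, z_h, x_emb)   # fast, low-level updates
            z_h = h_net(z_h, z_l)              # slow, high-level update
    z_l = l_net(z_l, z_h, x_emb)               # final steps WITH gradients:
    z_h = h_net(z_h, z_l)                      # the DEQ-inspired one-step gradient
    return z_l.detach(), z_h.detach(), z_h

d = 64
l_net, h_net = ToyBlock(d, 3), ToyBlock(d, 2)
x_emb = torch.randn(8, d)
z_l, z_h = torch.zeros(8, d), torch.zeros(8, d)

# Outer refinement loop: repeated forward passes on the SAME input, so each
# pass acts like a mini-batch drawn from evolving memory states.
for segment in range(3):
    z_l, z_h, out = hrm_step(l_net, h_net, x_emb, z_l, z_h)
    out.pow(2).mean().backward()               # placeholder loss per segment
```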
The discussion touches on the biological plausibility of these models. While biological inspiration can spark ideas, machine learning ultimately optimizes for what works computationally, especially on GPUs. The speaker finds the connection to automata theory and fundamental data structures and algorithms more compelling, viewing the model's hidden states as akin to a Turing machine's tape or a radix sort's memory bank: a learned memory cache that enables efficient computation.
The TRM paper is presented as a simplification of, and improvement upon, HRM. A key finding from HRM's ablations was that the outer refinement loop contributed most to performance scaling. TRM keeps that loop but simplifies the architecture: it collapses the distinct low-level and high-level networks (lnet and hnet) into a single weight-shared network ("net"). Where HRM used four transformer layers in each of lnet and hnet, TRM's single network has only two layers, and on some tasks, such as Sudoku, a simple Multi-Layer Perceptron (MLP) performed as well as or better than the transformer. TRM also changes the gradient computation: it backpropagates through one full latent recursion, rather than through just one call to the higher-level module as in HRM. This allows deeper effective recursion without the full backprop-through-time penalty.
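A hedged sketch of the TRM recipe under the same caveats (TinyNet, the loop counts, and the placeholder loss are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for TRM's single tiny network, shared across all updates."""
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(3 * d, d)
    def forward(self, a, b, c):
        return torch.tanh(self.lin(torch.cat([a, b, c], dim=-1)))

def trm_refine(net, x, y, z, n=6):
    for _ in range(n):
        z = net(x, y, z)           # update the latent reasoning state
    y = net(y, z, z)               # update the answer (toy net reuses z where the paper drops x)
    return y, z

d = 64
net = TinyNet(d)
x = torch.randn(8, d)              # input embedding
y, z = torch.zeros(8, d), torch.zeros(8, d)

for step in range(4):              # outer refinement loop
    with torch.no_grad():          # earlier recursions: cheap, gradient-free
        for _ in range(2):
            y, z = trm_refine(net, x, y, z)
    y, z = trm_refine(net, x, y, z)    # the last FULL recursion is backpropped
    y.pow(2).mean().backward()         # placeholder loss
    y, z = y.detach(), z.detach()      # carry states into the next step
```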
A significant outcome of TRM is its ability to achieve state-of-the-art performance on benchmarks like the ARC Prize with a much smaller model: roughly 7 million parameters versus HRM's 27 million. This illustrates the core principle that increasing recursion can be as effective as, and sometimes more effective than, simply increasing model size. The paper suggests larger models may still be necessary for harder problems, but recursion offers a powerful avenue for achieving high performance with far more efficient architectures.
The discussion then moves to broader implications for AI research. Recursion is emphasized as a lasting trend rather than a one-off trick, and the combination of an outer refinement loop with truncated backpropagation is highlighted as a powerful idea that warrants further exploration. The biggest future direction is seen as combining the strengths of these small recursive models with the vast knowledge encoded in giant LLMs: current LLMs are effective at finding good embedding representations, but the reasoning performed within those latent spaces is often shallow. The proposed future uses small recursive models to reason within the rich latent spaces discovered by larger models, yielding highly efficient architectures for scaled-up reasoning.
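In that speculative spirit, a toy sketch of what such a hybrid might look like; everything here, including the frozen-encoder stand-in, is an assumption rather than anything specified in the episode or the papers:

```python
import torch
import torch.nn as nn

# Speculative hybrid: a frozen large model supplies the latent representation,
# and a tiny recursive model does the iterative reasoning inside that space.
class TinyReasoner(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(2 * d, d)
    def forward(self, h, z):
        return torch.tanh(self.lin(torch.cat([h, z], dim=-1)))

d = 512
frozen_llm = nn.Linear(1024, d).requires_grad_(False)   # stand-in for a frozen LLM encoder
reasoner = TinyReasoner(d)                              # the only trained component

tokens = torch.randn(8, 1024)                           # stand-in for token features
h = frozen_llm(tokens)                                  # rich latent from the big model
z = torch.zeros(8, d)
for _ in range(16):                                     # recursion supplies the reasoning compute
    z = reasoner(h, z)
```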
Finally, the distinction between these task-specific recursive models and general-purpose LLMs is noted. While LLMs are designed for broad applicability, HRM and TRM were trained for specific tasks. The exciting prospect lies in developing more general-purpose agents that can leverage the reasoning capabilities of recursive models within the broad contextual understanding provided by LLMs, enabling efficient and powerful AI systems.