(Paper Review) Recursive Language Models
@po_oamen | October 15, 2025
MIT's Recursive Language Models sidestep context rot by letting models interact with information strategically through code rather than processing everything at once. A smaller model using the RLM framework outperforms a larger base model by more than 33 percentage points on long-context tasks because it decides how to chunk, search, and recursively process the context instead of trying to hold everything in working memory simultaneously.
Recursive Language Models: A Strategic Approach to Long-Context Processing
The challenge of context degradation in production systems
Consider a common scenario in modern AI deployment: you're working with a language model - Claude, ChatGPT, Grok, or any frontier system - on an extended task. Perhaps you're developing a complex software architecture, analyzing a large dataset, or working through a multi-faceted research problem. After an hour of productive back-and-forth, you begin to notice a subtle but unmistakable shift in the model's responses. The quality degrades. It misses connections to earlier parts of the conversation. Details you established twenty exchanges ago seem to have vanished from its reasoning.
This phenomenon is not a matter of exceeding context window capacity. Modern frontier models support context windows of hundreds of thousands of tokens, with some claiming millions. The technical headroom exists. Yet performance degrades nonetheless, and in ways that standard benchmarks fail to capture adequately. The research community refers to this as "context rot" - formally defined by Anthropic as occurring "when the number of tokens in the context window increases, the model's ability to accurately recall information from that context decreases" [1].
However, this definition only partially captures the observed behavior. Frontier models achieve over 90% accuracy on needle-in-a-haystack benchmarks like RULER [2], even with year-old architectures. They can locate specific information buried in massive documents with high reliability. Yet in production usage - extended conversations, complex multi-document analysis, or iterative problem-solving - something fundamental breaks down. The degradation manifests not as catastrophic failure but as a gradual erosion of coherence, context awareness, and reasoning quality.
This discrepancy between benchmark performance and real-world behavior suggests that context rot encompasses more than simple information retrieval. It reflects deeper limitations in how models process, maintain, and utilize extended contexts during inference.
A paradigm shift in long-context processing
Recent work by Zhang and Khattab at MIT [3] introduces Recursive Language Models (RLMs), a framework that approaches this problem from a fundamentally different angle. Rather than attempting to make models better at processing enormous contexts in single forward passes - the dominant research direction of the past several years - RLMs ask a different question: why must any single model call process the entire context at all?
The insight is elegant. In standard LLM deployment, we essentially force the model to hold everything in working memory simultaneously. The architecture looks like this:
Model(massive_context + query) → answer
We provide the full context - potentially hundreds of thousands of tokens - alongside the user's query, the model processes all of it in a single forward pass, and produces an answer. It's analogous to asking someone to memorize an entire library, hold all that information in active memory, and then answer a question about one specific passage in one specific book.
RLMs invert this paradigm:
Model(query) → strategic_interaction(context) → answer
The model receives only the query. The context lives in a separate computational environment - specifically, a Python REPL (Read-Eval-Print Loop) environment, similar to Jupyter notebooks - where it exists as an accessible variable. The model can then write code to inspect that context strategically: peeking at structure, searching for patterns, partitioning into manageable chunks, and spawning recursive calls to itself to process different pieces. The model becomes an active agent deciding how to interact with information, rather than a passive consumer forced to process everything at once.
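To make the control flow concrete, here is a minimal sketch of such a loop, assuming a generic chat-completion client; the helper names (`run_code`, `recursive_llm`, `final_answer`) are illustrative and not the authors' actual API.

```python
# A minimal sketch of an RLM control loop (illustrative; not the authors' exact API).
# Assumes a generic chat-completion client `llm(prompt) -> str` and a sandboxed
# Python REPL in which the long context is pre-loaded as a string variable.

def recursive_lm(query: str, context: str, llm, run_code) -> str:
    """Root call: the model sees only the query plus a description of the REPL
    environment it can drive; it never receives the full context directly."""
    system = (
        "A variable `context` (a very large string) is loaded in a Python REPL. "
        "Write code to inspect, filter, and chunk it. You may call "
        "`recursive_llm(sub_query, sub_context)` to delegate a focused piece to "
        "another model call. When confident, call `final_answer(text)`."
    )
    transcript = f"{system}\n\nUser query: {query}"
    while True:
        code = llm(transcript)                    # model proposes the next REPL cell
        output, answer = run_code(code, context)  # execute it in the sandbox
        if answer is not None:                    # the model called final_answer(...)
            return answer
        transcript += f"\n>>> {code}\n{output}"   # feed the observation back and iterate
```

The snippets below illustrate the kinds of REPL cells the model tends to write inside this loop.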
Architectural design and emergent behaviors
The RLM architecture introduces two key design choices that enable this paradigm. First, treating the context as a Python variable allows programmatic interaction. The model can use standard programming constructs - indexing, slicing, pattern matching, filtering - to navigate and extract relevant information. This transforms the problem from "process all tokens" to "write code that efficiently finds and processes relevant tokens."
Second, enabling the model to invoke recursive calls to itself (or smaller models) creates a natural mechanism for parallel processing and hierarchical decomposition. When faced with a task requiring semantic understanding across thousands of entries, the model can partition the data, spawn recursive calls to handle each partition, and aggregate results. Each recursive call operates on a focused subset, avoiding the context rot that would affect a single call processing everything.
Crucially, the researchers did not program specific strategies for context interaction. They provided the tools - code execution, context access, recursive invocation - and observed what strategies emerged naturally. The model developed four primary patterns:
Inspection (peeking): Before committing to a processing strategy, the model examines a small sample of the context to understand its structure. Similar to how a programmer begins data analysis by inspecting the first few rows, the model typically requests the first ~2000 characters. This reconnaissance informs subsequent decisions about parsing, filtering, and decomposition.
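A hypothetical first REPL cell of this kind might look like the following (illustrative only):

```python
# Reconnaissance cell the model typically writes first (hypothetical example).
print(len(context))      # how large is the context?
print(context[:2000])    # peek at the first ~2000 characters to learn the entry format
print(context[-500:])    # and at the tail, in case there is a footer or an index
```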
Pattern-based filtering: For many tasks, the model can narrow the relevant context substantially before semantic processing. Rather than reading and understanding every entry, it uses regular expressions or keyword matching to identify candidates. For example, if a query specifies particular user IDs, the model writes code to grep for those IDs, then processes only matching entries. This reduces the semantic processing burden by orders of magnitude.
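Continuing the hypothetical REPL session, a filtering cell for the user-ID example might look like this (the IDs and the one-entry-per-line assumption are illustrative):

```python
import re

# Cheap lexical filtering before any semantic processing (hypothetical cell).
user_ids = ["u_1042", "u_2907", "u_3315"]                 # illustrative IDs from the query
pattern = re.compile("|".join(map(re.escape, user_ids)))
entries = context.split("\n")                             # assumes one entry per line
candidates = [e for e in entries if pattern.search(e)]
print(f"{len(entries)} entries -> {len(candidates)} candidates after filtering")
```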
Partition-and-map operations: When semantic understanding is unavoidable - classifying questions by type, extracting specific information, making judgments - the model partitions the context into chunks of manageable size. It spawns recursive calls, each receiving one chunk and a focused sub-query. These recursive instances process their assigned portions independently, return structured results, and the root call aggregates them. This pattern mirrors MapReduce paradigms in distributed computing.
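A sketch of such a cell, continuing the hypothetical session above (`recursive_llm` and `final_answer` are the assumed REPL helpers from the earlier sketch, and the sub-query wording is illustrative):

```python
# Partition the filtered candidates and delegate each chunk to a recursive call
# with a focused sub-query, then aggregate the partial counts (hypothetical cell).
CHUNK = 200                                               # entries per recursive call (assumption)
chunks = [candidates[i:i + CHUNK] for i in range(0, len(candidates), CHUNK)]

sub_query = (
    "Classify each entry below by question type and return ONLY the number of "
    "entries whose label is 'entity'."
)
partial_counts = []
for chunk in chunks:
    reply = recursive_llm(sub_query, "\n".join(chunk))    # each call sees one small chunk
    partial_counts.append(int(reply.strip()))             # assumes the sub-call returns a bare number

final_answer(str(sum(partial_counts)))                    # the root call aggregates the results
```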
Adaptive summarization: The model naturally employs summarization when appropriate, but critically, it decides when and how to summarize based on the query and context structure. This represents a learned generalization of manual summarization heuristics, with the model adapting its strategy to specific circumstances rather than following rigid predetermined rules.
The emergence of these strategies without explicit training suggests they reflect efficient solutions to the underlying computational problem. The model, given tools and flexibility, converges on patterns that minimize redundant processing, leverage cheap operations before expensive ones, and decompose large problems into parallelizable subproblems.
Empirical validation on challenging benchmarks
Zhang and Khattab evaluate RLMs on benchmarks specifically selected for their difficulty with current frontier models. Standard long-context benchmarks, they found, are often solvable by recent models, sometimes even without the provided context. The chosen benchmarks present genuine challenges:
OOLONG (trec_coarse split): This benchmark provides contexts of 3,000-6,000 entries (128k-263k tokens) and poses distributional queries requiring semantic classification coupled with numerical aggregation. A representative query might read: "Among entries associated with user IDs [list of 20 IDs], how many should be classified as label 'entity'?" The model must semantically understand each entry well enough to classify it, filter by user ID constraints, and accurately count results. The combination of semantic and arithmetic operations, performed over thousands of entries without preprocessing, stresses both comprehension and attention mechanisms.
Results demonstrate substantial performance gains:
- Base model (small): ~30% accuracy
- Base model (large): ~35% accuracy
- RLM (small model): over 65% accuracy
The RLM approach using a smaller model achieves an absolute improvement of more than 33 percentage points over the larger base model - more than doubling its accuracy. This occurs at roughly equivalent API cost, since the larger number of small calls offsets the expense of fewer large ones. Ablation studies that remove the recursive capability show roughly a 10% performance drop, confirming that recursive decomposition contributes meaningfully rather than serving merely as scaffolding.
BrowseComp-Plus: This benchmark evaluates multi-hop retrieval over document collections of varying sizes. Queries require connecting information across multiple sources: "I am looking for a trading card released between 2005-2015, with multiple rarities, used by a Japanese world champion, that served as armor for another card released 2013-2018, was once banned, and is below level 8. What is this card?" Each clue potentially exists in a different document, and determining correctness requires cross-referencing multiple sources.
The scaling behavior is particularly revealing:
| Method | 10 docs | 100 docs | 1000 docs |
|---|---|---|---|
| Base Model | ~95% | ~85% | ~75% |
| ReAct + BM25 | ~95% | ~88% | ~85% |
| RLM (full) | ~100% | ~100% | ~100% |
Traditional approaches exhibit clear degradation as document count increases. RLM maintains near-perfect accuracy across all scales tested, including the 1000-document condition, which represents over 10M tokens of context. Importantly, this occurs without traditional retrieval indexing - no BM25, no vector embeddings, no preprocessing. The model determines its own retrieval strategy dynamically through the code it writes and the recursive structure it constructs.
Fundamental distinction from agent-based approaches
The superficial similarity to agent systems - multiple LLM calls coordinating to solve a task - obscures a fundamental difference in problem formulation. Agent systems typically decompose tasks: breaking a high-level goal into sub-tasks based on human intuition about problem structure. If the goal is booking a flight, an agent might decompose this into: search flights, compare prices, check calendar conflicts, proceed to checkout. This task decomposition is usually designed by humans who understand the domain.
RLMs decompose context, not tasks. The model does not reason about "what are the steps to solve this problem?" but rather "how should I strategically interact with this information?" It determines how to partition the context, what to examine first, where to search, when to recurse, and how to aggregate partial results. The decomposition is data-driven rather than task-driven, and crucially, it is determined by the model based on the specific structure of the query and context it encounters.
From the user's perspective, an RLM call appears identical to a standard model call - model.completion(query) - but the execution strategy differs fundamentally. The user is not designing a workflow, specifying agents, or orchestrating calls. They are simply allowing the model to be strategic about information consumption.
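In other words, an RLM can sit behind the same interface as any other completion call. A hypothetical wrapper, reusing the `recursive_lm` sketch above (the authors' actual interface may differ):

```python
class RLMClient:
    """Hypothetical drop-in wrapper: callers see an ordinary completion method."""

    def __init__(self, llm, run_code):
        self.llm = llm            # base chat-completion client
        self.run_code = run_code  # sandboxed REPL executor

    def completion(self, query: str, context: str) -> str:
        # All peeking, filtering, chunking, and recursion happens inside this call;
        # the caller never designs a workflow or orchestrates sub-agents.
        return recursive_lm(query, context, self.llm, self.run_code)
```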
A useful analogy contrasts a team following a predetermined workflow (agents) with a skilled researcher given access to a well-organized library (RLMs). The researcher can skim catalog entries, pull specific books, cross-reference sections, and recruit assistance for parallel work, all decided adaptively based on the research question. Both paradigms have utility, but they address different challenges.
Interpretability and debuggability advantages
A significant practical advantage of the RLM framework is interpretability. The researchers developed a visualization tool that traces the sequence of operations: the code the model writes, the recursive calls it spawns, the data each call processes, and how results aggregate. This provides insight not merely into what answer the model produced, but how it decided to find that answer.
When an RLM produces an incorrect result, the execution trace enables debugging. Did it grep for the wrong pattern, filtering out relevant context? Did it partition inappropriately, creating imbalanced chunks? Did a recursive call misclassify data? The programmatic nature of the approach makes failure modes inspectable in ways that standard model calls do not permit.
This contrasts sharply with standard LLM deployment, where debugging consists primarily of prompt refinement and hope. With RLMs, you can understand the information-processing strategy, identify where it failed, and potentially guide improvements through modified prompts, examples, or environmental constraints.
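The kind of structure such a trace can take is sketched below; this schema is our own illustration, not the authors' visualization format, but it records the same information (code written, sub-calls spawned, results returned).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CallTrace:
    """Hypothetical trace record for one (possibly recursive) RLM call."""
    query: str                                                  # sub-query given to this call
    code_cells: List[str] = field(default_factory=list)        # REPL cells the model wrote
    outputs: List[str] = field(default_factory=list)           # what each cell printed
    children: List["CallTrace"] = field(default_factory=list)  # recursive calls it spawned
    answer: Optional[str] = None                                # what it returned to its parent

def print_trace(trace: CallTrace, depth: int = 0) -> None:
    """Print the recursion tree, one indented line per call, for quick debugging."""
    pad = "  " * depth
    print(f"{pad}- query: {trace.query[:60]!r} -> answer: {trace.answer!r}")
    for child in trace.children:
        print_trace(child, depth + 1)
```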
Performance characteristics and optimization opportunities
The current implementation prioritizes proof-of-concept over production optimization. Each recursive call executes sequentially (blocking on completion before proceeding), no caching is implemented for repeated patterns, and no mechanisms exist to impose cost or latency budgets. Depending on the model's chosen decomposition strategy, queries require anywhere from seconds to several minutes.
However, this presents opportunity rather than limitation. For systems researchers, the lack of optimization represents low-hanging fruit. Obvious improvements include:
- Asynchronous execution: Recursive calls that operate on independent context partitions can execute in parallel rather than sequentially (see the sketch after this list).
- Prefix caching: Repeated patterns in code or context can be cached to avoid redundant computation.
- Budget constraints: Cost and latency limits can be imposed, forcing the model to work within resource bounds.
- Speculative execution: Likely branches in the recursion tree can be evaluated speculatively.
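As a sketch of the first item, independent recursive calls could be fanned out with asyncio; `recursive_llm_async` is an assumed async variant of the recursive-call helper, not part of the reference implementation.

```python
import asyncio

async def map_chunks(sub_query: str, chunks: list[str]) -> list[str]:
    # Fan out one recursive call per chunk and await all partial results together,
    # instead of blocking on each call sequentially as the current prototype does.
    tasks = [recursive_llm_async(sub_query, chunk) for chunk in chunks]
    return await asyncio.gather(*tasks)
```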
More fundamentally, the strategies models use to interact with context are learnable through training. This represents another axis of test-time compute scaling, analogous to how we now train reasoning models to employ chain-of-thought [4]. Reinforcement learning could optimize models to develop more efficient recursion strategies: better partition schemes, more selective filtering, adaptive depth decisions. The key insight is that no single model call requires handling enormous contexts. Training focuses on strategic decomposition rather than brute-force capacity.
Scaling properties and future trajectories
Long-context capability has historically been framed as a model architecture problem. Researchers developed positional-encoding techniques such as ALiBi [5] and YaRN [6] to extend context windows. Subsequently, the community turned to systems bottlenecks such as the quadratic complexity of attention, even though empirical work suggests that the MLP (feed-forward) components often dominate cost in practice [7]. The current understanding involves messy interactions between architecture, systems optimization, and the fundamental challenge that very long contexts are out-of-distribution for model training.
RLMs sidestep this entire landscape. Rather than solving context rot directly, they avoid triggering it. Instead of making models better at processing million-token contexts, they enable strategic interaction with arbitrary context sizes without requiring any single call to handle the full scope.
This approach exhibits favorable scaling properties. As base models improve, RLM capabilities scale proportionally - and potentially super-linearly. If tomorrow's frontier model handles 10M tokens effectively, an RLM built atop it can handle 100M tokens, likely at lower per-token cost. The framework multiplies whatever progress occurs in base model quality, without requiring architectural changes or retraining.
Zhang and Khattab frame this as a fundamentally different bet than agent-based systems: "Agents are designed based on human intuition about how to break down problems. RLMs are designed based on the principle that models should decide how to break down problems in ways that are digestible for them." The outcome remains uncertain, but the preliminary evidence suggests RLMs capture something meaningful about efficient information processing under resource constraints.
Implications and open research directions
This work reframes the long-context challenge in productive ways. The dominant research question has been "how do we make models handle longer contexts without degrading?" - treating it as an engineering problem to overcome through better architectures or systems. Zhang and Khattab pose a different question: "why must any single model call handle enormous contexts?" The reframing leads to qualitatively different solutions.
Several important research directions emerge:
Training for recursion: Current results use models without specific training for recursive decomposition strategies. Explicitly optimizing for recursion through reinforcement learning, using downstream performance as reward, could yield substantial improvements. The space of possible decomposition strategies is large, and current models likely explore only a small region.
Theoretical analysis: What are the computational complexity properties of RLMs compared to traditional approaches? Under what conditions does recursive decomposition provably reduce the required context per call? Can we characterize the class of problems amenable to this approach?
Hybrid approaches: RLMs and traditional retrieval methods are not mutually exclusive. Combining dense retrieval for initial filtering with RLM-based processing of retrieved documents might offer advantages over either approach alone.
Generalization beyond text: The core principle - strategic interaction with large data structures through programmatic tools and recursion - extends beyond text. Can similar approaches apply to large-scale structured data, knowledge graphs, or multi-modal contexts?
Formal guarantees: Can we provide bounds on cost, latency, or quality? Can models learn to estimate their own confidence in recursion strategies, perhaps determining when to recurse versus process directly?
The broader implication is methodological. RLMs demonstrate that creative problem reformulation - asking different questions rather than optimizing existing formulations harder - can yield substantial progress. The fact that smaller models using this framework outperform larger models on genuinely challenging benchmarks suggests the approach captures something real about efficient information processing.
Conclusion
Recursive Language Models represent a conceptual advance in long-context processing for language models. Rather than treating extended contexts as a scaling problem to be overcome through larger architectures or better systems, RLMs reframe it as a strategic interaction problem. By enabling models to programmatically interact with context through code, spawn recursive sub-queries, and adaptively determine processing strategies, the framework achieves substantial performance improvements while sidestepping the context rot phenomenon that plagues traditional approaches.
The empirical results are compelling: smaller models using RLM strategies outperform larger base models by more than 33 percentage points on challenging benchmarks, maintain near-perfect accuracy where traditional methods degrade, and handle contexts exceeding 10M tokens effectively. The interpretability advantages, emergent strategies, and favorable scaling properties suggest this paradigm has significant potential for production deployment.
Whether RLMs represent the final solution to long-context processing or constitute one step in a longer research trajectory remains to be determined. However, the work demonstrates that creative reformulation of fundamental assumptions - in this case, questioning the necessity of processing entire contexts in single calls - can yield meaningful progress on challenges that appear intractable under existing paradigms.
This review is based on the authors' blog post and pre-publication materials. The full paper is forthcoming, and specific implementation details may be refined in the published version.
References
[1] Anthropic, "Effective context engineering for AI agents," Anthropic Engineering Blog, 2025. [Online]. Available: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
[2] C.-P. Hsieh et al., "RULER: What's the real context size of your long-context language models?" arXiv preprint arXiv:2404.06654, 2024.
[3] A. Zhang and O. Khattab, "Recursive Language Models," MIT OASYS Lab, October 2025. [Online]. Available: https://alexzhang13.github.io/blog/2025/rlm/
[4] J. Wei et al., "Chain-of-thought prompting elicits reasoning in large language models," in Proc. 36th Conf. Neural Information Processing Systems (NeurIPS), 2022.
[5] O. Press, N. Smith, and M. Lewis, "Train short, test long: Attention with linear biases enables input length extrapolation," in Proc. Int. Conf. Learning Representations (ICLR), 2022.
[6] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, "YaRN: Efficient context window extension of large language models," arXiv preprint arXiv:2309.00071, 2023.
[7] N. Shazeer, "Fast transformer decoding: One write-head is all you need," arXiv preprint arXiv:1911.02150, 2019.