Working with Contexts
The following article is adapted from two blog posts by Drew Breunig: “How Long Contexts Fail” and “How to Fix Your Contexts.”
Managing Your Context Is the Key to Successful Agents
As frontier model context windows continue to grow,1 with many supporting up to 1 million tokens, I see many excited discussions about how long-context windows will unlock the agents of our dreams. After all, with a large enough window, you can simply throw everything you might need into a prompt—tools, documents, instructions, and more—and let the model take care of the rest.
Long contexts kneecapped RAG enthusiasm (no need to find the best document when you can fit it all in the prompt!), enabled MCP hype (connect to every tool and models can do any job!), and fueled enthusiasm for agents.2
But in reality, longer contexts do not generate better responses. Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting. This is especially problematic for agents, which rely on context to gather information, synthesize findings, and coordinate actions.
Let’s run through the ways contexts can get out of hand, then review methods to mitigate or entirely avoid context failures.
Context Poisoning
Context poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.
The DeepMind team called out context poisoning in the Gemini 2.5 technical report, which we broke down previously. When playing Pokémon, the Gemini agent would occasionally hallucinate, poisoning its context:
An especially egregious form of this issue can take place with “context poisoning”—where many parts of the context (goals, summary) are “poisoned” with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals.
If the “goals” section of its context was poisoned, the agent would develop nonsensical strategies and repeat behaviors in pursuit of a goal that cannot be met.
Context Distraction
Context distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.
As context grows during an agentic workflow—as the model gathers more information and builds up history—this accumulated context can become distracting rather than helpful. The Pokémon-playing Gemini agent demonstrated this problem clearly:
While Gemini 2.5 Pro supports 1M+ token context, making effective use of it for agents presents a new research frontier. In this agentic setup, it was observed that as the context grew significantly beyond 100k tokens, the agent showed a tendency toward favoring repeating actions from its vast history rather than synthesizing novel plans. This phenomenon, albeit anecdotal, highlights an important distinction between long-context for retrieval and long-context for multistep, generative reasoning.
Instead of using its training to develop new strategies, the agent became fixated on repeating past actions from its extensive context history.
For smaller models, the distraction ceiling is much lower. A Databricks study found that model correctness began to fall around 32k tokens for Llama 3.1-405b, and earlier for smaller models.
If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization3 and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.
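One practical consequence is to treat the distraction ceiling as a budget you enforce. Here’s a minimal sketch, using tiktoken as a stand-in tokenizer (your model’s own tokenizer and ceiling will differ) and a summarize() step you would supply yourself:

```python
import tiktoken

# Assumed ceiling; tune per model (e.g., around 32k for Llama 3.1-405b,
# lower for smaller models, per the Databricks findings).
DISTRACTION_CEILING = 32_000

enc = tiktoken.get_encoding("cl100k_base")  # stand-in for your model's tokenizer

def within_budget(context: str, ceiling: int = DISTRACTION_CEILING) -> bool:
    """Check whether the assembled context is under the distraction ceiling."""
    return len(enc.encode(context)) < ceiling

def maybe_compress(context: str, summarize) -> str:
    """Compress the context once it approaches the ceiling.

    `summarize` is a hypothetical LLM-powered compression step you provide;
    see the Context Summarization section below.
    """
    return context if within_budget(context) else summarize(context)
```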
Context Confusion
Context confusion is when superfluous content in the context is used by the model to generate a low-quality response.
For a minute there, it really seemed like everyone was going to ship an MCP. The dream of a powerful model, connected to all of your services and stuff, doing all of your mundane tasks felt within reach. Just throw all the tool descriptions into the prompt and hit go. Claude’s system prompt showed us the way, as it’s mostly tool definitions or instructions for using tools.
But even if consolidation and competition don’t slow MCPs, context confusion will. It turns out there can be such a thing as too many tools.
The Berkeley Function-Calling Leaderboard is a tool-use benchmark that evaluates the ability of models to effectively use tools to respond to prompts. Now on its third version, the leaderboard shows that every model performs worse when provided with more than one tool.4 Further, the Berkeley team “designed scenarios where none of the provided functions are relevant…we expect the model’s output to be no function call.” Yet all models will occasionally call tools that aren’t relevant.
Browsing the function-calling leaderboard, you can see the problem get worse as the models get smaller:
[Figure: Berkeley Function-Calling Leaderboard results, with accuracy declining as models get smaller]
A striking example of context confusion can be seen in a recent paper that evaluated small model performance on the GeoEngine benchmark, a trial that features 46 different tools. When the team gave a quantized (compressed) Llama 3.1 8b a query with all 46 tools, it failed, even though the context was well within the 16k context window. But when they gave the model only 19 tools, it succeeded.
The problem is, if you put something in the context, the model has to pay attention to it. It may be irrelevant information or needless tool definitions, but the model will take it into account. Large models, especially reasoning models, are getting better at ignoring or discarding superfluous context, but we frequently see worthless information trip up agents. Longer contexts let us stuff in more information, but this ability comes with downsides.
Context Clash
Context clash is when you accrue new information and tools in your context that conflict with other information in the context.
This is a more problematic version of context confusion. The bad context here isn’t irrelevant, it directly conflicts with other information in the prompt.
A Microsoft and Salesforce team documented this brilliantly in a recent paper. The team took prompts from multiple benchmarks and “sharded” their information across multiple prompts. Think of it this way: Sometimes, you might sit down and type paragraphs into ChatGPT or Claude before you hit enter, considering every necessary detail. Other times, you might start with a simple prompt, then add further details when the chatbot’s answer isn’t satisfactory. The Microsoft/Salesforce team modified benchmark prompts to look like these multistep exchanges:
[Figure: the same prompt delivered in full (left) versus sharded across multiple chat messages (right)]
All the information from the prompt on the left side is contained within the several messages on the right side, which would be played out over multiple chat rounds.
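To make the setup concrete, here’s a hypothetical illustration of the same request expressed both ways (the wording is mine, not drawn from the paper):

```python
# One-shot: every necessary detail specified up front.
full_prompt = (
    "Write a Python function that parses ISO 8601 dates, "
    "returns None on invalid input, and includes type hints."
)

# Sharded: the same requirements revealed over several chat turns,
# inviting the model to attempt answers before it has everything.
sharded_turns = [
    "Write a Python function that parses dates.",
    "Actually, the dates are in ISO 8601 format.",
    "It should return None on invalid input.",
    "Oh, and add type hints.",
]
```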
The sharded prompts yielded dramatically worse results, with an average drop of 39%. And the team tested a range of models—OpenAI’s vaunted o3’s score dropped from 98.1 to 64.1.
What’s going on? Why are models performing worse if information is gathered in stages rather than all at once?
The answer is context confusion: The assembled context, containing the entirety of the chat exchange, contains early attempts by the model to answer the challenge before it has all the information. These incorrect answers remain present in the context and influence the model when it generates its final answer. The team writes:
We find that LLMs often make assumptions in early turns and prematurely attempt to generate final solutions, on which they overly rely. In simpler terms, we discover that when LLMs take a wrong turn in a conversation, they get lost and do not recover.
This doesn’t bode well for agent builders. Agents assemble context from documents, tool calls, and from other models tasked with subproblems. All of this context, pulled from diverse sources, has the potential to disagree with itself. Further, when you connect to MCP tools you didn’t create, there’s a greater chance their descriptions and instructions clash with the rest of your prompt.
Learnings
The arrival of million-token context windows felt transformative. The ability to throw everything an agent might need into the prompt inspired visions of superintelligent assistants that could access any document, connect to every tool, and maintain perfect memory.
But, as we’ve seen, bigger contexts create new failure modes. Context poisoning embeds errors that compound over time. Context distraction causes agents to lean heavily on their context and repeat past actions rather than push forward. Context confusion leads to irrelevant tool or document usage. Context clash creates internal contradictions that derail reasoning.
These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon: gathering information from multiple sources, making sequential tool calls, engaging in multi-turn reasoning, and accumulating extensive histories.
Thankfully, there are solutions!
Mitigating and Avoiding Context Failures
Let’s run through the ways we can mitigate or avoid context failures entirely.
Everything comes down to information management. Everything in the context influences the response. We’re back to the old programming adage of “garbage in, garbage out.” Thankfully, there are plenty of options for dealing with the issues above.
RAG
Retrieval-augmented generation (RAG) is the act of selectively adding relevant information to help the LLM generate a better response.
Because so much has been written about RAG, we’re not going to cover it here beyond saying: It’s very much alive.
Every time a model ups the context window ante, a new “RAG is dead” debate is born. The last significant event was when Llama 4 Scout landed with a 10 million token window. At that size, it’s really tempting to think, “Screw it, throw it all in,” and call it a day.
But, as we’ve already covered, if you treat your context like a junk drawer, the junk will influence your response. If you want to learn more, here’s a new course that looks great.
Tool Loadout
Tool loadout is the act of selecting only relevant tool definitions to add to your context.
The term “loadout” is a gaming term that refers to the specific combination of abilities, weapons, and equipment you select before a level, match, or round. Usually, your loadout is tailored to the context—the character, the level, the rest of your team’s makeup, and your own skill set. Here, we’re borrowing the term to describe selecting the most relevant tools for a given task.
Perhaps the simplest way to select tools is to apply RAG to your tool descriptions. This is exactly what Tiantian Gan and Qiyao Sun did, which they detail in their paper “RAG MCP.” By storing their tool descriptions in a vector database, they’re able to select the most relevant tools given an input prompt.
When prompting DeepSeek-v3, the team found that selecting the right tools becomes critical when you have more than 30 tools. Above 30, the descriptions of the tools begin to overlap, creating confusion. Beyond 100 tools, the model was virtually guaranteed to fail their test. Using RAG techniques to select fewer than 30 tools yielded dramatically shorter prompts and resulted in as much as 3x better tool selection accuracy.
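Here’s a minimal sketch of tool RAG using sentence-transformers; the embedding model and tool registry are illustrative stand-ins, not details from the paper:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative registry; a real one might hold dozens or hundreds of tools.
tools = {
    "get_weather": "Fetch the current weather forecast for a city.",
    "send_email": "Send an email to a recipient with a subject and body.",
    "query_sales_db": "Run a SQL query against the sales database.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
tool_names = list(tools)
tool_embeddings = model.encode([tools[name] for name in tool_names])

def select_loadout(query: str, k: int = 2) -> list[str]:
    """Return the k tools whose descriptions best match the query."""
    scores = util.cos_sim(model.encode(query), tool_embeddings)[0].tolist()
    ranked = sorted(zip(tool_names, scores), key=lambda pair: -pair[1])
    return [name for name, _ in ranked[:k]]

# Only the selected definitions go into the prompt.
print(select_loadout("Email the Q3 revenue numbers to Dana"))
```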
For smaller models, the problems begin long before we hit 30 tools. One paper we touched on earlier, “Less is More,” demonstrated that Llama 3.1 8b fails a benchmark when given 46 tools but succeeds when given only 19 tools. The issue is context confusion, not context window limitations.
To address this issue, the team behind “Less is More” developed a way to dynamically select tools using an LLM-powered tool recommender. The LLM was prompted to reason about the “number and type of tools it ‘believes’ it requires to answer the user’s query.” This output was then semantically searched (tool RAG, again) to determine the final loadout. They tested this method on the Berkeley Function-Calling Leaderboard, finding that Llama 3.1 8b performance improved by 44%.
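In the spirit of that two-step approach (this is my own sketch, not the paper’s code), the recommender chains an LLM call into the semantic search from the previous example:

```python
def recommend_loadout(query: str, call_llm, select_loadout) -> list[str]:
    """Two-step loadout selection: an LLM first describes the tools it
    believes it needs, then that description is matched against the
    tool registry via semantic search (tool RAG, as above).

    `call_llm` is a hypothetical wrapper around your model provider;
    `select_loadout` is the retrieval function sketched earlier.
    """
    needs = call_llm(
        "Reason about the number and type of tools you believe you need "
        f"to answer this query, then describe them: {query}"
    )
    return select_loadout(needs)
```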
The “Less is More” paper notes two other benefits of smaller contexts—reduced power consumption and speed—crucial metrics when operating at the edge (meaning, running an LLM on your phone or PC, not on a specialized server). Even when their dynamic tool selection method failed to improve a model’s result, the power savings and speed gains were worth the effort, yielding savings of 18% and 77%, respectively.
Thankfully, most agents have small surface areas that require only a few hand-curated tools. But if the breadth of capabilities or the number of integrations needs to grow, always consider your loadout.
Context Quarantine
Context quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.
We see better results when our contexts aren’t too long and don’t sport irrelevant content. One way to achieve this is to break our tasks up into smaller, isolated jobs—each with its own context.
There are many examples of this tactic, but an accessible write-up of this strategy is Anthropic’s blog post detailing its multi-agent research system. They write:
The essence of search is compression: distilling insights from a vast corpus. Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent. Each subagent also provides separation of concerns—distinct tools, prompts, and exploration trajectories—which reduces path dependency and enables thorough, independent investigations.
Research lends itself to this design pattern. When given a question, several agents can identify and separately prompt several subquestions or areas of exploration. This not only speeds up the information gathering and distillation (if there is compute available), but it keeps each context from accruing too much information or information not relevant to a given prompt, delivering higher quality results:
Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously. We found that a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval. For example, when asked to identify all the board members of the companies in the Information Technology S&P 500, the multi-agent system found the correct answers by decomposing this into tasks for subagents, while the single-agent system failed to find the answer with slow, sequential searches.
This approach also helps with tool loadouts, as the agent designer can create several agent archetypes, each with its own dedicated loadout and instructions for how to utilize each tool.
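A minimal sketch of the pattern, assuming a generic call_llm(messages) wrapper you provide (the prompts and structure here are illustrative, not Anthropic’s implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def call_llm(messages: list[dict]) -> str:
    """Hypothetical wrapper around your LLM provider's chat API."""
    raise NotImplementedError

def research_subquestion(subquestion: str) -> str:
    # Each subagent starts with a fresh, quarantined context: no shared history.
    return call_llm([
        {"role": "system", "content": "You are a focused research subagent. Return condensed findings."},
        {"role": "user", "content": subquestion},
    ])

def answer(question: str, subquestions: list[str]) -> str:
    # Run subagents in parallel, each in its own context window.
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(research_subquestion, subquestions))
    # The lead agent sees only condensed findings, never the raw transcripts.
    synthesis = f"Question: {question}\n\nFindings:\n" + "\n\n".join(findings)
    return call_llm([{"role": "user", "content": synthesis}])
```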
The challenge for agent builders, then, is to find opportunities for isolated tasks to spin out onto separate threads. Problems that require context-sharing among several agents aren’t particularly suited to this tactic.
If your agent’s domain is at all suited to parallelization, be sure to read the whole Anthropic write-up. It’s excellent.
Context Pruning
Context pruning is the act of removing irrelevant or otherwise unneeded information from the context.
Agents accrue context as they fire off tools and assemble documents. At times, it’s worth pausing to assess what’s been assembled and remove the cruft. This could be something you task your main LLM with, or you could design a separate LLM-powered tool to review and edit the context. Or you could choose something more tailored to the pruning task.
Context pruning has a (relatively) long history, as context lengths were a more problematic bottleneck in the natural language processing (NLP) field prior to ChatGPT. Building on this history, a current pruning method is Provence, “an efficient and robust context pruner for question answering.”
Provence is fast, accurate, simple to use, and relatively small—only 1.75 GB. You can call it in a few lines, like so:
```python
from transformers import AutoModel

provence = AutoModel.from_pretrained(
    "naver/provence-reranker-debertav3-v1", trust_remote_code=True
)

# Read in a markdown version of the Wikipedia entry for Alameda, CA
with open('alameda_wiki.md', 'r', encoding='utf-8') as f:
    alameda_wiki = f.read()

# Prune the article, given a question
question = 'What are my options for leaving Alameda?'
provence_output = provence.process(question, alameda_wiki)
```
Provence edited the article, cutting 95% of the content, leaving me with only the relevant subset. It nailed it.
One could employ Provence or a similar function to cull documents or whole contexts. Further, this pattern is a strong argument for maintaining a structured5 version of your context in a dictionary or other form, from which you assemble a compiled string prior to every LLM call. This structure would come in handy when pruning, allowing you to ensure the main instructions and goals are preserved while the document or history sections can be pruned or summarized.
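As a sketch of that structured approach (the field names and layout are my own, not from any particular library), the context lives as data and is compiled to a string only at call time:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    instructions: str                                   # always preserved
    goals: list[str] = field(default_factory=list)      # always preserved
    documents: list[str] = field(default_factory=list)  # prunable
    history: list[str] = field(default_factory=list)    # prunable or summarizable

    def compile(self) -> str:
        """Assemble the prompt string immediately before each LLM call."""
        return "\n\n".join([
            "## Instructions\n" + self.instructions,
            "## Goals\n" + "\n".join(self.goals),
            "## Documents\n" + "\n\n".join(self.documents),
            "## History\n" + "\n".join(self.history),
        ])

    def prune_documents(self, keep: set[int]) -> None:
        """Drop documents a pruner (e.g., Provence) judged irrelevant."""
        self.documents = [d for i, d in enumerate(self.documents) if i in keep]
```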
Context Summarization
Context summarization is the act of boiling down an accrued context into a condensed summary.
Context summarization first appeared as a tool for dealing with smaller context windows. As your chat session came close to exceeding the maximum context length, a summary would be generated and a new thread would begin. Chatbot users did this manually in ChatGPT or Claude, asking the bot to generate a short recap that could then be pasted into a new session.
However, as context windows increased, agent builders discovered there are benefits to summarization besides staying within the total context limit. As we’ve seen, beyond 100,000 tokens the context becomes distracting and causes the agent to rely on its accumulated history rather than its training. Summarization can help it “start over” and avoid repeating context-based actions.
Summarizing your context is easy to do but hard to perfect for any given agent. Knowing what information should be preserved, and detailing that to an LLM-powered compression step, is critical for agent builders. It’s worth breaking out this function as its own LLM-powered stage or app, which lets you collect evaluation data that can inform and optimize this task directly.
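A minimal sketch of such a stage, assuming a call_llm(prompt) helper you provide; the preservation instructions are illustrative and would be tuned per agent:

```python
SUMMARIZE_TEMPLATE = """Condense the agent history below.
Preserve exactly: open goals, unresolved errors, and tool results
still needed for the current task. Drop everything else.

History:
{history}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider."""
    raise NotImplementedError

def compress_history(history: str) -> str:
    # Breaking compression out as its own stage makes it easy to log
    # inputs and outputs and build an eval set for this step directly.
    return call_llm(SUMMARIZE_TEMPLATE.format(history=history))
```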
Context Offloading
Context offloading is the act of storing information outside the LLM’s context, usually via a tool that stores and manages the data.
This might be my favorite tactic, if only because it’s so simple you don’t believe it will work.
Again, Anthropic has a good write-up of the technique, which details their “think” tool, which is basically a scratchpad:
With the “think” tool, we’re giving Claude the ability to include an additional thinking step—complete with its own designated space—as part of getting to its final answer… This is particularly helpful when performing long chains of tool calls or in long multi-step conversations with the user.
I really appreciate the research and other writing Anthropic publishes, but I’m not a fan of this tool’s name. If this tool were called scratchpad, you’d know its function immediately. It’s a place for the model to write down notes that don’t cloud its context and are available for later reference. The name “think” clashes with “extended thinking” and needlessly anthropomorphizes the model… but I digress.
Having a space to log notes and progress works. Anthropic shows that pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains: up to a 54% improvement against a benchmark for specialized agents.
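A scratchpad-style tool takes only a few lines to define. This sketch follows the general shape of a tool definition in Anthropic’s tool-use format, renamed per the quibble above; the description wording is mine, not the published “think” tool’s:

```python
scratchpad_tool = {
    "name": "scratchpad",
    "description": (
        "Write down a note or an intermediate reasoning step. This does not "
        "fetch new information or change any state; it just records the note "
        "so it can be referenced later."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "note": {"type": "string", "description": "The note to record."}
        },
        "required": ["note"],
    },
}

# The agent loop appends each call's input to a log, keeping working
# notes out of the main instruction context until they're needed.
scratchpad_log: list[str] = []

def handle_scratchpad(tool_input: dict) -> str:
    scratchpad_log.append(tool_input["note"])
    return "Noted."
```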
Anthropic identified three scenarios where the context offloading pattern is useful:
- Tool output analysis. When Claude needs to carefully process the output of previous tool calls before acting and might need to backtrack in its approach;
- Policy-heavy environments. When Claude needs to follow detailed guidelines and verify compliance; and
- Sequential decision making. When each action builds on previous ones and mistakes are costly (often found in multistep domains).
Takeaways
Context management is usually the hardest part of building an agent. Programming the LLM to, as Karpathy says, “pack the context windows just right”—smartly deploying tools and information, and performing regular context maintenance—is the job of the agent designer.
The key insight across all of the above tactics is that context is not free. Every token in the context influences the model’s behavior, for better or worse. The massive context windows of modern LLMs are a powerful capability, but they’re not an excuse to be sloppy with information management.
As you build your next agent or optimize an existing one, ask yourself: Is everything in this context earning its keep? If not, you now have six ways to fix it.
Footnotes
- Gemini 2.5 and GPT-4.1 have 1 million token context windows, large enough to throw Infinite Jest in there with plenty of room to spare.
- The “Long form text” section in the Gemini docs sums up this optimism nicely.
- In fact, in the Databricks research cited above, a frequent way models would fail when given long contexts was to return summarizations of the provided context while ignoring any instructions contained within the prompt.
- If you’re on the leaderboard, pay attention to the “Live (AST)” columns. These metrics use real-world tool definitions contributed by enterprises, “avoiding the drawbacks of dataset contamination and biased benchmarks.”
- Hell, this whole list of tactics is a strong argument for why you should program your contexts.