As LLMs become more pervasive in software development, it's clear that the industry needs to come to grips with how to use them appropriately. Rather than transforming software engineering into an effortless breeze (as the cursed term vibe-coding might have you believe), LLM-assisted development introduces a unique set of challenges. If you don't do it right, you can quickly end up lost in a mass of complexity, a swamp of oversaturated context that no LLM can help you with and that you won't understand on your own. One of these challenges I have identified, and now dub, is "context pollution."
What Is Context Pollution?
Context pollution is what happens when outdated memories, contradictory bits of information left over from old, unmaintained context, fast-moving situations, and other legacy artifacts slowly accumulate and gradually degrade the performance of LLMs. It's something I've noticed happening often when using LLMs in my workflows. Essentially, context pollution is the gradual increase of disorder in an unmaintained context store over time.
Because, thanks to physics, things naturally get more disordered over time unless some intervening agency expends energy to reverse it, some context pollution is virtually guaranteed without periodic pruning, updating, or cleaning.
To be an effective prompt engineer for software development (or to work with LLMs generally), you really need to think strategically, and many steps ahead, about how you will structure context. If you've worked with LLMs, you know that Context is King (or Queen, depending on the context). In their typically autoregressive, purely symbolic, airless bubble worlds, LLMs don't understand anything beyond the tokens you provide them.
Indeed, these systems are all about pattern completion: they take what you give them and extend it via the "most probable" next token. So in the "mind" of an LLM, what it gives back to us is literally the continuation of what we gave it (plus conversation history), bounced off its training weights.
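A minimal sketch of that loop makes the point concrete. The `model.next_token_probs` interface below is hypothetical and real APIs differ, but the shape of the process is the same: the model only ever sees the token sequence it is handed, and everything it produces is an extension of that sequence.

```python
# Sketch of autoregressive completion. `model.next_token_probs` is a hypothetical
# interface returning {token: probability}; real APIs differ, but the loop shape holds.
def complete(model, tokens, max_new_tokens=50):
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)   # distribution over the next token,
                                                 # conditioned on everything so far
        next_token = max(probs, key=probs.get)   # greedily take the "most probable" token
        tokens.append(next_token)
        if next_token == "<eos>":                # stop if the model signals end-of-sequence
            break
    return tokens                                # the output is literally the input, extended
```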
This is why a good prompt and/or a load of informative context goes a long way toward improving the quality of a language model's response. Purely in this respect they are like us: people tend to do a better job when you "fill them in" with more relevant context, too. The key is relevance.
Where language models differ in a major way from us is in how they handle the subtle nuances of context. They more or less take it at face value, which is what exposes them to vulnerabilities like prompt injection attacks and context pollution. People, in contrast, are true contextual wizards, able to shift contexts, view contexts in terms of wider contexts, examine hidden pretexts and subtexts, and generally situate any context in a wider world of meaning fluidly, dynamically, and virtually unconsciously.
If you provide a model with a piece of context that was true yesterday but became false today, then unless someone or something goes in and updates that piece of context, the LLM will be none the wiser and will persist with the now-obsolete information. In this respect LLMs are typically unlike people, whom you can tell something once and expect the new information to fluidly update their understanding of the situation.
Today we have systems that can create memories, but so far, from what I have seen, we don't yet have systems capable of advanced metacognition that can reflexively and recursively update memories to track changing, fluid scenarios and truly develop a proactive understanding of ongoing concerns. Thus it's more or less up to the system maintainers to periodically check, clean, and refresh context stores to ensure they are up to date and task-relevant, along the lines of the sketch below.
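Here is one way such a maintenance pass might look. The `ContextEntry` structure and the fourteen-day staleness threshold are illustrative assumptions of mine, not any particular tool's API; the point is simply that something has to filter stale or off-topic material before it reaches the model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ContextEntry:
    text: str             # the snippet that would be fed to the model
    topic: str            # coarse label used to judge task relevance
    updated_at: datetime  # last time a human (or pipeline) verified it

MAX_AGE = timedelta(days=14)  # illustrative staleness threshold

def prune_context(entries, relevant_topics, now=None):
    """Keep only entries that are fresh and relevant to the current task."""
    now = now or datetime.now()
    kept = []
    for entry in entries:
        if now - entry.updated_at > MAX_AGE:
            continue  # stale: route to a human for review instead of into the prompt
        if entry.topic not in relevant_topics:
            continue  # off-topic for the task at hand
        kept.append(entry)
    return kept
```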
Retrieval-Augmented Generation (RAG) systems automate much of this context management by offloading it to various data pipelines that operate independently of the language model, but these systems are still relatively primitive compared to the advanced metacognition exhibited by intelligent people.
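For readers unfamiliar with RAG, the retrieval step looks roughly like the sketch below. The `embed`, `vector_store`, and `llm` objects are placeholders standing in for whatever embedding model, vector database, and chat model you use; the key idea is that the pipeline, not the model, decides which context makes it into the prompt.

```python
# Rough shape of the retrieval step in a RAG pipeline. `embed`, `vector_store`,
# and `llm` are placeholders for your embedding model, vector database, and chat model.
def answer_with_rag(question, embed, vector_store, llm, k=5):
    query_vec = embed(question)                        # embed the user's question
    chunks = vector_store.search(query_vec, top_k=k)   # pull the k most similar chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below. "
        "If the context looks outdated or contradictory, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)  # the pipeline, not the model, chose what context it sees
```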
Instead of Just More Context, Think the Right Context
One parameter you hear model developers brag about when talking about their benchmarks is the model's context window: the number of tokens a model can effectively process without substantial degradation in performance. Bigger is unambiguously better when it comes to context windows; however, no amount of context will make a difference if it is cluttered with inconsistent, outdated, or conflicting information. Semantics is another thing entirely, and no context window, however big, will help with that.
Indeed, a solid argument could be made that more context can be counterproductive. The more context you pile on, the more cognitive overhead you impose, and the more likely it is that conflicting information, or at least ambiguity, creeps in. Also, the more of that window you fill, the less familiar the input becomes, meaning what the model predicts may be more tenuous. That's why curated, up-to-date context is essential to keep a model on task. Basically, the more context you give a model, the more you increase the information entropy, or disorder, it has to chew through, which naturally makes its response less predictable.
In the best of all possible worlds, a model would be served precisely the context it needs to solve the problem at hand, no more and no less. In practice, we use these models to find the needle in the haystack, and because LLMs are not deterministic search engines in the traditional sense, this can be a messy affair.
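A crude approximation of "no more and no less" is to rank candidate context by relevance and stop at a token budget. This is a sketch under assumed helpers (`score` for relevance, `token_count` for a tokenizer's length function), not a prescription:

```python
def select_context(chunks, score, token_count, budget=4000):
    """Greedy selection: most relevant chunks first, stopping at the token budget.

    `score` and `token_count` are assumed helpers, e.g. embedding similarity to the
    task description and a tokenizer's length function.
    """
    selected, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = token_count(chunk)
        if used + cost > budget:
            continue  # skip anything that would blow the budget
        selected.append(chunk)
        used += cost
    return selected
```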
Here's the rub. If you're feeding an LLM context, it's probably because you're not reading it yourself: a popular use case for LLMs is to have them "predigest" ponderous amounts of text or unstructured data to improve our time to comprehension. So to the extent that we rely on them, we expose ourselves to this blind spot. One way around this is to prompt the model to spot contradictions and notify you of inconsistencies, as sketched below, but after a certain point you're going to have to build mechanisms to validate and synchronize your context data. Delegation to LLMs introduces new kinds of epistemic risk, sort of like the cognitive equivalent of not reading the terms and conditions.
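Something like the following is the low-effort version of that check. The prompt wording is an assumption of mine, and it only mitigates the blind spot rather than closing it, since you're still trusting the model's own report:

```python
# `llm` is a placeholder callable: prompt string in, completion text out.
CONSISTENCY_PROMPT = """Before answering, list any statements in the context below that
contradict each other or appear outdated, quoting the conflicting passages.
If there are none, reply "CONTEXT CONSISTENT" and then answer the question.

Context:
{context}

Question: {question}
"""

def ask_with_consistency_check(llm, context, question):
    return llm(CONSISTENCY_PROMPT.format(context=context, question=question))
```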
Cross That Out: Language Models Struggle with Subtractive Concepts
Making matters worse, LLMs typically struggle with negating obsolete context. It's a bit like how "don't think of a pink elephant" makes you think of a pink elephant. You see this happen often with AI image generators: ask one to remove and replace something it first generated, and there's a solid chance it will instead just strangely blur the two concepts. Telling it "except" will get you nowhere. As a concrete example, for the image I asked ChatGPT to generate for this blog post, I originally asked for a "brain with sticky notes all over it." Then I decided it would be funnier if it were a cartoon robot. When I told ChatGPT to make the corrections, I got this:
This amusingly ironic blending of concepts, in a situation where subtraction and replacement were called for, is context pollution in action. Only when I started a new session, with a fresh slate, did I get the image I wanted. While this image example vividly illustrates context pollution, the same thing can happen in any context where conflicting signals accumulate over time.
Using WindSurf: A Case Study
I've taken to using WindSurf, one of the new AI-forward IDEs. For the most part I love it. Like ChatGPT, WindSurf has a memory feature, which can be useful for helping structure the model's understanding of the project's context scope. What I quickly discovered, however, is that for fast-moving software projects these static memories were actually counterproductive.
In practice, functional memory depends on time-sensitive conditionals. So while I wanted the system to remember the state of the project and its roadmap, that picture would quickly become out of date. True memory is really a series of semantically related memories that iterate and evolve as the situation changes, but the system is not intelligently tracking and managing these complex, time-sensitive conditions and updating memory accordingly. If I didn't go in and manually update the memories, the system would make a mess, propagating an outdated view of the project onto its current state.
(And in practice, all these "memories" amount to is a measly bit of text that gets interpolated into the user prompt. In biology, memory is more likely a complex semantic vector of some sort. We aren't just "storing" sentences in our memory.)
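To make that concrete, the effect is roughly what the snippet below illustrates. This is my own caricature, not WindSurf's actual implementation: a saved memory is just text prepended to every prompt, and nothing updates it when the project moves on.

```python
# Caricature of a "memory" feature, not WindSurf's actual implementation:
# saved memories are just text prepended to every prompt, and nothing
# updates them when the project moves on.
memories = [
    "The project exposes REST endpoints under /api/v1.",   # true when saved...
    "Authentication is handled with session cookies.",     # ...later replaced by JWTs
]

def build_prompt(user_request):
    memory_block = "\n".join(f"- {m}" for m in memories)
    return f"Project memories:\n{memory_block}\n\nTask: {user_request}"

# Every new request silently inherits the stale view until a human edits `memories`.
print(build_prompt("Add token refresh to the login flow"))
```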
Problems like context pollution, and the likely fact that LLMs are here to stay, make me think a new subfield, context engineering, needs to be recognized. I've long predicted that AI will leave more work for us to do in its wake, and this is precisely what I'm seeing.
Until we develop far more sophisticated systems than we have today, in which models can engage in complex, reflexive, metacognitive memory and contextual self-management, it’s still on us to determine what the model thinks with. Until models can clean up after themselves, it’s up to us to think cleaner. Context isn’t just content—it’s a design responsibility.