← Journal/2026-06-11·11 min·context engineering

Context Engineering for LLM Agents: The Real Bottleneck

Prompt engineering is about the words. Context engineering is about what's in the window when the model reads them. Here's how I budget, prune, and persist context for agents that run for hours.

By Harel Asaf·AI Builder·Tel Aviv

Most people who ask me to "fix the prompt" don't have a prompt problem. They have a context problem. The instructions are fine. What's broken is everything else that was in the model's context window when it read those instructions — the stale tool output from twelve steps ago, the entire file when only nine lines mattered, the half-finished plan the agent already abandoned.

Prompt engineering is about choosing the words. Context engineering is about curating the state those words operate on. For a single-shot classification call, the two are nearly the same thing. For an agent that runs for an hour across forty tool calls, they could not be more different — and context engineering is the one that decides whether it succeeds.

The window is a budget, not a backpack

The mental model that ruins agents is treating the context window like a backpack: throw everything in, the model will find what it needs. Big windows make this feel safe. Claude Opus 4.8 ships a 1M-token context window. Surely you can just load the whole repo?

You can. You shouldn't. Two things go wrong as the window fills, and neither is about hitting the hard limit.

First, attention dilutes. A model reasoning over 4,000 relevant tokens is sharper than the same model reasoning over the same 4,000 tokens buried in 400,000 tokens of noise. The signal is still there; the model just has to find it, and every irrelevant token is a small tax on that search. Long-context benchmarks hide this because they ask a single needle-in-haystack question. Real agents ask hundreds of questions against a context that's mostly debris from earlier steps.

Second, cost and latency scale with what you load, every single turn. An agent is a loop: each step re-sends the accumulated history. If turn 30 carries 200K tokens of dead weight, you pay to process some version of that weight on turn 30, and 31, and 32. A bloated context isn't a one-time mistake. It's a recurring bill.

So the unit of context engineering is the token budget: a deliberate decision about how many tokens each part of the task is allowed to occupy, and what earns a slot.

What actually belongs in the window

I sort everything an agent might hold into four tiers, and the tier decides where it lives — not whether it's "useful." Almost everything is useful. That's the trap.

Tier 1 — the frozen core. The system prompt, the tool definitions, the role. This never changes within a run, so it sits at the very front of the prefix where it can be cached and never re-read at full price. The discipline here is don't touch it mid-run — no interpolated timestamps, no per-step mode flags. (More on why in a moment.)

Tier 2 — the live task. The current goal, the specific files or records in play right now, the last few tool results. This is the working set. It's allowed to be large because it's load-bearing.

Tier 3 — retrievable reference. Documentation, schemas, prior decisions, the rest of the codebase. This does not belong in the window by default. It belongs behind a tool — a search, a file read, a lookup — that the agent calls when the task actually needs it. Loading Tier 3 eagerly is the single most common way agents drown.

Tier 4 — cross-session memory. Anything that must survive the run ending: lessons, user preferences, project conventions. This lives on disk (a memory file, a store) and is read in deliberately at the start of a task, not carried in the conversation.

The skill is demotion. Most context that feels like Tier 2 is really Tier 3 — you don't need the whole file in the window, you need the ability to read the file. Give the agent the tool and let it pull what it needs, when it needs it.

Three levers for keeping the window lean

Once an agent is running, context grows whether you like it or not. Every tool call adds a result; every turn of thinking adds tokens. You need mechanisms that fight back. There are three, and they do different jobs.

Pruning: cut what's stale

The oldest tool results and the completed thinking blocks are usually dead weight by the time the agent is twenty steps in. Context editing clears them based on configurable thresholds — the content is removed, not summarized. Use it when old tool outputs are genuinely no longer relevant and you want to keep the transcript structurally intact while shrinking it. This is a scalpel: it removes specific kinds of blocks once they age out.

Compaction: summarize when you're near the wall

When a conversation is genuinely going to approach the window limit, compaction summarizes the earlier history into a compact block server-side. With Claude, this is a beta you opt into (the compact-2026-01-12 header), and it triggers automatically as you approach the threshold (the default is around 150K tokens).

There's one rule that bites everyone: append the full response content back to your messages every turn, not just the text. Compaction blocks come back in the response, and the API uses them to replace the compacted history on the next request. If you extract only the text string and append that, you silently throw away the compaction state and the whole mechanism breaks. Keep the structured content; keep the blocks.

Memory: persist across the boundary

Pruning and compaction operate within a run. Memory is for across runs. The model reads and writes files in a memory directory that survives the process ending. For long-lived agents, this is what turns "starts from zero every time" into "remembers what it learned last week."

The format matters more than people expect. One lesson per file, a one-line summary at the top, record both corrections and confirmed approaches, and update an existing note rather than spawning a duplicate. A memory store that's just an append-only log of everything the agent ever thought is Tier 3 noise with extra steps. A curated one is a genuine capability.

Most serious long-running agents use all three: prune the stale turns, compact when near the limit, persist the durable lessons to memory.

The cache is a context-engineering tool

Here's the part people miss: how you order your context determines whether you can cache it, and caching is a context-engineering decision, not a separate optimization.

Prompt caching is a prefix match. The model caches the exact bytes of your prompt up to a breakpoint, and any change anywhere before that point invalidates everything after it. The render order is fixed: tools, then system, then messages. So the architecture writes itself — stable content first, volatile content last.

This is exactly why Tier 1 must stay frozen. The moment you interpolate datetime.now() into the system prompt, you've put a per-request value at the front of the prefix, and nothing behind it can ever cache. The same goes for a tool list that varies per user, or a JSON blob serialized without sorted keys. These are context mistakes that happen to surface as cost mistakes — the cache read rate quietly drops to zero and the bill quietly triples.

If you're injecting something dynamic mid-conversation — a mode switch, freshly fetched state — the move is to append it as a message at the end, not to rewrite the system prompt at the front. A message at turn 5 invalidates nothing before turn 5. A system-prompt edit invalidates the entire run. I wrote a whole separate piece on the mechanics; for now, just internalize the ordering: frozen, then live, then volatile.

How I actually do it

When I build an agent, the context plan comes before the prompt. Concretely:

Start from the tools, not the text. What can the agent retrieve on demand? Every capability I can express as a tool is context I don't have to preload. A grep tool beats pasting the codebase. A read_file tool beats pasting the file.
Budget the working set out loud. I decide roughly how many tokens the live task should occupy and treat overruns as a signal that something is misclassified as Tier 2 when it's really Tier 3.
Instrument the window. I look at what's actually consuming tokens across a run — which tool results, which reads. (This is literally why I built a context auditor: you cannot manage what you can't see.) The biggest wins are almost always one verbose tool dumping huge results the agent reads once and never uses again.
Decide the persistence story up front. Does this agent need to remember anything after it stops? If yes, memory file and a convention for writing to it. If no, don't add the complexity.
Keep the prefix frozen and ordered for the cache. Stable first. No clocks in the system prompt. Deterministic tool order.

None of this is exotic. It's just the recognition that for an agent, the prompt is a small part of what the model reads — and the rest of it, the part nobody wrote on purpose, is usually what's holding the system back.

The one-sentence version

Prompt engineering asks what should I tell the model. Context engineering asks what should be true in the window when it reads that — and for anything that runs longer than a single call, the second question is the one that decides the outcome.

FAQ

What is context engineering?

Context engineering is the practice of deliberately curating what occupies an LLM's context window — what to load, what to retrieve on demand, what to prune, and what to persist — so the model reasons over signal instead of accumulated noise. It's distinct from prompt engineering, which is about the wording of the instructions themselves.

How is context engineering different from prompt engineering?

Prompt engineering shapes the instructions; context engineering shapes the state those instructions operate on. For a single API call they nearly coincide. For a long-running agent that accumulates tool results and history across many turns, context engineering is the larger lever.

Why not just load everything into a large context window?

Two reasons. Attention dilutes — the same relevant tokens are easier to reason over when they aren't buried in irrelevant ones. And cost and latency scale with everything you load, every turn, because agents re-send accumulated history on each step. A 1M-token window makes overloading possible, not free.

What is compaction and when should I use it?

Compaction summarizes a conversation's earlier history into a compact block when it approaches the context-window limit, so a long agent can keep going. Use it for conversations likely to reach the limit. The critical detail: append the full structured response content back each turn, not just the text, or you lose the compaction state.

What's the difference between pruning, compaction, and memory?

Pruning (context editing) removes stale tool results and thinking blocks within a run. Compaction summarizes earlier history when nearing the window limit, within a run. Memory persists information to disk across runs. Long-lived agents typically use all three.

How does context ordering affect prompt caching?

Caching is a prefix match, so any change before a cache breakpoint invalidates everything after it. Put stable content first (frozen system prompt, deterministic tool list) and volatile content last. Interpolating per-request values like timestamps into the system prompt silently destroys the cache for the whole request.

Build log

Get an email when I ship a new prototype or essay. No funnel — just the work.

Next in the journal →

What Is MCP? The Model Context Protocol, Explained by Someone Who Ships Servers

MCP is the USB-C of LLM tooling: one protocol that lets any model talk to any tool or data source. Here's what it is, why it exists, and how I think about building a server.