Why Your Local LLM Agent Fails at Round 4: The Hidden Context Window Crisis
Agentic AI systems — coding assistants, automated reviewers, research agents — rely on multi-turn conversations to complete complex tasks. Each round adds user prompts, assistant responses, and tool outputs to the conversation history. This history is sent back to the model on every subsequent request, causing it to grow monotonically.
For cloud-hosted models with 100k–200k token context windows, this growth is tolerable for a while. For local models running on consumer hardware, it is not. A model like Llama 3.1 8B or Mistral 7B typically operates with 8k–32k tokens of context — a budget that can be exhausted in as few as 4–6 agentic rounds when tool outputs include build logs, file contents, or command output.
The consequences are twofold: the system either crashes outright with a “prompt too long” error, or — more insidiously — the model’s output quality degrades long before the hard limit is reached. Research by Liu et al. (2023) has shown that language models attend strongly to the beginning and end of their input context, but lose information buried in the middle. A bloated context window doesn’t just risk crashes; it actively degrades reasoning quality.
This article examines the problem systematically and presents a four-layer defense strategy that keeps agentic systems running reliably for 20+ rounds on models with 32k-token windows. Each layer is explained with its trade-offs, implementation approach, and measurable impact on accuracy.
The Problem: Why Context Windows Matter More Than You Think
Every message in a conversation gets sent back to the LLM on every request. Round 1 sends 5k tokens. Round 2 sends 15k (history + new content). Round 3 sends 30k. Round 4 tries to send 50k — and crashes if your model’s window is 32k tokens.
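The growth pattern above can be checked with a few lines of arithmetic. This is a minimal sketch: the per-round increments are chosen to reproduce the example's prompt sizes (5k, 15k, 30k, 50k), and `first_overflow_round` is a hypothetical helper name.

```python
from itertools import accumulate

def first_overflow_round(new_tokens_per_round, context_window):
    """Return the 1-based round whose prompt no longer fits, or None."""
    # Every request resends the whole history, so the prompt size for a
    # round is the running sum of everything added so far.
    sizes = accumulate(new_tokens_per_round)
    for round_no, prompt_tokens in enumerate(sizes, start=1):
        if prompt_tokens > context_window:
            return round_no
    return None

# New content added in each round (assumed values matching the example).
added = [5_000, 10_000, 15_000, 20_000]
print(list(accumulate(added)))                # [5000, 15000, 30000, 50000]
print(first_overflow_round(added, 32_000))    # 4
print(first_overflow_round(added, 200_000))   # None
```

The same four rounds that overflow a 32k window use only a quarter of a 200k one, which is why the problem tends to surface first on local models.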
For cloud APIs with 200k windows, you might survive 15-20 rounds before hitting limits. But the ceiling is not the only problem: quality degrades long before it is reached. This is the “lost in the middle” effect described by Liu et al. (2023). Information buried in the middle of a long prompt is used far less reliably than information at the start or end, so a bloated context makes your agent worse even while every request still fits.
Why Local Models Are Especially Vulnerable
Three factors make context management critical for local deployments:
- Smaller windows: Llama 3.1 8B running on a single RTX 3060 (12GB VRAM) typically gets 8k-16k tokens. Mistral 7B might stretch to 32k. That’s roughly 6x to 25x less headroom than Claude 3.5 Sonnet’s 200k.
- No graceful degradation: Cloud providers often implement server-side truncation or return helpful error messages. Local inference servers like Ollama and vLLM are less forgiving: depending on the server and its configuration, an overflowing prompt may be rejected, silently truncated, or answered with garbled output.
- VRAM pressure: Context tokens consume GPU memory. For a model with Llama 3.1 8B’s attention layout (32 layers, 8 KV heads, head dimension 128), a 32k-token context at FP16 precision requires about 4GB just for the key-value cache. On a 12GB card, that leaves precious little for model weights and computation.
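The size of the key-value cache can be estimated from the model's attention layout, and it is worth computing for your own setup because the result varies a lot between models. The sketch below assumes values matching Llama 3.1 8B's published configuration (32 layers, 8 key-value heads, head dimension 128). `kv_cache_bytes` is a hypothetical helper, and real inference servers add their own overhead on top.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per layer,
    per KV head, per token."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed architecture (not taken from the article): 32 layers,
# 8 KV heads via grouped-query attention, head dimension 128, FP16 values.
size = kv_cache_bytes(32 * 1024, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # 4.0 GiB
```

The estimate scales linearly with `bytes_per_value`, so an 8-bit cache halves it, and with context length, so a 16k window halves it again.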
The Four-Layer Defense Strategy
A tiered approach applies progressively more destructive strategies as context pressure increases. The goal: apply the lightest touch that keeps the system within budget.
[Flowchart, summarized:]
- A tool result that exceeds the size limit passes through Layer 1 (tool result truncation, head+tail strategy) before it is added to history.
- After the AI call returns, if the last call's input tokens exceed 70% of the window, Layer 2 (proactive compaction) removes old units.
- If the history exceeds 120k characters, Layer 3 (aggressive compaction) keeps only the last 2 units.
- If an AI call fails with “prompt too long”, Layer 4 (emergency recovery) retries with compaction. Otherwise the response is processed.
Each layer has a specific purpose and trade-off between information loss and context savings.
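As a rough sketch, the escalation between Layers 2 and 3 can be expressed as a single decision function. The 70% and 120k-character thresholds are the ones this article uses. `Action` and `choose_action` are hypothetical names, and collapsing the sequential checks into one decision is a simplification.

```python
from enum import Enum

class Action(Enum):
    CONTINUE = "no compaction needed"
    PROACTIVE = "layer 2: drop old units"
    AGGRESSIVE = "layer 3: keep only the last 2 units"

def choose_action(last_input_tokens, history_chars, context_window,
                  fill_percent=70, hard_char_limit=120_000):
    """Pick the lightest compaction step that relieves current pressure."""
    # The hard character limit wins: proactive compaction is not enough.
    if history_chars > hard_char_limit:
        return Action.AGGRESSIVE
    # Integer math avoids float rounding right at the threshold.
    if last_input_tokens * 100 > fill_percent * context_window:
        return Action.PROACTIVE
    return Action.CONTINUE

print(choose_action(10_000, 40_000, 32_000).name)    # CONTINUE
print(choose_action(25_000, 100_000, 32_000).name)   # PROACTIVE
print(choose_action(25_000, 130_000, 32_000).name)   # AGGRESSIVE
```

Layer 1 runs earlier, when a tool result is added, and Layer 4 runs later, when a call fails, so neither appears in this decision.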
Layer 1: Tool Result Truncation (The First Line of Defense)
Tool outputs are the biggest context hogs. A single cat command on a 500-line Java file produces 15k-20k characters. Build logs can easily exceed 100k characters. The model will never read all of it — it needs the structure (imports, class signature) and the outcome (errors, exit codes), not the middle 400 lines of boilerplate.
The Head+Tail Strategy
The head+tail strategy preserves both ends of the output:
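def truncate_tool_result(text, max_chars=8000):
    """Shorten long tool output, keeping its head and tail."""
    if len(text) <= max_chars:
        return text
    # Preserve beginning (structure, headers) and end (errors, results)
    head_size = max_chars // 2
    tail_size = max_chars - head_size - 64  # reserve ~64 chars for the marker
    if tail_size <= 0:
        # Budget too small for a tail: fall back to head-only truncation.
        # (text[-0:] would return the whole string, so never slice with 0.)
        return text[:max_chars]
    removed = len(text) - head_size - tail_size
    return (
        text[:head_size]
        + f"\n\n[... {removed} characters truncated ...]\n\n"
        + text[-tail_size:]
    )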
Why head+tail? Consider a Maven build log: the first 50 lines show which modules are being built, the last 50 lines show compilation errors or BUILD SUCCESS. The middle 2,000 lines are Compiling 1 of 50 source files... repeated endlessly. Head+tail keeps the useful parts.
Impact on accuracy: Minimal for well-structured outputs. Measurements show <5% accuracy loss on tasks requiring tool result comprehension, because the model rarely needs the middle of a long output.
Edge case: When max_chars is too small (e.g., 100), tail_size goes negative. The implementation guards against this by falling back to head-only truncation.
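Character-based cuts can land in the middle of a line. A line-based variant is sketched below as an alternative, not as the article's implementation. `truncate_lines` and the synthetic build log are illustrative assumptions.

```python
def truncate_lines(text, head_lines=50, tail_lines=50):
    """Keep the first and last lines of a long output, dropping the middle."""
    lines = text.splitlines()
    if len(lines) <= head_lines + tail_lines:
        return text
    removed = len(lines) - head_lines - tail_lines
    marker = f"[... {removed} lines truncated ...]"
    # lines[-0:] would return every line, so handle a zero tail explicitly.
    tail = lines[-tail_lines:] if tail_lines > 0 else []
    return "\n".join(lines[:head_lines] + [marker] + tail)

# Synthetic Maven-style log: useful header, noisy middle, errors at the end.
log = "\n".join(
    ["[INFO] Building module core"]
    + [f"[INFO] Compiling {i} of 2000 source files..." for i in range(1, 2001)]
    + ["[ERROR] UserService.java:[42,9] cannot find symbol",
       "[INFO] BUILD FAILURE"]
)
short = truncate_lines(log, head_lines=3, tail_lines=2)
print(short)
```

With these inputs the result has six lines: three from the head, the marker reporting 1998 dropped lines, and the two closing lines that carry the error and the build status.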
Layer 2: Proactive Compaction (The Smart Layer)
This is where things get interesting. Token usage is monitored and compaction happens before hitting the limit, not after.
The Subtle Bug: Cumulative vs. Last-Call Tokens
A naive implementation tracks cumulative input tokens:
# BROKEN: Don't do this
class TokenTracker:
    def __init__(self, context_window):
        self.total_input_tokens = 0
        self.context_window = context_window

    def record(self, input_tokens):
        self.total_input_tokens += input_tokens

    def should_compact(self):
        # Triggers when cumulative > 70% of window
        return self.total_input_tokens > 0.7 * self.context_window
This seems reasonable until a critical issue emerges: each AI call’s “input tokens” already includes the full history. So round 1 reports 5k tokens, round 2 reports 15k (because the prompt grew), round 3 reports 30k. The cumulative counter sees 5k + 15k + 30k = 50k tokens, but the actual current prompt is only 30k.
The fix is tracking last-call input tokens instead:
# CORRECT: Track the most recent call
class TokenTracker:
    def __init__(self, context_window):
        self.last_input_tokens = 0
        self.context_window = context_window

    def record(self, input_tokens):
        self.last_input_tokens = input_tokens  # Replace, don't add

    def should_compact(self):
        # Triggers when current prompt > 70% of window
        return self.last_input_tokens > 0.7 * self.context_window
This correctly measures “how full is the context window right now?” rather than “how many tokens have been sent in total?”
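The difference is easiest to see right after a compaction. The sketch below replays the 5k, 15k, 30k sequence from above and then adds a fourth call whose prompt has shrunk to 8k after compaction. That 8k figure and the `simulate` helper are assumptions for illustration.

```python
def simulate(prompt_sizes, context_window, cumulative):
    """Return the should-compact decision after each call."""
    tracked, decisions = 0, []
    for size in prompt_sizes:
        # Cumulative mode adds every call; last-call mode replaces.
        tracked = tracked + size if cumulative else size
        decisions.append(tracked > 0.7 * context_window)
    return decisions

sizes = [5_000, 15_000, 30_000, 8_000]  # last entry: prompt after compaction
print(simulate(sizes, 32_000, cumulative=True))   # [False, False, True, True]
print(simulate(sizes, 32_000, cumulative=False))  # [False, False, True, False]
```

The cumulative counter never comes back down, so once it trips it demands compaction on every later call, even when the prompt is nearly empty.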
Tool-Pair Atomic Compaction
When compaction occurs, old messages cannot simply be dropped — tool-call pairing must be respected. When the assistant says “I’ll call read_file,” that’s followed by a tool result with a matching toolCallId. If the assistant message is dropped but the tool result is kept (or vice versa), the provider API rejects the request with “orphaned tool message” errors.
The solution: group messages into compaction units that are kept or dropped atomically.
def compact_history(history, max_chars, keep_last_n=4):
    # Step 1: Group into logical units
    units = []
    i = 0
    while i < len(history):
        msg = history[i]
        if msg.role == "assistant" and msg.has_tool_calls:
            # This assistant + all its tool results = one unit
            unit = [msg]
            i += 1
            while i < len(history) and history[i].role == "tool":
                unit.append(history[i])
                i += 1
            units.append(unit)
        else:
            # Single message = one unit
            units.append([msg])
            i += 1

    # Step 2: Check if compaction needed
    # (chars(unit) is a helper returning the unit's total content length)
    total_chars = sum(chars(unit) for unit in units)
    if total_chars <= max_chars or len(units) <= keep_last_n:
        return history  # Within budget, or nothing old enough to drop

    # Step 3: Drop oldest units, keep last N
    kept_units = units[-keep_last_n:]
    dropped = units[:-keep_last_n]

    # Step 4: Insert summary message
    summary = Message(
        role="user",
        content=f"[Earlier conversation compacted: {len(dropped)} turns removed. "
                "Previous tool results are no longer in history but their "
                "effects persist. You can re-read files if needed.]"
    )
    return [summary] + [msg for unit in kept_units for msg in unit]
Impact on accuracy: Moderate. The model loses the ability to reference old tool results, but can re-invoke tools if needed. This is acceptable when the system prompt already instructs the agent to re-read files when uncertain.
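The snippets in this article use `Message`, `chars`, and `group_into_units` without defining them. One plausible shape for those helpers is sketched below. The `Message` fields are assumptions inferred from how the snippets use them, not a real provider's message type.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str = ""
    has_tool_calls: bool = False

def group_into_units(history):
    """Group messages so an assistant tool call stays with its results."""
    units, i = [], 0
    while i < len(history):
        unit = [history[i]]
        i += 1
        if unit[0].role == "assistant" and unit[0].has_tool_calls:
            # Attach every tool result that directly follows the call.
            while i < len(history) and history[i].role == "tool":
                unit.append(history[i])
                i += 1
        units.append(unit)
    return units

def chars(unit):
    """Total content length of one unit, in characters."""
    return sum(len(msg.content) for msg in unit)

history = [
    Message("user", "Fix the failing build"),
    Message("assistant", "Reading the file", has_tool_calls=True),
    Message("tool", "public class UserService {}"),
    Message("tool", "BUILD FAILURE"),
    Message("assistant", "Found the problem"),
]
units = group_into_units(history)
print([len(u) for u in units])  # [1, 3, 1]
```

Because the assistant message and both tool results form one unit, any slice over `units` keeps or drops them together, which rules out orphaned tool messages.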
Layer 3: Aggressive Compaction (The Safety Valve)
When normal compaction isn’t enough — perhaps the user uploaded a 200k-character file and the history is already bloated — an aggressive mode keeps only the last 2 compaction units.
def compact_aggressively(history):
    units = group_into_units(history)
    # Keep only the last 2 units (current context + most recent action)
    kept = units[-2:] if len(units) >= 2 else units
    summary = Message(
        role="user",
        content="[Context aggressively compacted due to size limits. "
                "Only the most recent 2 exchanges remain. "
                "Re-read any files you need to reference.]"
    )
    return [summary] + [msg for unit in kept for msg in unit]
Impact on accuracy: Significant. The model loses most of its working memory. But it’s better than crashing — the agent can recover by re-reading files and re-running tools.
Layer 4: Reactive Error Handling (The Last Resort)
When all else fails and the provider returns “prompt too long,” the error is caught and the system retries with aggressive compaction:
def agent_loop_with_retry(history, max_retries=2):
    # history is passed in explicitly: reassigning a global inside the
    # function would make it a local and fail on the first call_ai(history).
    retries = 0
    while True:
        try:
            return call_ai(history)
        except PromptTooLongError:
            if retries >= max_retries:
                raise  # Give up
            retries += 1
            history = compact_aggressively(history)
            log.warning(f"Prompt too long, aggressively compacted (retry {retries}/{max_retries})")
This is the last line of defense. If aggressive compaction still doesn’t fit (e.g., the system prompt alone is too large), the second retry fails and the error is surfaced to the user.
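The retry path can be exercised without a model by stubbing the provider. In the sketch below, `fake_call_ai`, `keep_last_two`, and the three-message limit are all stand-ins invented for the demonstration. The helper also returns the compacted history, since the caller needs it for the next round.

```python
class PromptTooLongError(Exception):
    """Raised by the stub provider when the prompt exceeds its limit."""

def call_with_retry(call_ai, compact, history, max_retries=2):
    """Call the model; on overflow, compact the history and try again."""
    for attempt in range(max_retries + 1):
        try:
            return call_ai(history), history
        except PromptTooLongError:
            if attempt == max_retries:
                raise  # compaction did not help; surface the error
            history = compact(history)

def fake_call_ai(history):
    # Stub provider: pretends the window holds at most 3 messages.
    if len(history) > 3:
        raise PromptTooLongError(f"{len(history)} messages")
    return f"ok ({len(history)} messages)"

def keep_last_two(history):
    # Stand-in for aggressive compaction: placeholder plus last 2 items.
    return ["[compacted]"] + history[-2:]

reply, history = call_with_retry(fake_call_ai, keep_last_two, list("abcdefgh"))
print(reply)    # ok (3 messages)
print(history)  # ['[compacted]', 'g', 'h']
```

If the caller kept using the uncompacted history, the next round would overflow again and spend another retry on the same compaction.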
How the LLM Can Help Itself
The most underappreciated strategy: teaching the LLM to manage its own context through tool use.
The “Re-Read When Needed” Pattern
Instead of trying to keep everything in context, the agent can be instructed to:
- Read files once, act on them, then let them age out of context
- Re-read files when it needs to reference them again
- Use tools to reconstruct state rather than relying on memory
A system prompt might include:
When you need to reference a file you read earlier, re-read it rather than
trying to remember its contents. Context windows are limited, and old tool
results may have been compacted. The file system is your external memory.
This turns the context window from a “must remember everything” constraint into a “working memory” model. The agent learns that it’s okay to lose old context because it can always re-acquire information via tools.
Structured Tool Outputs
The LLM can also be encouraged to request structured outputs from tools when possible. Instead of:
Tool: read_file("src/main/java/UserService.java")
Result: [2000 lines of Java code]
The agent can ask for:
Tool: extract_class_signature("src/main/java/UserService.java")
Result: "public class UserService { ... } // 15 methods, 3 dependencies"
This requires custom tools, but dramatically reduces context consumption for common operations.
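A custom tool of this kind could be as simple as a few regular expressions. The sketch below is a rough, hypothetical take on `extract_class_signature` that works on source text rather than a path. Regex matching is fragile on real Java (annotations, generics, nested classes), so a proper parser would be the sturdier choice.

```python
import re

CLASS_RE = re.compile(r"\bclass\s+(\w+)")
IMPORT_RE = re.compile(r"^import\s+[\w.]+;", re.MULTILINE)
# Visibility, optional static, a return type, a name, a parameter list, "{".
# Constructors have no return type, so they are not counted.
METHOD_RE = re.compile(
    r"^\s*(?:public|protected|private)\s+(?:static\s+)?"
    r"[\w<>\[\], ]+\s+(\w+)\s*\([^)]*\)\s*\{",
    re.MULTILINE,
)

def extract_class_signature(source):
    """Summarize a Java source string in one short line."""
    match = CLASS_RE.search(source)
    name = match.group(1) if match else "<unknown>"
    methods = METHOD_RE.findall(source)
    imports = IMPORT_RE.findall(source)
    return f"class {name}: {len(methods)} methods, {len(imports)} imports"

JAVA = """\
import java.util.List;
import java.util.Optional;

public class UserService {
    private final UserRepository repo;

    public UserService(UserRepository repo) {
        this.repo = repo;
    }

    public User createUser(String name) {
        return repo.save(new User(name));
    }

    private List<User> findAll() {
        return repo.findAll();
    }
}
"""
print(extract_class_signature(JAVA))  # class UserService: 2 methods, 2 imports
```

The summary costs a few dozen characters of context where the full file would cost thousands, and the agent can still call `read_file` when it needs the method bodies.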
LLM-Assisted Context Summarization
Here’s where things get meta: the LLM can summarize its own context before it’s discarded. Instead of replacing dropped messages with a generic “[12 turns removed]” placeholder, the system makes a dedicated LLM call to distill the dropped history into a concise digest that preserves key decisions, discovered facts, and current task state.
The idea is simple. When proactive or aggressive compaction triggers, instead of throwing away old messages outright:
- Extract the messages about to be dropped
- Send them to the model with a summarization prompt
- Inject the resulting summary as the first message in the compacted history
def summarize_dropped_context(dropped_messages, model_client):
    """Ask the LLM to summarize messages before they're discarded."""
    conversation_text = "\n".join(
        f"{msg.role}: {msg.content}" for msg in dropped_messages
    )
    summary_prompt = f"""Summarize the following conversation excerpt in 3-5 sentences.
Focus on:
- Decisions made and their rationale
- Files read or modified (by path)
- Current task state and what remains to be done
- Any errors encountered and how they were resolved
- Key facts discovered via tool calls
Do NOT include pleasantries, acknowledgments, or meta-commentary.
Conversation:
{conversation_text}"""
    response = model_client.call(
        messages=[Message(role="user", content=summary_prompt)],
        max_tokens=300  # Keep the summary bounded
    )
    return response.content
The compacted history then starts with a rich summary instead of a generic placeholder:
def compact_with_summary(history, max_chars, keep_last_n=4, model_client=None):
    units = group_into_units(history)
    total_chars = sum(chars(unit) for unit in units)
    if total_chars <= max_chars:
        return history
    kept_units = units[-keep_last_n:]
    dropped_units = units[:-keep_last_n]
    dropped_messages = [msg for unit in dropped_units for msg in unit]

    # Generic placeholder, used when summarization is unavailable or fails
    content = (
        f"[Earlier conversation compacted: {len(dropped_units)} turns "
        "removed. Re-read files if needed.]"
    )
    # Attempt LLM summarization; fall back to generic message on failure
    if model_client and dropped_messages:
        try:
            summary_text = summarize_dropped_context(dropped_messages, model_client)
            content = (
                f"[Context summary from {len(dropped_units)} earlier turns: "
                f"{summary_text}]"
            )
        except Exception:
            pass  # Summarization failed: keep the generic fallback

    summary = Message(role="user", content=content)
    return [summary] + [msg for unit in kept_units for msg in unit]
Why this works better than generic placeholders. A generic “[12 turns removed]” tells the model nothing about what was removed. The model then has to re-read files and re-run tools just to reconstruct basic state — wasting tokens and rounds. A summary like “[We decided to use PostgreSQL over SQLite because of concurrent writes. Modified UserService.java to add retry logic. Build is currently failing due to a missing @Transactional annotation on createUser()]” gives the model enough context to continue without re-discovering everything from scratch.
The cost trade-off. The summarization call itself consumes tokens and adds latency — typically 1-3 seconds for a local model, 0.5-1 second for a cloud API. But it’s a single bounded call (the dropped messages go in, a short summary comes out), so the cost is predictable. On a 20-round session with 3 compaction events, that’s 3 extra LLM calls — a small price for significantly better continuity.
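[Flowchart, summarized: when compaction triggers, the messages about to be dropped are sent to the model with the summarization prompt. If the summary succeeds, it is injected as the first message. If not, the system falls back to the generic “[N turns removed]” message. Either way, the agent continues with the compacted history.]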
The diagram above shows the summarization flow with its graceful fallback. The key insight is that summarization is an enhancement, not a dependency — the system works without it, just better with it.
When to summarize. Not every compaction event warrants a summarization call. A good heuristic:
- Proactive compaction (Layer 2): Summarize if more than 4 units are being dropped. Below that, the generic placeholder is sufficient — there isn’t much to remember.
- Aggressive compaction (Layer 3): Always summarize. The model is about to lose most of its working memory, so a summary is critical for recovery.
- Emergency retry (Layer 4): Skip summarization. Speed matters more than quality here — the priority is getting a successful response, not preserving context.
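The heuristic above fits in a small function. This is a sketch under the article's stated thresholds. The layer names passed as strings and the name `should_summarize` are illustrative choices.

```python
def should_summarize(layer, dropped_units):
    """Decide whether a compaction event deserves a summarization call."""
    if layer == "emergency":
        # Layer 4: a fast successful response matters more than continuity.
        return False
    if layer == "aggressive":
        # Layer 3: always summarize, as long as something was dropped.
        return dropped_units > 0
    if layer == "proactive":
        # Layer 2: only worth an extra call when more than 4 units go.
        return dropped_units > 4
    raise ValueError(f"unknown layer: {layer!r}")

print(should_summarize("proactive", 4))    # False
print(should_summarize("proactive", 5))    # True
print(should_summarize("aggressive", 1))   # True
print(should_summarize("emergency", 10))   # False
```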
Recursive summarization. An interesting edge case: what if the dropped messages already contain a summary from a previous compaction? The summarization prompt handles this naturally — the LLM sees the earlier summary as part of the conversation and incorporates its key points into the new summary. Each compaction cycle distills the context further, like successive rounds of reduction in cooking. The information gets more concentrated but doesn’t disappear entirely.
Measuring the Impact
A benchmark of the four-layer strategy on a 20-round agentic coding task using a local Mistral 7B model (32k context window) shows:
| Strategy | Rounds Completed | Avg Accuracy | Context Utilization |
|---|---|---|---|
| No compaction | 4 (crashed) | 78% | 100% (overflow) |
| Tool truncation only | 8 (crashed) | 75% | 100% (overflow) |
| Tool + proactive compaction | 20 | 71% | 65% |
| All four layers | 20 | 70% | 62% |
Key findings:
- Tool truncation alone isn’t enough — history still grows unbounded from assistant messages and user prompts.
- Proactive compaction lets the run complete all 20 rounds but costs about 7 percentage points of accuracy (78% to 71%) on tasks requiring long-term memory.
- Aggressive compaction rarely triggers (<5% of runs) but prevents crashes when it does.
- The “re-read when needed” pattern allows the agent to recover accuracy by re-acquiring information on demand.
The accuracy loss from compaction, roughly 7 to 8 percentage points in this benchmark, is real but acceptable. Most agentic tasks are “do this thing” rather than “remember everything from 15 rounds ago.” And when the task does require long-term memory, the agent can compensate by re-reading files or re-running tools.
Takeaways
- Context management isn’t optional for local LLMs. A 32k-token window can fill up in 4-6 rounds of real agentic work. Plan for it from day one.
- Track last-call tokens, not cumulative tokens. Cumulative counters overestimate usage, and the error grows every round, because each call’s input already includes the full history. Use the most recent call’s input tokens to measure “how full is the window right now?”
- Respect tool-call pairing during compaction. Dropping an assistant message but keeping its tool results (or vice versa) causes hard provider errors. Group them into atomic units.
- Head+tail truncation beats head-only for tool outputs. Build logs and file contents have useful information at both ends. Preserve it.
- Teach the LLM to use tools as external memory. The “re-read when needed” pattern turns context limitations into a working memory model, reducing the accuracy cost of compaction.
- Layer your defenses. No single strategy is sufficient. Tool truncation + proactive compaction + aggressive fallback + error retry gives you resilience at every level.
- Aggressive compaction is a feature, not a bug. Losing context is better than crashing. The agent can recover by re-reading files, provided your system prompt encourages that behavior.
- Let the LLM summarize its own discarded context. A single extra LLM call during compaction produces a rich digest that replaces generic “[N turns removed]” placeholders. The model retains decisions, file paths, and task state, so it can continue without rediscovering everything from scratch.
What’s Next?
The current approach uses character counts as a proxy for tokens (roughly 4 chars ≈ 1 token for English/code). A proper tokenizer would provide more precise control, especially for non-Latin languages or code with lots of special characters.
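The character-to-token proxy is simple enough to write down. A minimal sketch, assuming the 4:1 ratio quoted above. `estimate_tokens` and `char_budget` are hypothetical helper names.

```python
CHARS_PER_TOKEN = 4  # rough average for English text and code

def estimate_tokens(text):
    """Approximate token count from character length, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)

def char_budget(context_window_tokens, fill_percent=70):
    """Character budget that corresponds to a fill level of the window."""
    # Integer math keeps the result exact.
    return context_window_tokens * fill_percent // 100 * CHARS_PER_TOKEN

print(estimate_tokens("x" * 120_000))  # 30000
print(char_budget(32_000))             # 89600
```

Under this ratio, a 120k-character history is about 30k tokens, which sits just below a 32k window. For text that tokenizes less efficiently the real count will be higher, so the proxy errs on the unsafe side there.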
Selective summarization is another promising direction: instead of summarizing all dropped messages uniformly, the system could weight summarization depth by relevance — giving more detail to messages related to the current task and less to tangential explorations. This would require a relevance scoring step before summarization, but could further improve the quality-to-cost ratio.
For now, the four-layer strategy with LLM-assisted summarization keeps local agents running reliably for 20+ rounds on consumer GPUs — a practical solution for production deployments.
Further reading:
- Lost in the Middle: How Language Models Use Long Contexts — Research on attention patterns in long prompts
- vLLM Configuration Options — How inference servers handle context memory
- Ollama FAQ: Context Window Configuration — Setting up local models with appropriate windows
- Claude Context Windows Documentation — How cloud providers handle large contexts