Claude Code’s Compaction Engine: The Architecture of Long-Context Reasoning

The fundamental challenge of modern AI agents is not just intelligence, but coherence over time. As an agent engages in a multi-hour session involving thousands of lines of terminal output, file edits, and tool calls, the context window—however vast—becomes a liability. A bloated context window leads to “hallucination by distraction,” where the model loses track of the core objective amidst the noise of past tool results. More practically, it leads to astronomical API costs as every new turn re-processes the entire historical bloat.

With the public emergence of Claude Code, the technical community has gained a rare, inside look into Context Engineering: the deterministic curation of an agent’s memory. This is not naive, lossy summarization; it is a multi-tier compaction strategy designed to maximize utility while religiously protecting the economics of the Prompt Cache.

The Three Tiers of Context Compaction

Analysis of the Claude Code source reveals that compaction is not a single “cleanup” event, but a tiered response triggered by specific token thresholds. Anthropic implements three distinct layers of context pruning, each more aggressive than the last:

Tier 1: Lightweight Deterministic Cleanup (Micro-Compaction)

Executed before every single API call, Tier 1 is the fastest and least expensive layer. It targets the “low-hanging fruit” of context bloat: old tool results. Instead of keeping every historical file read or terminal output, the engine clears all but the most recent five tool results. These are replaced with a simple placeholder: [Old tool result content cleared]. This tier requires zero LLM reasoning and effectively caps the growth of the conversation prefix during active debugging sessions. It is the “garbage collection” of the AI world.

Tier 2: API-Level Token Management

Tier 2 operates at the protocol level, often server-side. It manages the internal “thinking” blocks (the reasoning tokens) that models like Claude 3.7 or GPT-4.5 generate. These thinking blocks are useful for the current turn but are often redundant for future turns. Tier 2 selectively prunes these blocks and further trims historical tool results once a specific token “warmth” is reached, ensuring the agent doesn’t send its own internal monologue back to itself as part of the context.

Tier 3: LLM-Driven Recursive Summarization (The Last Resort)

When Tiers 1 and 2 are insufficient to keep the session within efficient bounds, the engine triggers Tier 3. This is the heavy hitter: a full LLM call where the model is asked to summarize the entire session into a highly structured 9-section schema. This summary includes:

Intent: The primary goal requested by the user.
Technical Concepts: Libraries, frameworks, and architectural decisions discovered.
Files Touched: A registry of modified or read files to maintain situational awareness.
Errors and Fixes: A log of what failed and how it was resolved, preventing the agent from repeating mistakes.
Pending Tasks: The remaining items in the TODO list.

This tier utilizes a chain-of-thought scratchpad to ensure no technical detail is lost before the raw history is discarded. It essentially “distills” the experience into a compact knowledge base.

The Economics of Prompt Caching: hits are Everything

The most critical engineering insight in Claude Code’s design is the protection of the Prompt Cache. In modern LLM pricing, cached tokens are up to 90% cheaper than fresh tokens. However, deleting even a single character from the middle of a conversation invalidates the entire cache for all subsequent messages. You end up paying 1.25x for “cache writes” instead of a 90% discount.

To solve this, Claude Code uses cache_edits. Instead of modifying local message history, it sends surgical instructions to the server to ignore specific tool blocks by their ID. This keeps the prompt cache “warm” and intact. Furthermore, when Tier 3 summarization triggers, it reuses the exact same system prompt and tools as the main conversation. The summarization command is simply appended at the end. This allows the summarization call to hit the existing cache, avoiding a 98% cache miss rate that would occur if a dedicated “summarizer” system prompt were used.

Post-Compaction Reconstruction: Avoiding the “Blank Stare”

After summarization, Claude Code doesn’t just drop the summary into an empty context and hope for the best. It performs a methodical **Reconstruction**:
1. It injects a boundary marker with pre-compaction metadata.
2. It re-reads the 5 most recently modified files (capped at 50K tokens) to restore immediate working memory.
3. It re-injects relevant Model Context Protocol (MCP) skills based on recent usage.
4. It restores the CLAUDE.md or project-specific instructions.

Crucially, the agent is told: “You were already working; don’t acknowledge the summary, just continue.” This ensures the transition is seamless for both the user and the agent’s internal state.

Engineering for the Agentic Era

As we move toward more autonomous systems like KavachOS, which handle their own authentication and delegation, the efficiency of their “working memory” is the primary bottleneck. Claude Code proves that the future of AI agents lies not in infinitely larger context windows, but in more elegant curation engines. For architects building on TurboQuant principles of data efficiency, the lesson is clear: Summary is the last resort. Deterministic curation is the first.

Source references:

– Claude Code’s Compaction Engine Analysis (Johnib)

– Barazany.dev Technical Breakdown

– Anthropic Claude Code Documentation

Discover more from Susiloharjo

Subscribe to get the latest posts sent to your email.