Claude Code’s 6-Week Quality Mystery: A Postmortem Analysis

TL;DR
– Three overlapping product changes caused a 3% quality drop across Claude Code (March-April 2026)
– Issues: reasoning effort downgrade, caching bug erasing thinking, system prompt verbosity limit
– All fixed by April 20; Anthropic reset usage limits for all subscribers
– Key lesson: AI system debugging requires complete context, broad evals, and tight user feedback loops

Three overlapping product changes caused a 3% quality drop across Sonnet 4.6, Opus 4.6, and Opus 4.7. Here’s what engineering teams can learn about AI system debugging and change management.

The Mystery

In early March 2026, Anthropic began receiving consistent user reports that Claude Code’s output quality had noticeably degraded. Users described the AI as “less intelligent,” “forgetful,” and “repetitive” during coding sessions. The feedback was widespread but inconsistent—some sessions worked fine, others fell apart after an hour of use.

For six weeks, Anthropic’s engineering team investigated. The API layer checked out. Model weights were unchanged. Internal evals showed no regression. Yet users were adamant: something had broken.

By April 23, Anthropic published a detailed postmortem revealing the root cause: not one bug, but three separate product changes that overlapped in production, each affecting different slices of traffic on different schedules. The aggregate effect looked like broad, inconsistent degradation.

All three issues were resolved by April 20 (v2.1.116). Anthropic reset usage limits for all subscribers as compensation.

The Three Issues

Issue #1: Reasoning Effort Downgrade (March 4 – April 7)

What happened: Anthropic changed Claude Code’s default reasoning effort from high to medium to reduce latency. Some users experienced extremely long thinking times in high mode, causing the UI to appear frozen.

The tradeoff: In general, longer thinking produces better output. Effort levels let users set that tradeoff—more intelligence versus lower latency and fewer usage limit hits. Internal evals showed medium effort achieved slightly lower intelligence with significantly less latency for most tasks.

The problem: Users noticed immediately. Claude Code felt less intelligent. Despite UI improvements to make the effort setting more visible (startup notices, inline selectors, bringing back ultrathink), most users retained the medium default.

Resolution: On April 7, Anthropic reversed the decision. All users now default to xhigh effort for Opus 4.7, and high effort for all other models.

Impact: Sonnet 4.6 and Opus 4.6.

Issue #2: Caching Bug That Erased Thinking (March 26 – April 10)

What happened: Anthropic shipped an efficiency improvement to clear Claude’s older thinking from sessions idle for over an hour. The design was sound: reduce users’ cost of resuming stale sessions by pruning unnecessary messages before sending to the API.

The bug: Instead of clearing thinking history once, it cleared on every turn for the rest of the session. After crossing the idle threshold, each request told the API to keep only the most recent reasoning block and discard everything before it.

The compounding effect: If a user sent a follow-up message while Claude was mid-tool-use, that started a new turn under the broken flag—dropping even the current turn’s reasoning. Claude continued executing but increasingly without memory of why it chose its actions. This surfaced as forgetfulness, repetition, and odd tool choices.

Why it was hard to catch: The bug sat at the intersection of Claude Code’s context management, the Anthropic API, and extended thinking. It passed multiple human and automated code reviews, unit tests, end-to-end tests, and dogfooding. Two unrelated internal experiments suppressed the bug in most CLI sessions, making reproduction nearly impossible for over a week.

Resolution: Fixed April 10 in v2.1.101. As part of the investigation, Anthropic back-tested Code Review against the offending pull requests using Opus 4.7. When provided complete repository context, Opus 4.7 found the bug; Opus 4.6 did not.

Impact: Sonnet 4.6 and Opus 4.6. Also drove reports of usage limits draining faster than expected (continuous cache misses from dropped thinking blocks).

Issue #3: System Prompt Verbosity Limit (April 16 – April 20)

What happened: Claude Opus 4.7 launched with a behavioral quirk: notable verbosity. While smarter on hard problems, it produced more output tokens. Anthropic tuned the system prompt to reduce verbosity before release.

The problematic instruction:

Length limits: keep text between tool calls to ≤25 words. Keep final responses to ≤100 words unless the task requires more detail.

The problem: After multiple weeks of internal testing with no regressions in the eval suite run, the change shipped with Opus 4.7 on April 16. During the investigation, broader ablations revealed a 3% quality drop for both Opus 4.6 and 4.7 on one evaluation.

Resolution: Reverted April 20 as part of the v2.1.116 release.

Impact: Sonnet 4.6, Opus 4.6, and Opus 4.7.

Why It Looked Like Broad Degradation

Each change affected a different slice of traffic on a different schedule:

Issue	Start Date	End Date	Models Affected
Reasoning effort	March 4	April 7	Sonnet 4.6, Opus 4.6
Caching bug	March 26	April 10	Sonnet 4.6, Opus 4.6
Prompt verbosity	April 16	April 20	Sonnet 4.6, Opus 4.6, Opus 4.7

The overlapping windows created a complex pattern. Users who hit the caching bug after an idle session experienced forgetfulness. Users on medium effort saw less intelligent output. Users with Opus 4.7 after April 16 got terse, lower-quality responses.

Neither Anthropic’s internal usage nor initial evals reproduced the issues. The reports were challenging to distinguish from normal variation in user feedback at first.

Lessons for Engineering Teams

1. Change Management in AI Systems Is Harder Than Traditional Software

AI systems have more degrees of freedom: model weights, system prompts, reasoning effort defaults, caching strategies, context management policies. Each can interact in non-obvious ways.

Best practice: Treat prompt changes with the same rigor as code changes. Run broad eval suites, perform ablations (removing lines to understand individual impact), and implement gradual rollouts with soak periods.

2. User Feedback Is Your Canary

Anthropic’s internal evals didn’t catch these issues. User feedback did. The /feedback command and specific, reproducible examples posted online ultimately allowed identification and resolution.

Best practice: Build tight feedback loops with production users. Instrument detailed telemetry. Don’t dismiss reports as “normal variation” without deep investigation.

3. Corner Cases Compound

The caching bug only triggered after an hour of idle time, then compounded on every subsequent turn. It was suppressed in most internal testing by orthogonal experiments.

Best practice: Test edge cases explicitly. Stale sessions, long-running processes, and stateful interactions need dedicated test coverage. Don’t rely solely on happy-path evals.

4. AI Agents Can Debug AI Systems (With Complete Context)

Opus 4.7 found the caching bug when given complete repository context; Opus 4.6 did not. This validates Anthropic’s investment in context engineering and tools like Code Review.

Best practice: Invest in tooling that gives AI agents complete context. The difference between success and failure often comes down to whether the model can see the full picture.

5. Transparency Builds Trust

Anthropic’s postmortem is remarkably detailed: specific dates, exact prompt instructions, technical root causes, and concrete remediation steps. They also created @ClaudeDevs on X for ongoing product communication.

Best practice: When things break, explain what happened, why, and what you’re changing. Users appreciate transparency over platitudes.

Best Practices for Your AI Deployments

Based on this postmortem, here’s a checklist for teams deploying AI systems:

Version your system prompts with the same discipline as code. Track every change.
Run ablations on prompt changes to understand the impact of each line.
Maintain a broad eval suite that covers edge cases, not just happy paths.
Implement gradual rollouts with soak periods for any change that could affect intelligence.
Instrument detailed telemetry on reasoning tokens, cache hit rates, and session duration.
Test stale sessions explicitly—idle timeouts, context pruning, and resumption logic.
Give AI debugging tools complete context when asking them to review AI system code.
Build tight user feedback loops and treat reports as high-priority signals.
Document tradeoffs explicitly (e.g., reasoning effort vs. latency) and make them user-configurable.
Publish postmortems when issues occur. Transparency compounds trust over time.

The Bigger Picture

This incident highlights a maturing challenge in AI engineering: as AI systems become more autonomous and stateful, traditional software debugging practices need to evolve. Context management, reasoning traceability, and prompt versioning are now critical infrastructure concerns.

Anthropic’s response—detailed postmortem, usage limit resets, improved Code Review tooling, tighter prompt change controls—demonstrates a team learning from production incidents. The fact that three separate changes interacted to create a complex failure mode should be a cautionary tale for any team deploying AI agents at scale.

For more on AI infrastructure and engineering best practices, see our previous analysis on essential AI terminology and system design patterns.

Call to Action

What’s your experience with AI system debugging? Have you encountered similar issues with reasoning models or agentic systems? Share your war stories in the comments below.

Subscribe to our newsletter for weekly deep dives on AI infrastructure, LLM engineering, and the tools shaping the future of software development.

This analysis is based on Anthropic’s official engineering postmortem published April 23, 2026. Technical details verified against primary sources.

—

## Further Reading

– cPanel Zero-Day Exploit in the Wild — practical security analysis
– Google AI Chips: Trillium vs H200 Deep Dive — hardware comparison

💬 Have a similar experience? Share it in the comments or contact us via our contact page.

Discover more from Susiloharjo

Subscribe to get the latest posts sent to your email.