Post 6: Monitoring & Continuous Improvement โ Making Your Agent Smarter Every Week
May 31, 2026 ยท 5 min read
You’ve built the pipeline. The cron job fires. PRs appear while you sleep. It works.
Now the question: does it keep working? And does it get better?
An unmonitored agent is a liability. It might silently produce bad code, burn through API credits, or miss entire classes of bugs. A monitored agent gets better every week.
Layer 1: Cost Tracking
Every Claude Code run in print mode with --output-format json returns structured cost data:
{
"session_id": "75e2167f-...",
"num_turns": 7,
"total_cost_usd": 0.3415,
"duration_ms": 18423,
"modelUsage": {"claude-sonnet-4-6": {"costUSD": 0.3415}}
}
Extract and log this for every ticket:
def track_cost(agent_name, output_json):
data = json.loads(output_json)
return {
"agent": agent_name,
"turns": data.get("num_turns", 0),
"cost": data.get("total_cost_usd", 0),
"model": list(data.get("modelUsage", {}).keys())[0],
"status": data.get("subtype", "unknown")
}
Store in CSV or SQLite, then run a nightly summary cron job:
hermes cron create "0 19 * * *" \
--name "cost-summary" \
--prompt "Read today's cost CSV. Summarize: total spent, per-agent breakdown, number of tickets, average cost per ticket. Send to Telegram."
You’ll get nightly reports like:
๐ฐ Cost Report โ May 31, 2026
โข Total: $1.23
โข Tickets: 3 (all fixed)
โข Avg cost/ticket: $0.41
Breakdown:
Coder: $0.82 (67%)
Reviewer: $0.29 (24%)
QC: $0.09 (7%)
Layer 2: Weekly Health Check
Every Monday, a cron job reviews the past week:
hermes cron create "0 8 * * 1" \
--name "weekly-health-check" \
--prompt "Review the past 7 days of agent costs and PRs.
1. Success rate: what % of tickets were fixed?
2. Rework rate: what % got REWORK from reviewer?
3. Average turns per fix (are we getting more efficient?)
4. Any patterns in failed tickets?
5. Recommendation: upgrade/downgrade models?"
Layer 3: Compounding CLAUDE.md
This is the most important improvement mechanism. From Boris Cherny, creator of Claude Code:
**"Let Claude write rules for itself. Any time Claude does something wrong, tell it: 'Update CLAUDE.md so you do not repeat this.'"**
After every PR review, check if the reviewer found issues:
def update_claude_md_from_review(review_output, issue_num):
violations = parse_convention_violations(review_output)
if violations:
with open("/home/hermes/erp-app/CLAUDE.md", "a") as f:
f.write(f"\n# Added from issue #{issue_num} review:\n")
for v in violations:
f.write(f"- {v}\n")
What this looks like in practice:
By week 4, Claude knows 12 project-specific gotchas. The same prompts produce dramatically better output.
Layer 4: Model Upgrade Decisions
Decision rule: If a bug takes >15 turns with Sonnet, try Opus next time for that bug class. If always <5 turns, downgrade to Haiku.
Layer 5: Human-in-the-Loop Feedback
Every time you review an agent PR and leave feedback, capture it:
After reviewing PR #523:
"You used Optional[T] but we prefer T | None in this codebase"
โ Add to CLAUDE.local.md: "Use T | None, not Optional[T]"
Merge Rate Tracking:
Healthy pipeline: >70% of agent PRs merged within 24 hours.
If merge rate drops below 50%:
The Weekly Maintenance Checklist
Every Monday, 15 minutes:
โ Read the weekly health report (automated)
โ Review cost summary โ any spikes?
โ Review failed tickets โ why did they fail?
โ Check CLAUDE.md โ anything to add or prune?
โ Check CLAUDE.local.md โ any new PR feedback to codify?
โ Review merge rate โ trending up or down?
โ Run: `hermes doctor` โ any dependency issues?
โ Purge old sessions: `hermes sessions prune --older-than 30`
The Long-Term Trajectory
Here’s what I’ve seen after 4 weeks:
The agent gets cheaper, faster, and more accurate the longer it runs. Not because the model improved โ because its knowledge of the project did.
Series Summary
Getting Started Today
1. Hour 1: Install Hermes + Claude Code. Get claude -p working. 2. Day 1: Write a solid CLAUDE.md with commands, architecture, gotchas. 3. Day 2: Create the cron job. Let it run overnight on a test repo. 4. Day 3: Add Reviewer + QC agents. Implement Writer/Reviewer. 5. Week 2: Connect to real GitHub Issues. Start with one bug class. 6. Week 3: Add cost tracking + weekly reports. 7. Week 4: Compound. Every mistake โ CLAUDE.md. Every PR feedback โ CLAUDE.local.md.
The hardest part isn’t the code. It’s the discipline of writing down every gotcha and treating CLAUDE.md as a living document.
This concludes the “AI Junior Developer” series. All posts at susiloharjo.web.id.
Related: 9 Skills That Made My AI Junior Dev 10x Smarter.
Related: I Let AI Run My Blog for a Month: What Broke and Worked.
Discover more from Susiloharjo
Subscribe to get the latest posts sent to your email.