Monitoring & Continuous Improvement - Making Your Agent Smarter

Post 6: Monitoring & Continuous Improvement - Making Your Agent Smarter Every Week

May 31, 2026 · 5 min read

You’ve built the pipeline. The cron job fires. PRs appear while you sleep. It works.

Now the question: does it keep working? And does it get better?

An unmonitored agent is a liability. It might silently produce bad code, burn through API credits, or miss entire classes of bugs. A monitored agent gets better every week.


Layer 1: Cost Tracking

Every Claude Code run in print mode with --output-format json returns structured cost data:


{
  "session_id": "75e2167f-...",
  "num_turns": 7,
  "total_cost_usd": 0.3415,
  "duration_ms": 18423,
  "modelUsage": {"claude-sonnet-4-6": {"costUSD": 0.3415}}
}

Extract and log this for every ticket:


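import json


def track_cost(agent_name, output_json):
    """Extract the per-run cost fields from Claude Code's JSON output."""
    data = json.loads(output_json)
    models = list(data.get("modelUsage", {}))
    return {
        "agent": agent_name,
        "turns": data.get("num_turns", 0),
        "cost": data.get("total_cost_usd", 0),
        # modelUsage can be missing or empty; avoid an IndexError in that case
        "model": models[0] if models else "unknown",
        "status": data.get("subtype", "unknown"),
    }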

Store in CSV or SQLite, then run a nightly summary cron job:


hermes cron create "0 19 * * *" \
  --name "cost-summary" \
  --prompt "Read today's cost CSV. Summarize: total spent, per-agent breakdown, number of tickets, average cost per ticket. Send to Telegram."

You’ll get nightly reports like:


💰 Cost Report - May 31, 2026
• Total: $1.23
• Tickets: 3 (all fixed)
• Avg cost/ticket: $0.41

Breakdown:
Coder: $0.82 (67%)
Reviewer: $0.29 (24%)
QC: $0.09 (7%)
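The logging and summary steps can be sketched in a few lines of Python. This is a minimal sketch rather than the pipeline's actual code: the function names, the column order, and the extra `ticket` column (needed to compute a per-ticket average, since one ticket involves several agent runs) are all assumptions. The remaining fields mirror the dict returned by `track_cost`.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Assumed column layout: "ticket" is a hypothetical addition,
# the other keys mirror the dict returned by track_cost.
FIELDS = ["ticket", "agent", "turns", "cost", "model", "status"]


def append_cost_row(csv_path, ticket, row):
    """Append one agent run to the CSV, writing the header on first use."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"ticket": ticket, **row})


def summarize_costs(csv_path):
    """Return total spend, ticket count, average per ticket, and per-agent totals."""
    per_agent = defaultdict(float)
    tickets = set()
    with open(csv_path, newline="") as f:
        for rec in csv.DictReader(f):
            per_agent[rec["agent"]] += float(rec["cost"])
            tickets.add(rec["ticket"])
    total = sum(per_agent.values())
    return {
        "total": round(total, 2),
        "tickets": len(tickets),
        "avg_per_ticket": round(total / len(tickets), 2) if tickets else 0.0,
        "per_agent": {agent: round(cost, 2) for agent, cost in per_agent.items()},
    }
```

Computing the numbers in code and handing the nightly prompt a ready-made summary keeps the arithmetic deterministic; the prompt then only has to format and send the report.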


Layer 2: Weekly Health Check

Every Monday, a cron job reviews the past week:


hermes cron create "0 8 * * 1" \
  --name "weekly-health-check" \
  --prompt "Review the past 7 days of agent costs and PRs.
1. Success rate: what % of tickets were fixed?
2. Rework rate: what % got REWORK from reviewer?
3. Average turns per fix (are we getting more efficient?)
4. Any patterns in failed tickets?
5. Recommendation: upgrade/downgrade models?"
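The first three questions in that prompt are plain arithmetic over a week of records, so they can also be computed directly instead of being left to the model. A minimal sketch, assuming each ticket is summarized as a dict with hypothetical `fixed`, `reworked`, and `turns` keys (these names are illustrative, not part of the pipeline):

```python
def weekly_metrics(tickets):
    """Compute success rate, rework rate, and average turns per fix.

    `tickets` is a list of dicts with assumed keys:
    fixed (bool), reworked (bool), turns (int).
    """
    if not tickets:
        return {"success_rate": 0.0, "rework_rate": 0.0, "avg_turns_per_fix": 0.0}

    fixed = [t for t in tickets if t["fixed"]]
    reworked = [t for t in tickets if t["reworked"]]
    # Average turns is taken over fixed tickets only, matching "turns per fix".
    avg_turns = sum(t["turns"] for t in fixed) / len(fixed) if fixed else 0.0
    return {
        "success_rate": round(100 * len(fixed) / len(tickets), 1),
        "rework_rate": round(100 * len(reworked) / len(tickets), 1),
        "avg_turns_per_fix": round(avg_turns, 1),
    }
```

Questions 4 and 5 (failure patterns, model recommendations) are the judgment calls where the weekly prompt earns its keep.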


Layer 3: Compounding CLAUDE.md

This is the most important improvement mechanism. From Boris Cherny, creator of Claude Code:

**"Let Claude write rules for itself. Any time Claude does something wrong, tell it: 'Update CLAUDE.md so you do not repeat this.'"**

After every PR review, check if the reviewer found issues:


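def update_claude_md_from_review(review_output, issue_num):
    """Append convention violations found by the reviewer to CLAUDE.md."""
    # parse_convention_violations is a project helper (not shown here),
    # assumed to return a list of rule strings found in the review text.
    violations = parse_convention_violations(review_output)

    if violations:
        with open("/home/hermes/erp-app/CLAUDE.md", "a") as f:
            f.write(f"\n# Added from issue #{issue_num} review:\n")
            for v in violations:
                f.write(f"- {v}\n")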
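`parse_convention_violations` is not defined in this post. One possible shape, purely as an illustration: assume the reviewer prompt asks for each violation on its own line with a `VIOLATION:` prefix (that prefix is an assumption, not something Claude Code emits by default). The sketch also adds a hypothetical `new_rules_only` helper that skips rules already present in the file, so CLAUDE.md does not fill up with duplicates.

```python
from pathlib import Path

PREFIX = "VIOLATION:"  # assumed marker requested in the reviewer prompt


def parse_convention_violations(review_output):
    """Return the rule text of every line that starts with the assumed prefix."""
    violations = []
    for line in review_output.splitlines():
        line = line.strip()
        if line.startswith(PREFIX):
            rule = line[len(PREFIX):].strip()
            if rule:  # ignore a bare prefix with no rule text
                violations.append(rule)
    return violations


def new_rules_only(violations, claude_md_path):
    """Drop rules already present in CLAUDE.md or repeated within this review."""
    path = Path(claude_md_path)
    existing = path.read_text() if path.exists() else ""
    seen = set()
    fresh = []
    for rule in violations:
        if rule not in existing and rule not in seen:
            fresh.append(rule)
            seen.add(rule)
    return fresh
```

The duplicate check is a plain substring match, which is crude: a reworded version of an existing rule will still be appended, so the weekly manual prune remains necessary.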

What this looks like in practice:

  • Week 1: Claude writes raw SQL → CLAUDE.md updated: “Never inline raw SQL, use queries.py”
  • Week 2: Claude uses naive datetime → “Date filtering uses timezone-aware objects”
  • Week 3: Claude forgets auth-failure test → “All new endpoint tests must include auth-failure case”

By week 4, Claude knows 12 project-specific gotchas. The same prompts produce dramatically better output.


Layer 4: Model Upgrade Decisions

| Scenario | Current | Upgrade To | Why |
|---|---|---|---|
| Simple CRUD bug | Haiku ✓ | - | Don't overpay |
| Logic bug, multi-file | Sonnet | Opus | Better reasoning |
| Security/auth bug | Sonnet | Opus + manual review | Never auto-merge auth |
| Refactoring 5+ files | Sonnet | Opus + plan mode | Explore before executing |

Decision rule: If a bug takes more than 15 turns with Sonnet, try Opus next time for that bug class. If a bug class always finishes in fewer than 5 turns, downgrade it to Haiku.
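The decision rule is mechanical enough to express as a small function over the turn counts recorded for one bug class. A minimal sketch: the `recommend_model` name and the tier names are assumptions, and the rule (stated above for Sonnet) is generalized here to "move one tier up or down" as an illustration.

```python
# Cheapest to most capable; tier names are illustrative labels, not model IDs.
TIERS = ["haiku", "sonnet", "opus"]


def recommend_model(current, turn_counts):
    """Apply the >15 / <5 turn rule to the recent runs of one bug class."""
    if not turn_counts:
        return current  # no data, no change
    i = TIERS.index(current)
    if max(turn_counts) > 15:
        # At least one run struggled: step up a tier (capped at the top).
        return TIERS[min(i + 1, len(TIERS) - 1)]
    if max(turn_counts) < 5:
        # Every run finished quickly: step down a tier (capped at the bottom).
        return TIERS[max(i - 1, 0)]
    return current


print(recommend_model("sonnet", [7, 18, 9]))  # opus
print(recommend_model("sonnet", [3, 4, 2]))   # haiku
```

Security and auth bugs should bypass a rule like this entirely, since the table calls for manual review regardless of turn count.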


Layer 5: Human-in-the-Loop Feedback

Every time you review an agent PR and leave feedback, capture it:

After reviewing PR #523:
"You used Optional[T] but we prefer T | None in this codebase"
→ Add to CLAUDE.local.md: "Use T | None, not Optional[T]"

Merge Rate Tracking:

Healthy pipeline: >70% of agent PRs merged within 24 hours.

If merge rate drops below 50%, the likely causes are:

  • CLAUDE.md is stale or wrong
  • Reviewer isn’t catching issues
  • Agent is attempting bugs it can’t fix
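
Merge rate is simple to compute once each agent PR has an opened timestamp and, if merged, a merged timestamp. A minimal sketch, assuming the PRs are already available as dicts with hypothetical `opened_at` and `merged_at` keys holding datetimes; fetching them from GitHub is left out, and the "watch" label for the 50-70% band is an assumption.

```python
from datetime import timedelta


def merge_rate_24h(prs):
    """Percentage of PRs merged within 24 hours of being opened."""
    if not prs:
        return 0.0
    window = timedelta(hours=24)
    merged_fast = sum(
        1
        for pr in prs
        if pr.get("merged_at") is not None
        and pr["merged_at"] - pr["opened_at"] <= window
    )
    return round(100 * merged_fast / len(prs), 1)


def pipeline_status(rate):
    """Map a merge rate to the thresholds used in this post."""
    if rate > 70:
        return "healthy"
    if rate < 50:
        return "investigate"
    return "watch"  # between the two thresholds; label is an assumption
```

Unmerged PRs count against the rate, which is the point: a PR that sits open for days is a signal even if it eventually lands.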

The Weekly Maintenance Checklist

Every Monday, 15 minutes:

☐ Read the weekly health report (automated)
☐ Review cost summary: any spikes?
☐ Review failed tickets: why did they fail?
☐ Check CLAUDE.md: anything to add or prune?
☐ Check CLAUDE.local.md: any new PR feedback to codify?
☐ Review merge rate: trending up or down?
☐ Run: `hermes doctor` to catch any dependency issues
☐ Purge old sessions: `hermes sessions prune --older-than 30`

The Long-Term Trajectory

Here’s what I’ve seen after 4 weeks:

| Metric | Week 1 | Week 4 | Change |
|---|---|---|---|
| Fix success rate | 62% | 85% | +23 pts |
| Avg turns per fix | 12.4 | 7.1 | -43% |
| Reviewer REWORK rate | 35% | 14% | -60% |
| CLAUDE.md gotchas | 3 | 12 | +300% |
| Cost per ticket | $0.61 | $0.38 | -38% |
| My time per ticket | 15 min | 3 min | -80% |

The agent gets cheaper, faster, and more accurate the longer it runs. Not because the model improved, but because its knowledge of the project did.
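
For reference, the relative changes in the table can be reproduced with the usual (new - old) / old formula. A small sketch using the Week 1 and Week 4 values above; note that the fix success rate moves from 62% to 85%, which is 23 percentage points rather than a 23% relative change, so it is left out of the loop.

```python
def pct_change(old, new):
    """Relative change from old to new, as a rounded whole percentage."""
    return round((new - old) / old * 100)


# (Week 1, Week 4) pairs copied from the table.
metrics = {
    "Avg turns per fix": (12.4, 7.1),
    "Reviewer REWORK rate": (35, 14),
    "CLAUDE.md gotchas": (3, 12),
    "Cost per ticket": (0.61, 0.38),
    "My time per ticket": (15, 3),
}

for name, (week1, week4) in metrics.items():
    print(f"{name}: {pct_change(week1, week4):+d}%")
```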


Series Summary

| Post | What | Key Insight |
|---|---|---|
| 1 | Vision | Agent-driven dev works today |
| 2 | Hermes Setup | One orchestrator (cron + delegation) |
| 3 | Claude Code Setup | CLAUDE.md is compounding infrastructure |
| 4 | Multi-Agent Team | Writer/Reviewer pattern, human-in-the-loop |
| 5 | Autonomous Pipeline | Cron + classify + spawn + report |
| 6 | Monitoring | Track costs, compound gotchas, upgrade when needed |

Getting Started Today

1. Hour 1: Install Hermes + Claude Code. Get `claude -p` working.
2. Day 1: Write a solid CLAUDE.md with commands, architecture, gotchas.
3. Day 2: Create the cron job. Let it run overnight on a test repo.
4. Day 3: Add Reviewer + QC agents. Implement Writer/Reviewer.
5. Week 2: Connect to real GitHub Issues. Start with one bug class.
6. Week 3: Add cost tracking + weekly reports.
7. Week 4: Compound. Every mistake → CLAUDE.md. Every PR feedback → CLAUDE.local.md.

The hardest part isn’t the code. It’s the discipline of writing down every gotcha and treating CLAUDE.md as a living document.


This concludes the “AI Junior Developer” series. All posts at susiloharjo.web.id.

Related: 9 Skills That Made My AI Junior Dev 10x Smarter.

Related: I Let AI Run My Blog for a Month: What Broke and Worked.

