Post 6: Monitoring & Continuous Improvement — Making Your Agent Smarter Every Week

May 31, 2026 · 5 min read

You’ve built the pipeline. The cron job fires. PRs appear while you sleep. It works.

Now the question: does it keep working? And does it get better?

An unmonitored agent is a liability. It might silently produce bad code, burn through API credits, or miss entire classes of bugs. A monitored agent gets better every week.

Layer 1: Cost Tracking

Every Claude Code run in print mode with --output-format json returns structured cost data:


{
  "session_id": "75e2167f-...",
  "num_turns": 7,
  "total_cost_usd": 0.3415,
  "duration_ms": 18423,
  "modelUsage": {"claude-sonnet-4-6": {"costUSD": 0.3415}}
}

Extract and log this for every ticket:


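import json

def track_cost(agent_name, output_json):
    """Extract the cost fields from one JSON result printed by Claude Code."""
    data = json.loads(output_json)
    # modelUsage may be missing or empty, so take the first key defensively
    # instead of indexing [0], which would raise IndexError.
    model = next(iter(data.get("modelUsage") or {}), "unknown")
    return {
        "agent": agent_name,
        "turns": data.get("num_turns", 0),
        "cost": data.get("total_cost_usd", 0),
        "model": model,
        "status": data.get("subtype", "unknown"),
    }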

Store in CSV or SQLite, then run a nightly summary cron job:


hermes cron create "0 19 * * *" \
  --name "cost-summary" \
  --prompt "Read today's cost CSV. Summarize: total spent, per-agent breakdown, number of tickets, average cost per ticket. Send to Telegram."

You’ll get nightly reports like:


💰 Cost Report — May 31, 2026

• Total: $1.23
• Tickets: 3 (all fixed)
• Avg cost/ticket: $0.41

Breakdown:
Coder: $0.82 (67%)
Reviewer: $0.29 (24%)
QC: $0.09 (7%)
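The storage and summary steps above can be sketched in plain Python, in case you prefer the numbers to be computed deterministically rather than by the prompt. This is a minimal sketch, not the series' actual implementation: `append_cost_row`, `summarize_costs`, and the `ticket` column are hypothetical names, and it assumes each row is a dict shaped like the output of `track_cost` plus a ticket ID that you supply.

```python
import csv
from collections import defaultdict
from pathlib import Path

# Column layout is an assumption: track_cost() fields plus a ticket ID.
FIELDS = ["ticket", "agent", "turns", "cost", "model", "status"]

def append_cost_row(csv_path, row, ticket_id):
    """Append one track_cost()-style dict to the CSV, adding a header if the file is new."""
    path = Path(csv_path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        record = {k: row.get(k, "") for k in FIELDS[1:]}
        record["ticket"] = ticket_id
        writer.writerow(record)

def summarize_costs(csv_path):
    """Return total spend, ticket count, average per ticket, and per-agent shares."""
    per_agent = defaultdict(float)
    tickets = set()
    with open(csv_path, newline="") as f:
        for r in csv.DictReader(f):
            per_agent[r["agent"]] += float(r["cost"] or 0)
            tickets.add(r["ticket"])
    total = sum(per_agent.values())
    return {
        "total": round(total, 4),
        "tickets": len(tickets),
        "avg_per_ticket": round(total / len(tickets), 4) if tickets else 0.0,
        "per_agent": {
            agent: {
                "cost": round(cost, 4),
                "share_pct": round(100 * cost / total) if total else 0,
            }
            for agent, cost in per_agent.items()
        },
    }
```

Writing one file per day (or filtering on a date column) keeps the nightly prompt small; the summary dict can then be passed to the agent for formatting instead of the raw rows.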

Layer 2: Weekly Health Check

Every Monday, a cron job reviews the past week:


hermes cron create "0 8 * * 1" \
  --name "weekly-health-check" \
  --prompt "Review the past 7 days of agent costs and PRs.
1. Success rate: what % of tickets were fixed?
2. Rework rate: what % got REWORK from reviewer?
3. Average turns per fix (are we getting more efficient?)
4. Any patterns in failed tickets?
5. Recommendation: upgrade/downgrade models?"
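The first three questions in that prompt are simple ratios, so they can also be computed directly from logged data. The sketch below assumes each ticket is recorded as a dict with the hypothetical keys `fixed`, `reworked`, and `turns`; neither the function name nor that record shape comes from Hermes or Claude Code.

```python
from statistics import mean

def weekly_metrics(tickets):
    """Compute success rate, rework rate, and average turns per successful fix.

    tickets: list of dicts with hypothetical keys
      'fixed' (bool), 'reworked' (bool), 'turns' (int).
    """
    if not tickets:
        return {"success_rate": 0.0, "rework_rate": 0.0, "avg_turns_per_fix": None}
    fixed = [t for t in tickets if t["fixed"]]
    reworked = [t for t in tickets if t["reworked"]]
    return {
        # Both rates are percentages of all tickets attempted this week.
        "success_rate": round(100 * len(fixed) / len(tickets), 1),
        "rework_rate": round(100 * len(reworked) / len(tickets), 1),
        # Only successful fixes count toward the efficiency metric.
        "avg_turns_per_fix": round(mean(t["turns"] for t in fixed), 1) if fixed else None,
    }
```

Questions 4 and 5 (failure patterns and model recommendations) are judgment calls, which is where the agent's weekly review still earns its keep.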

Layer 3: Compounding CLAUDE.md

This is the most important improvement mechanism. From Boris Cherny, creator of Claude Code:

**"Let Claude write rules for itself. Any time Claude does something wrong, tell it: 'Update CLAUDE.md so you do not repeat this.'"**

After every PR review, check if the reviewer found issues:


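def update_claude_md_from_review(review_output, issue_num):
    # parse_convention_violations is a project-specific helper (not shown here),
    # assumed to return a list of rule strings found in the reviewer's output.
    violations = parse_convention_violations(review_output)

    if violations:
        # Append mode: existing rules are kept, new ones go at the end.
        with open("/home/hermes/erp-app/CLAUDE.md", "a") as f:
            f.write(f"\n# Added from issue #{issue_num} review:\n")
            for v in violations:
                f.write(f"- {v}\n")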

What this looks like in practice:

Week 1: Claude writes raw SQL → CLAUDE.md updated: “Never inline raw SQL, use queries.py”

Week 2: Claude uses naive datetime → “Date filtering uses timezone-aware objects”

Week 3: Claude forgets auth-failure test → “All new endpoint tests must include auth-failure case”

By week 4, Claude knows 12 project-specific gotchas. The same prompts produce dramatically better output.
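One practical wrinkle with blind appends is that the same violation can recur and pile up duplicate rules. A minimal sketch of a deduplicating append follows; `append_new_rules` is a hypothetical helper, and the exact-match comparison (case-insensitive, ignoring the leading `- `) is an assumption that will not catch reworded versions of the same rule.

```python
from pathlib import Path

def append_new_rules(claude_md_path, rules, issue_num):
    """Append only rules not already in the file; return the rules that were added."""
    path = Path(claude_md_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    # Normalize existing lines: drop the "- " bullet prefix, trim, lowercase.
    seen = {line.lstrip("- ").strip().lower() for line in existing.splitlines()}

    new_rules = []
    for rule in rules:
        key = rule.strip().lower()
        if key and key not in seen:
            seen.add(key)
            new_rules.append(rule.strip())

    if new_rules:
        with path.open("a", encoding="utf-8") as f:
            f.write(f"\n# Added from issue #{issue_num} review:\n")
            f.writelines(f"- {r}\n" for r in new_rules)
    return new_rules
```

Near-duplicates still need a human eye, which is one reason the weekly checklist includes pruning CLAUDE.md.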

Layer 4: Model Upgrade Decisions

| Scenario | Current | Upgrade To | Why |
|---|---|---|---|
| Simple CRUD bug | Haiku ✓ | - | Don't overpay |
| Logic bug, multi-file | Sonnet | Opus | Better reasoning |
| Security/auth bug | Sonnet | Opus + manual review | Never auto-merge auth |
| Refactoring 5+ files | Sonnet | Opus + plan mode | Explore before executing |

Decision rule: If a bug takes more than 15 turns with Sonnet, try Opus next time for that bug class. If a bug class always resolves in fewer than 5 turns, downgrade it to Haiku.
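The decision rule can be written as a small function over the turn counts logged for one bug class. This is a sketch under assumptions: `recommend_model` and the three-step ladder are hypothetical, and generalizing the rule to "move one step up or down from the current tier" is my reading of it rather than something the rule states.

```python
# Cheapest to most capable; short tier names are placeholders, not model IDs.
MODEL_LADDER = ["haiku", "sonnet", "opus"]

def recommend_model(current, turn_counts, high=15, low=5):
    """Suggest a tier for a bug class from recent turn counts on the current tier."""
    if not turn_counts:
        return current  # no evidence yet, change nothing
    i = MODEL_LADDER.index(current)
    # Any single run above the high threshold is a signal to step up.
    if max(turn_counts) > high and i < len(MODEL_LADDER) - 1:
        return MODEL_LADDER[i + 1]
    # Step down only if every run stayed under the low threshold.
    if max(turn_counts) < low and i > 0:
        return MODEL_LADDER[i - 1]
    return current
```

For example, `recommend_model("sonnet", [7, 16, 9])` suggests stepping up, while `recommend_model("sonnet", [3, 4, 2])` suggests stepping down. Security and auth bugs should still go through manual review regardless of what the function returns.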

Layer 5: Human-in-the-Loop Feedback

Every time you review an agent PR and leave feedback, capture it:


After reviewing PR #523:
"You used Optional[T] but we prefer T | None in this codebase"
→ Add to CLAUDE.local.md: "Use T | None, not Optional[T]"

Merge Rate Tracking:

In a healthy pipeline, more than 70% of agent PRs are merged within 24 hours.

If the merge rate drops below 50%, the usual causes are:

- CLAUDE.md is stale or wrong
- The reviewer isn’t catching issues
- The agent is attempting bugs it can’t fix
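The merge-rate thresholds above can be checked with a few lines of Python once you have open and merge timestamps for each agent PR. This is a sketch with hypothetical names; where the timestamps come from (for example, your Git host's API) is left out, and the "watch" band between 50% and 70% is my own label for the gap the thresholds leave open.

```python
from datetime import datetime, timedelta

def merge_rate_24h(prs):
    """Percentage of PRs merged within 24 hours of opening.

    prs: list of (opened_at, merged_at) datetime pairs; merged_at is None
    for PRs that are still open or were closed without merging.
    """
    if not prs:
        return 0.0
    merged_fast = sum(
        1
        for opened, merged in prs
        if merged is not None and merged - opened <= timedelta(hours=24)
    )
    return 100 * merged_fast / len(prs)

def pipeline_health(rate):
    """Map a merge rate (percent) to a status using the 70% / 50% thresholds."""
    if rate > 70:
        return "healthy"
    if rate < 50:
        return "investigate"
    return "watch"  # assumed middle band, not defined by the thresholds themselves
```

Tracking this weekly rather than daily smooths out noise from weekends and small sample sizes.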

The Weekly Maintenance Checklist

Every Monday, 15 minutes:


☐ Read the weekly health report (automated)

☐ Review cost summary — any spikes?

☐ Review failed tickets — why did they fail?

☐ Check CLAUDE.md — anything to add or prune?

☐ Check CLAUDE.local.md — any new PR feedback to codify?

☐ Review merge rate — trending up or down?

☐ Run: `hermes doctor` — any dependency issues?

☐ Purge old sessions: `hermes sessions prune --older-than 30`

The Long-Term Trajectory

Here’s what I’ve seen after 4 weeks:

| Metric | Week 1 | Week 4 | Change |
|---|---|---|---|
| Fix success rate | 62% | 85% | +23 pts |
| Avg turns per fix | 12.4 | 7.1 | -43% |
| Reviewer REWORK rate | 35% | 14% | -60% |
| CLAUDE.md gotchas | 3 | 12 | +300% |
| Cost per ticket | $0.61 | $0.38 | -38% |
| My time per ticket | 15 min | 3 min | -80% |

The agent gets cheaper, faster, and more accurate the longer it runs. Not because the model improved — because its knowledge of the project did.
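The Change column in the table above can be reproduced from the Week 1 and Week 4 values. A minimal sketch follows (`relative_change_pct` is a hypothetical helper). One caveat when reading the table: for metrics that are already percentages, a difference in percentage points and a relative change are different numbers, so it helps to say which one you are reporting.

```python
def relative_change_pct(before, after):
    """Relative change from before to after, as a whole-number percentage."""
    return round(100 * (after - before) / before)

def point_change(before_pct, after_pct):
    """Absolute difference between two percentages, in percentage points."""
    return after_pct - before_pct

# Week 1 -> Week 4 values taken from the table above.
turns = relative_change_pct(12.4, 7.1)
rework = relative_change_pct(35, 14)
gotchas = relative_change_pct(3, 12)
cost = relative_change_pct(0.61, 0.38)
my_time = relative_change_pct(15, 3)

# Success rate went from 62% to 85%: 23 points, or about 37% in relative terms.
success_points = point_change(62, 85)
success_relative = relative_change_pct(62, 85)
```

The five relative figures match the table (-43, -60, +300, -38, -80), while the success-rate row corresponds to the 23-point difference rather than the relative change.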

Series Summary

| Post | What | Key Insight |
|---|---|---|
| 1 | Vision | Agent-driven dev works today |
| 2 | Hermes Setup | One orchestrator (cron + delegation) |
| 3 | Claude Code Setup | CLAUDE.md is compounding infrastructure |
| 4 | Multi-Agent Team | Writer/Reviewer pattern, human-in-the-loop |
| 5 | Autonomous Pipeline | Cron + classify + spawn + report |
| 6 | Monitoring | Track costs, compound gotchas, upgrade when needed |

Getting Started Today

1. Hour 1: Install Hermes + Claude Code. Get claude -p working.
2. Day 1: Write a solid CLAUDE.md with commands, architecture, gotchas.
3. Day 2: Create the cron job. Let it run overnight on a test repo.
4. Day 3: Add Reviewer + QC agents. Implement Writer/Reviewer.
5. Week 2: Connect to real GitHub Issues. Start with one bug class.
6. Week 3: Add cost tracking + weekly reports.
7. Week 4: Compound. Every mistake → CLAUDE.md. Every PR feedback → CLAUDE.local.md.

The hardest part isn’t the code. It’s the discipline of writing down every gotcha and treating CLAUDE.md as a living document.

This concludes the “AI Junior Developer” series. All posts at susiloharjo.web.id.
