I shipped a feature on a Tuesday that took 11 minutes end-to-end. The agent generated the happy path, ran the tests, opened the PR. I clicked merge. Done before lunch.
The same agent shipped a feature on a Friday that took me 6 more hours after the agent finished. The happy path looked identical. The difference was the last 20%.
That gap is what this post is about.
The 47 features
I have used Claude Code as my main code-writing tool since the start of the year. After month three, I started tracking time. Two numbers per feature: generation time, from first prompt to “here is the diff, want me to open a PR?”, and ship time, from PR open to merge with all checks green. I kept both numbers in a simple spreadsheet.
47 features later, the split is almost always 80/20: the agent gets about 80% of the feature done, and the last 20% is left to me. Give or take 10 points.
I expected the ratio to change as I got better at prompting. It did not. The agent got faster. I got faster. Both of us moved. The ratio did not. That is the part that took me by surprise.
Some examples from the spreadsheet:
– A user settings form. 4 minutes to generate, 38 minutes to ship. The form worked on day one. The 38 minutes was timezone handling for two Singapore users, and a “secondary email” field the prompt never mentioned.
– A webhook receiver. 12 minutes to generate, 4 hours to ship. Receiver worked on first deploy. The 4 hours was idempotency keys for a payment provider that retries on 2xx timeout, and a dead-letter queue for when the retry also fails.
– A CSV export. 6 minutes to generate, 2 hours to ship. Export worked on first deploy. The 2 hours was a date filter that broke across month boundaries, and a BOM character that Excel on Mac refused to render.
– A reporting query. 18 minutes to generate, 9 hours to ship. Query worked on the sample dataset. The 9 hours was a partition strategy that hit a hot shard in production, two missing indexes, and a permission issue my dev role hid.
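For anyone who wants to track the same two numbers, here is a minimal sketch of a spreadsheet row as a small Python record. The class and field names are hypothetical, and the only data is the four rows quoted above, converted to minutes.

```python
from dataclasses import dataclass


@dataclass
class FeatureTiming:
    name: str
    generation_min: float  # first prompt to "here is the diff"
    ship_min: float        # PR open to merge with all checks green

    @property
    def ship_multiple(self) -> float:
        """How many times longer shipping took than generating."""
        return self.ship_min / self.generation_min


# The four example rows from the list above, in minutes.
rows = [
    FeatureTiming("user settings form", 4, 38),
    FeatureTiming("webhook receiver", 12, 4 * 60),
    FeatureTiming("CSV export", 6, 2 * 60),
    FeatureTiming("reporting query", 18, 9 * 60),
]

for row in rows:
    print(f"{row.name}: ship time is {row.ship_multiple:.1f}x generation time")
```

Keeping both numbers on the same row is the point: the gap only becomes visible when generation time and ship time sit side by side.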
The agent wrote the right code for the prompt I gave it. The delay was in everything I had not told the agent, because I had not thought about it yet. Domain knowledge. Edge cases from past bugs. Things I know so well I forget to mention them.
That is the 20%. It is not in the prompt. It is in the parts of the problem I did not mention.
What the last 20% actually is
After 47 features, the 20% reliably clusters into 5 categories. Every feature I ship hits all 5. Some hit them hard, some barely, but none skips a category entirely.
Empty state. What does the page look like when the user has nothing? New account, empty database, fresh tenant, first run. The agent assumes the data is there because the prompt says “show the user’s invoices.” Real users show up with zero invoices. The agent does not write the empty-state UI. You find out three days after launch from a support email. You spend 40 minutes writing the empty state.
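As a rough sketch, the missing branch is often only a few lines. The function name, the invoice fields, and the message text below are all hypothetical. Only the "zero invoices" case comes from the paragraph above.

```python
from typing import Sequence


def render_invoice_list(invoices: Sequence[dict]) -> str:
    """Return display text for a user's invoices, including the empty case."""
    if not invoices:
        # Empty state: new account, empty database, fresh tenant, first run.
        return "No invoices yet. Create your first invoice to see it here."
    # Happy path: one line per invoice.
    lines = [f"{inv['number']}: {inv['total']}" for inv in invoices]
    return "\n".join(lines)
```

The code itself is trivial. The cost is in remembering that the branch has to exist.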
Error handling. What happens when the network fails? When the third-party API returns 500? When the database connection drops mid-query? The agent writes the happy path. The agent assumes everything succeeds. Every try-catch, every fallback UI, every “what does the user see when this breaks” decision is yours. For the webhook receiver above, the agent generated 80 lines. I added 140 lines of error handling and dead-letter logic before it was production-ready.
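To make the "retry, then dead-letter" idea concrete, here is a minimal sketch. It is not the code from my receiver. The function name, the in-memory list standing in for a dead-letter queue, and the retry and backoff defaults are all assumptions chosen for illustration.

```python
import time
from typing import Callable, List


def process_with_dead_letter(
    event: dict,
    handler: Callable[[dict], None],
    dead_letters: List[dict],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> bool:
    """Try the handler a few times; park the event if every attempt fails."""
    last_error = ""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return True
        except Exception as exc:  # real code should catch narrower exception types
            last_error = repr(exc)
            if attempt < max_attempts:
                # Exponential backoff: base, 2x base, 4x base, ...
                time.sleep(base_delay_s * 2 ** (attempt - 1))
    # Every attempt failed: keep the event and the reason for later inspection.
    dead_letters.append(
        {"event": event, "error": last_error, "attempts": max_attempts}
    )
    return False
```

In production the list would be a durable queue or table, so that a failed event survives a worker restart. The shape of the decision stays the same: bounded retries, then park the event where a human can find it.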
Domain-specific edge cases. The agent does not know that “empty” means three different things in three different parts of the ERP. It does not know the Indonesian payment format needs a different parser. It does not know about the legacy data with the old format. It does not know about the enterprise customer who uses the product with a regional config nobody told the agent about. I know these things because I have been debugging them for two years. The agent has never heard of them.
Performance cliff. The agent writes code that works on the example you gave it. It does not stress-test for scale. The reporting query worked on 50 rows. It did not work on 5 million rows because the planner picked a sequential scan on a freshly partitioned table. The webhook receiver worked on 100 requests per minute. It did not work on 10 requests per second because the idempotency cache was an in-memory dict that crashed the worker after 200 MB.
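One way to avoid that particular cliff is to cap the cache. The sketch below is an assumption about what a bounded version could look like, not what I shipped. The class name, the entry limit, and the TTL are hypothetical defaults.

```python
import time
from collections import OrderedDict
from typing import Optional


class BoundedIdempotencyCache:
    """Remembers recently seen keys, with a hard cap on how many it holds."""

    def __init__(self, max_entries: int = 100_000, ttl_s: float = 24 * 3600):
        self._seen: "OrderedDict[str, float]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl_s = ttl_s

    def seen_before(self, key: str, now: Optional[float] = None) -> bool:
        """Return True if key was recorded within the TTL; record it otherwise."""
        now = time.monotonic() if now is None else now
        recorded_at = self._seen.get(key)
        if recorded_at is not None and now - recorded_at < self._ttl_s:
            return True
        # New or expired key: (re)record it as the newest entry.
        self._seen[key] = now
        self._seen.move_to_end(key)
        # Evict the oldest entries so memory stays bounded.
        while len(self._seen) > self._max_entries:
            self._seen.popitem(last=False)
        return False

    def __len__(self) -> int:
        return len(self._seen)
```

The trade-off is that an evicted key looks new again, so a very late retry could be processed twice. If that is unacceptable, the keys belong in a shared store with a uniqueness guarantee rather than in worker memory.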
Maintainability tax. I notice this one later. The agent writes code for today. Three months from now, when the requirements shift, the abstraction the agent chose does not fit. Refactoring costs more than rewriting would have. I have done this twice in the last six months. Both times I regretted not writing the more verbose version.
The 4 things I changed
I tried a lot of things. Most did not work. These 4 did.
I budget 4x. When the agent says “this is a 10-minute feature,” I plan for 40. I have not been wrong about this yet. The agent has gotten faster. My estimate of ship time has not. The 4x is not pessimistic. It is just the pattern.
I prompt for the unhappy path first. Before the agent writes the happy path, I add to the prompt. “What should this look like when the input is empty?” “What should this look like when the network fails?” “What should this look like when the user does something you did not anticipate?” The agent will not think of these on its own. If I name them, it takes a pass. The pass is not great. But it gives me a starting point instead of a blank page.
I write the failure tests first. I resisted this longest because it felt slow. Then I tried it for two weeks and I am not going back. What would break this? What would a real user do that I did not anticipate? I write those tests first, so the agent has a target when it generates the code. The tests catch about 70% of what would have eaten my ship time. The other 30% still show up, but I find them during test-writing. Not after I clicked merge.
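Here is a sketch of what "failure tests first" can look like, using the month-boundary date filter from the CSV export as the target. The function name and row shape are hypothetical. The small implementation is included only so the file runs on its own. In practice the tests come first and the agent fills in the function.

```python
import unittest
from datetime import date
from typing import Iterable, List


def filter_by_date_range(rows: Iterable[dict], start: date, end: date) -> List[dict]:
    """Keep rows whose 'created' date falls in [start, end], inclusive."""
    return [row for row in rows if start <= row["created"] <= end]


class DateFilterFailureTests(unittest.TestCase):
    ROWS = [
        {"id": 1, "created": date(2024, 1, 31)},
        {"id": 2, "created": date(2024, 2, 1)},
        {"id": 3, "created": date(2024, 2, 29)},
    ]

    def test_range_spanning_a_month_boundary(self):
        kept = filter_by_date_range(self.ROWS, date(2024, 1, 31), date(2024, 2, 1))
        self.assertEqual([row["id"] for row in kept], [1, 2])

    def test_empty_input_returns_empty_list(self):
        kept = filter_by_date_range([], date(2024, 1, 1), date(2024, 1, 31))
        self.assertEqual(kept, [])

    def test_reversed_range_matches_nothing(self):
        kept = filter_by_date_range(self.ROWS, date(2024, 2, 1), date(2024, 1, 31))
        self.assertEqual(kept, [])
```

Saved as a test module, this runs with `python -m unittest`. Each test name doubles as a line in the prompt: it tells the agent which unhappy path has to pass.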
I keep a 20% journal. One line per feature. “The last 20% of [feature] was [what I spent the time on].” I have 47 entries. The first 10 are mostly empty-state and error-handling. The middle 20 are domain edge cases. The last 17 are split between performance and maintainability. The pattern is consistent enough that I now know which category to expect. Webhooks are almost always error handling. Reports are almost always performance. Exports are almost always date formats.
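The journal does not need tooling, but if it lives in a file, a tally is a few lines. This is a minimal sketch. The tuple layout and the sample entries are hypothetical and are not my actual journal. Only the five category names come from this post.

```python
from collections import Counter
from typing import Iterable, Tuple

CATEGORIES = {
    "empty state",
    "error handling",
    "domain edge case",
    "performance",
    "maintainability",
}

# Hypothetical entries: (feature, category, what the time went to).
journal = [
    ("webhook receiver", "error handling", "idempotency keys and dead-letter queue"),
    ("reporting query", "performance", "hot shard and missing indexes"),
    ("invoice page", "empty state", "nothing rendered for new accounts"),
    ("retry worker", "error handling", "no fallback when the API returned 500"),
]


def tally(entries: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count journal entries per category, rejecting unknown categories."""
    counts: Counter = Counter()
    for feature, category, _note in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category for {feature!r}: {category!r}")
        counts[category] += 1
    return counts


for category, count in tally(journal).most_common():
    print(f"{category}: {count}")
```

Forcing every entry into one of the five categories is deliberate. It keeps the journal comparable from feature to feature.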
The one rule
Before I open the agent on any feature, I ask one question: “What is the user going to do that I am not thinking about?”
If I cannot answer in 10 seconds, I do not open the agent. I sit with the question instead. Sometimes the answer is “nothing, this is simple.” Sometimes the answer is “oh, the user will import 50,000 rows from a CSV.” When the answer is the second one, I add the CSV import to the prompt first.
This rule has saved me the most time. Not because the prompt gets longer. Because I think first. The 20% is the parts of the problem I did not mention. The best way to find them is to ask before generating, not after.
I am not saying the agent is bad. The agent is the reason I shipped 47 features in 6 months instead of 12. The 80% in 10 minutes is real. I would not go back to writing it by hand. But the 20% is real too. If I pretend it is not there, my velocity numbers do not match my actual ship time.
How fast I can type a prompt is not the same as how long until the feature is in production. The agent makes the first number small. The second number is what actually matters.
If you have tracked your own 80/20 numbers, send them my way. I have compared notes with three other engineers and the pattern looks similar. That is a small group. More data would either confirm the 80/20 rule is universal, or show where it breaks.