Why I Stopped Optimizing My AI Agent and Started Shipping It

In the first quarter of this year, I spent 90 days tuning an AI agent. I rewrote its prompt three times. Changed the memory backend twice. Added a supervisor loop, removed it, added a different one. By March, the agent did exactly what I wanted. It also had zero users.

A friend shipped a similar tool in week two. Her version was rough: the prompt had typos, and the memory was a single JSON file. It also had 200 paying users by month-end. I asked for her secret. She said: “I just kept shipping the version that worked well enough for the next person to pay me.”

That sentence rearranged my brain. Here is what I learned from spending three months on what should have been a three-week project.

The optimization trap looks like progress

When you are building an AI agent, every tweak feels like forward motion. A different prompt template — eval goes from 0.71 to 0.74. Swap the embedding model — retrieval gets sharper. Add a guardrail — cost goes up, but quality goes up too. You tell yourself this is what shipping looks like.

It is not.

Shipping looks like someone using your thing in the wild, encountering a problem, and coming back the next day. The optimization loop has no such user. You are tuning for a metric you defined yourself, on a test set you curated yourself, against a competitor you imagined. The only honest feedback is revenue, retention, or a real human telling you the agent broke.

I have written about agent eval patterns before, in posts on monitoring and continuous improvement and in the multi-agent tooling breakdown. None of that work was wasted, exactly. But I was confusing architectural depth with product traction.

The cost of a great-but-unused agent

Here is the part nobody talks about: an unused agent is not zero value. It is negative value. Every week you spend tuning is a week you are not talking to users. Every cycle spent on an eval is a cycle not spent on the actual problem. The longer the gap between “I built it” and “someone used it”, the more your own confidence degrades.

A shipped, rough, used-by-five-people agent is worth more than a perfect, used-by-nobody agent. The first one teaches you what to build next. The second one teaches you nothing because you are the only one touching it.

What I changed after the 90-day wake-up

I started treating the agent as a product from day one. Not as a research project. Four changes:

First, the prompt went in a file the user could read. If the user cannot see how the agent decides, they cannot trust it, and they cannot debug it when it fails. The first version of my new agent had a 12-line prompt with examples. That is what shipped. The 90-day version had a 200-line prompt with a personality guide and an “if-uncertain” tree. Both worked. Only the 12-line one was inspectable.
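A minimal sketch of the prompt-in-a-file idea in Python, not the code that shipped. The file name prompt.txt and the function names load_prompt and build_request are assumptions made for illustration; the model call itself is left out.

```python
from pathlib import Path

# Hypothetical location: the point is only that the prompt lives in a
# plain-text file the user can open, read, and edit.
PROMPT_PATH = Path("prompt.txt")


def load_prompt(path: Path = PROMPT_PATH) -> str:
    """Read the prompt at call time so edits apply without a code change."""
    return path.read_text(encoding="utf-8")


def build_request(user_input: str, path: Path = PROMPT_PATH) -> str:
    """Join the file-based prompt and the user's input into one string."""
    return f"{load_prompt(path).strip()}\n\nUser: {user_input}"
```

Because the prompt is read on every call, a user who edits the file sees the change on the next request, which is what makes a short prompt debuggable by someone other than its author.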

Second, I picked a single happy-path metric and ignored the rest. For two weeks I optimized the agent on one real user task. Not an eval set. Not a benchmark. One task. One user. The agent solved it on the first try in 70 percent of attempts. I shipped that. I did not wait for 90 percent.
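The single metric described above can be computed in a few lines. This is a sketch under assumptions: the Attempt record, its field names, and the ready_to_ship helper are hypothetical, and the 0.70 threshold simply mirrors the 70 percent figure from this post.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Attempt:
    task_id: str       # hypothetical label for one run of the user's task
    tries_needed: int  # 1 means solved on the first try


# Mirrors the 70 percent figure in the post; adjust to taste.
SHIP_THRESHOLD = 0.70


def first_try_rate(attempts: Sequence[Attempt]) -> float:
    """Share of attempts solved on the first try (0.0 if there are none)."""
    if not attempts:
        return 0.0
    solved_first = sum(1 for a in attempts if a.tries_needed == 1)
    return solved_first / len(attempts)


def ready_to_ship(attempts: Sequence[Attempt],
                  threshold: float = SHIP_THRESHOLD) -> bool:
    """True once the first-try rate reaches the chosen threshold."""
    return first_try_rate(attempts) >= threshold
```

One number and one threshold keep the decision binary: either the rate clears the bar and the agent ships, or it does not.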

Third, I added a feedback channel before the second iteration. A simple “did this help, yes or no” button. Even with three users, that button is more data than I had in 90 days of internal benchmarking.
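A yes-or-no button needs very little behind it. The sketch below assumes each vote is appended as one JSON line to a local file; the file name feedback.jsonl, the field names, and both function names are hypothetical.

```python
import json
import time
from pathlib import Path

# Hypothetical log file: one JSON object per line, one line per vote.
FEEDBACK_LOG = Path("feedback.jsonl")


def record_feedback(response_id: str, helped: bool,
                    log_path: Path = FEEDBACK_LOG) -> None:
    """Append a single 'did this help' vote to the log."""
    entry = {"response_id": response_id, "helped": helped, "ts": time.time()}
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def helped_rate(log_path: Path = FEEDBACK_LOG) -> float:
    """Fraction of recorded votes that said yes (0.0 if no votes yet)."""
    if not log_path.exists():
        return 0.0
    lines = log_path.read_text(encoding="utf-8").splitlines()
    votes = [json.loads(line)["helped"] for line in lines if line.strip()]
    return sum(votes) / len(votes) if votes else 0.0
```

An append-only file is enough at three users. It can be swapped for a database later without changing what the button means.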

Fourth, I scheduled ship dates as immovable. If a feature was not in by Wednesday, it did not ship Friday. The first version was feature-frozen at week two. Everything after that was observability and bug fixes.

The pattern I now follow for any agent project

When I start a new agent, the first commit is not the prompt. It is the empty directory with a README that says: “By day 14, this will be used by 3 people for task X.” Everything else is a means to that end. The eval set is whatever those three people say in the first two weeks. The architecture is whatever gets there by the simplest path. The memory backend is whatever I can ship in an afternoon.
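For a sense of scale, an afternoon-sized memory backend can be a single JSON file. The JsonFileMemory class and its method names below are hypothetical, and the sketch assumes one process writing at a time and values that are JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


class JsonFileMemory:
    """Key-value memory kept in one JSON file (single writer assumed)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def load(self) -> Dict[str, Any]:
        """Return everything stored so far, or an empty dict if no file yet."""
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text(encoding="utf-8"))

    def remember(self, key: str, value: Any) -> None:
        """Store one value and rewrite the whole file."""
        data = self.load()
        data[key] = value
        self.path.write_text(json.dumps(data, indent=2), encoding="utf-8")

    def recall(self, key: str, default: Optional[Any] = None) -> Any:
        """Fetch one value, falling back to a default when the key is absent."""
        return self.load().get(key, default)
```

Rewriting the whole file on every write would not scale, but it is readable in a text editor and trivially replaceable once real usage shows what the memory needs to hold.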

The 200-line prompt can come later. The supervisor loop can come later. The fine-tuned retriever can come later. None of those are the product. The product is the human’s problem being solved. Everything else is plumbing.

This sounds obvious when I write it out. I have known this for years about every other kind of software. But AI agents are seductive because they look like research. Research rewards long quiet months of tuning. Products reward short loud weeks of shipping. If you are building an agent, decide which one you are doing by week two.

A short list of questions to ask before week three

If you are mid-project and not sure which mode you are in, ask yourself:

  • Have I shown the agent to a real person who is not me, in the last seven days?
  • Have I changed the agent’s behavior based on something a user said, in the last seven days?
  • Have I rejected a feature idea because it would not help the next paying user?
  • Have I deleted code because it was making the agent slower to ship?

If the answer to any of those is no, I am not shipping. I am tinkering. Tinkering is fine. But call it tinkering. Do not call it progress.

The takeaway

Ship the messy version. Ship the version with the typo in the prompt. Ship the version where the memory is a single JSON file. Ship the version where the eval is “I asked my friend if it worked.”

The cost of waiting is not just time. The cost of waiting is that the next person to ship in your space gets the user’s habits, gets the user’s trust, and gets the user’s data. Those compound. By the time your perfect agent lands, the user has already taught another product how to work for them.

I spent 90 days tuning. I shipped in week one of the next quarter. The shipped version is the one that has users. The tuned version is the one I have not touched in months.

That is the only eval that matters.

