60 Percent of My API Calls Were Cached. I Turned It Off.

It is Tuesday afternoon. I am looking at my Grafana dashboard. The cache hit rate says 60 percent. Six out of ten API requests are being served from Redis, not the database.

By every metric I learned, this should be a win. Cache hits are fast. Database queries are slow. The math is simple.

But my p95 latency went up 40 milliseconds after I added caching.

Not down. Up.

I spent three days chasing this. I added more cache. I tuned TTLs. I pre-warmed the cache with likely queries. Nothing helped. The more I cached, the slower things got.

Then I found the bug. It wasn’t in the cache layer. It wasn’t in the database. It was in my assumptions about what caching actually does.
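One way a 60 percent hit rate can sit next to a worse p95, as a minimal sketch. Every latency below is hypothetical and picked only for illustration, and this is not a claim about what the actual bug was. The idea: with 40 percent of requests still missing the cache, the 95th percentile lands on a miss, and a miss now pays for the cache lookup and write-back on top of the database query.

```python
import math

# Hypothetical latencies in milliseconds (assumptions, not measurements).
DB_MS = 120            # plain database query
CACHE_HIT_MS = 5       # request served from the cache
MISS_OVERHEAD_MS = 40  # extra cost on a miss: lookup, serialization, write-back


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def simulate(n_requests, hit_rate, cached):
    """Return one latency per request for a simple two-outcome model."""
    if not cached:
        return [DB_MS] * n_requests
    n_hits = round(n_requests * hit_rate)
    hits = [CACHE_HIT_MS] * n_hits
    misses = [DB_MS + MISS_OVERHEAD_MS] * (n_requests - n_hits)
    return hits + misses


before = simulate(1000, 0.0, cached=False)
after = simulate(1000, 0.6, cached=True)

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)
p95_before = percentile(before, 95)
p95_after = percentile(after, 95)

print(f"mean: {mean_before:.0f} ms -> {mean_after:.0f} ms")
print(f"p95:  {p95_before} ms -> {p95_after} ms")
```

In this toy model the mean improves while the p95 gets worse, because the p95 only sees the slow path. A hit rate says how many requests avoid the database, not how long the rest take.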

I Stopped Self-Hosting AI: Why DeepSeek V4 Pro on Ollama Cloud Is My New Default

Susiloharjo

The most-said line in my group chats this week was three words: “I miss Fable.”

Not in a nostalgic way. In a “my entire workflow is broken” way.

Fable was the model I used for first-draft generation. Fast, cheap, good enough for 80 percent of the work. Then it vanished. No deprecation warning. No migration path. Just gone.

My first reaction was what a lot of people are doing now: go local. Buy a GPU, run llama.cpp, never depend on a vendor again. I spent $1,400 on a used RTX 4090. I downloaded 150GB of model weights. I learned to love the sound of my fans spinning at 80 percent.

For one month, self-hosting worked. Then the novelty wore off.

The 4090 draws 450W under load. My electricity bill went up $35. The 70B models I was running maxed out at 32K context, which is not enough for full codebase reviews. Batch processing hundreds of documents meant queuing jobs overnight. And when Opus 4.8 dropped with significantly better reasoning, I had no way to access it without going back to cloud anyway.

I was renting infrastructure, not avoiding vendors. The landlord just changed from Anthropic to NVIDIA.

Then I tried DeepSeek V4 Pro on Ollama Cloud. The pricing made me reconsider everything.

Opus 4.8 Plans, Gemini 3.5 Executes — I Sit in the Middle

Susiloharjo

For the last six weeks I have been running my project work through a two-agent loop, and it has changed how I think about AI assistants. Opus 4.8 plans. Gemini 3.5 executes. I sit between them as the human in the loop, and the work gets faster and cleaner than any single-agent setup I have run before.

This is what the flow looks like, what each model is actually good at, and where the loop breaks when I push it too hard.

RAG Retrieval Is Filtering, Not Search.

Susiloharjo

I have been building RAG pipelines for two years. The mental model I started with was wrong, and reading Angela Shi’s article “Retrieval Is Filtering, Not Search” on Towards Data Science this week made the fix click.

The standard framing of RAG retrieval is “find the passages most similar to the query.” That framing is misleading because it imports the wrong mental model. Retrieval is not a Google-style search across unstructured text. It is a filtering problem on structured tables. The closer mental model is a SQL query, not a Google search.

This is the article that should have existed when I started. Here is what I learned, and what I am changing in my own RAG pipelines because of it.
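The "SQL query, not Google search" framing can be sketched as follows. This is a minimal illustration under my own assumptions, not code from Angela Shi's article: the `chunks` table, its columns, the two-dimensional embeddings, and the `retrieve` helper are all hypothetical. A `WHERE` clause on structured columns narrows the candidates first, and similarity only ranks what survives the filter.

```python
import math
import sqlite3

# Hypothetical schema: each chunk carries structured metadata plus a toy 2-d embedding.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks ("
    "id INTEGER PRIMARY KEY, doc_type TEXT, year INTEGER, text TEXT, e0 REAL, e1 REAL)"
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "policy", 2024, "Refunds are issued within 14 days.", 0.90, 0.10),
        (2, "policy", 2021, "Refunds are issued within 30 days.", 0.95, 0.05),
        (3, "blog", 2024, "Our thoughts on refunds.", 0.80, 0.20),
        (4, "policy", 2024, "Shipping takes 3 to 5 days.", 0.10, 0.90),
    ],
)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def retrieve(query_vec, doc_type, min_year, k=1):
    """Filter on structured columns first, then rank the survivors by similarity."""
    candidates = conn.execute(
        "SELECT id, text, e0, e1 FROM chunks WHERE doc_type = ? AND year >= ?",
        (doc_type, min_year),
    ).fetchall()
    ranked = sorted(
        candidates, key=lambda r: cosine(query_vec, (r[2], r[3])), reverse=True
    )
    return [(r[0], r[1]) for r in ranked[:k]]


def similarity_only(query_vec, k=1):
    """Baseline: rank every chunk by similarity with no filtering."""
    rows = conn.execute("SELECT id, text, e0, e1 FROM chunks").fetchall()
    ranked = sorted(rows, key=lambda r: cosine(query_vec, (r[2], r[3])), reverse=True)
    return [(r[0], r[1]) for r in ranked[:k]]


query_vec = (1.0, 0.0)  # stand-in embedding for a question about refunds
filtered_top = retrieve(query_vec, doc_type="policy", min_year=2023)
unfiltered_top = similarity_only(query_vec)
print("filtered:  ", filtered_top)
print("unfiltered:", unfiltered_top)
```

In this toy data, similarity alone prefers the 2021 policy chunk because its vector sits slightly closer to the query, while the filtered query returns the current one. The ranking step is the same in both cases; only the candidate set changes.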

AI Wrote 80% in 10 Minutes. The Last 20% Took 6 Hours.

I shipped a feature on a Tuesday that took 11 minutes end-to-end. The agent generated the happy path, ran the tests, opened the PR. I clicked merge. Done before lunch.

The same agent shipped a feature on a Friday that took me 6 more hours after the agent finished. The happy path looked identical. The difference was the last 20%.

That gap is what this post is about.
