60 Percent of My API Calls Were Cached. I Turned It Off.

It is Tuesday afternoon. I am looking at my Grafana dashboard. The cache hit rate says 60 percent. Six out of ten API requests are being served from Redis, not the database.

By every metric I learned, this should be a win. Cache hits are fast. Database queries are slow. The math is simple.

But my p95 latency went up 40 milliseconds after I added caching.

Not down. Up.

I spent three days chasing this. I added more cache. I tuned TTLs. I pre-warmed the cache with likely queries. Nothing helped. The more I cached, the slower things got.

Then I found the bug. It wasn’t in the cache layer. It wasn’t in the database. It was in my assumptions about what caching actually does.

The cache tax I didn't measure

Every cached request has to do three things:

1. Check if the key exists in Redis
2. Deserialize the JSON blob
3. Validate the data is still fresh

On my API, step 1 takes 2 milliseconds. Step 2 takes 8 milliseconds. Step 3 takes 5 milliseconds. That’s 15 milliseconds of overhead before I even know if I have a cache hit.
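A minimal sketch of how per-step numbers like these could be collected. Everything here is a hypothetical stand-in: the `timed` helper, the `read_cached` function, the `cached_at` field, and the `FAKE_REDIS` dict that replaces a real Redis client so the example runs on its own.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Add the wall-clock time of the wrapped block to timings[stage], in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        timings[stage] = timings.get(stage, 0.0) + elapsed_ms


# Hypothetical stand-in for Redis so the sketch runs without a server.
FAKE_REDIS = {
    "api:/users/1": json.dumps({"id": 1, "name": "Ada", "cached_at": time.time()})
}
MAX_AGE_SECONDS = 300  # assumed freshness window, matching a 5-minute TTL


def read_cached(key, timings):
    """Return the cached payload if present and fresh, else None, timing each step."""
    with timed("lookup", timings):
        raw = FAKE_REDIS.get(key)
    if raw is None:
        return None
    with timed("deserialize", timings):
        payload = json.loads(raw)
    with timed("validate", timings):
        fresh = time.time() - payload["cached_at"] < MAX_AGE_SECONDS
    return payload if fresh else None


timings = {}
result = read_cached("api:/users/1", timings)
```

Against a real Redis instance the interesting part is comparing the sum of these stages with the time of the direct database query for the same endpoint.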

A direct database query on the same endpoint takes 12 milliseconds.

So a cache hit costs me 15ms. A cache miss costs me 15ms plus the 12ms database query — 27ms total.

At 60 percent hit rate, my average response time is:

0.6 × 15ms + 0.4 × 27ms = 9ms + 10.8ms = 19.8ms

Without caching, every request is 12ms.

I added caching to make things faster. I made them 65 percent slower.
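The same arithmetic as a small sketch, using only the numbers quoted above. The function names are mine, and the model is deliberately crude: one fixed cost for a hit, one for a miss. It also answers a question I should have asked first: what hit rate would I need just to break even?

```python
def expected_latency_ms(hit_rate, hit_cost_ms, miss_cost_ms):
    """Average latency when a fraction hit_rate of requests are cache hits."""
    return hit_rate * hit_cost_ms + (1.0 - hit_rate) * miss_cost_ms


def break_even_hit_rate(hit_cost_ms, miss_cost_ms, uncached_ms):
    """Hit rate at which the cached average equals the uncached latency."""
    return (miss_cost_ms - uncached_ms) / (miss_cost_ms - hit_cost_ms)


CACHE_OVERHEAD_MS = 2 + 8 + 5   # lookup + deserialize + validate
DB_QUERY_MS = 12
HIT_COST_MS = CACHE_OVERHEAD_MS
MISS_COST_MS = CACHE_OVERHEAD_MS + DB_QUERY_MS

cached_avg = expected_latency_ms(0.6, HIT_COST_MS, MISS_COST_MS)
slowdown_pct = (cached_avg / DB_QUERY_MS - 1) * 100
needed_hit_rate = break_even_hit_rate(HIT_COST_MS, MISS_COST_MS, DB_QUERY_MS)
```

With these costs `needed_hit_rate` comes out above 1.0. A hit (15ms) is already slower than an uncached query (12ms), so no hit rate could have saved this setup.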

The workload I optimized for

The caching guide I followed assumed read-heavy traffic with stable data. User profiles. Product catalogs. Configuration settings. Things that change once a day and get read a thousand times.

My API doesn’t look like that.

I checked the last 24 hours of query logs. Here’s what I actually have:

– 45 percent of endpoints are write-heavy (user actions, state changes, notifications)
– 30 percent are read-once (search results, filtered lists, temporal queries)
– 25 percent are stable reads (user settings, account data)

I built a caching layer for the 25 percent. But every request pays the cache tax, even the 75 percent that can’t benefit from it.

The cache hit rate looked good in aggregate. But it was hiding the fact that I was optimizing for the wrong workload.
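One way to stop the aggregate from hiding this is to break the hit rate down by endpoint category. A rough sketch follows. The `(category, was_hit)` record shape and the sample data are made up for illustration and are not my real logs.

```python
from collections import defaultdict


def hit_rate_by_category(records):
    """records: iterable of (category, was_hit) pairs. Returns {category: hit rate}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, was_hit in records:
        totals[category] += 1
        if was_hit:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


# Made-up sample: ten requests per category.
sample = (
    [("stable_read", True)] * 9 + [("stable_read", False)] * 1
    + [("read_once", True)] * 2 + [("read_once", False)] * 8
    + [("write_heavy", True)] * 1 + [("write_heavy", False)] * 9
)

per_category = hit_rate_by_category(sample)
overall = sum(1 for _, was_hit in sample if was_hit) / len(sample)
```

In this toy data the overall rate sits in the middle while one category carries almost all of the hits, which is the pattern a single dashboard number cannot show.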

The invalidation storm

Then there’s the problem of keeping the cache fresh.

My user profile endpoint has a 5-minute TTL. Seems reasonable. Profiles don’t change that often.

But I have 12 different services that can update a user profile: the main app, the mobile API, the admin panel, the billing service, the notification service, and seven background jobs I forgot about.

Every time one of these services writes to the profile, it needs to invalidate the cache key. I was doing this with a cache invalidation webhook pattern — each service sends a DELETE request to the cache layer when it updates data.

Except half the time, the webhook didn’t fire. The service would update the database, return success to the user, and forget to invalidate the cache.

So users would update their profile. See the changes immediately. Then refresh the page 30 seconds later and see the old data again.

The cache was serving stale data 50 percent of the time on write-heavy endpoints.
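The failure mode is easy to reproduce in miniature. This sketch uses two plain dicts as hypothetical stand-ins for the database and Redis; the `invalidate` flag plays the role of the webhook that sometimes never fired.

```python
db = {"user:42": {"name": "Old Name"}}   # stand-in for the database
cache = {}                               # stand-in for Redis


def read_profile(user_id):
    """Cache-first read: populate the cache from the database on a miss."""
    key = f"user:{user_id}"
    if key not in cache:
        cache[key] = dict(db[key])
    return cache[key]


def write_profile(user_id, profile, invalidate):
    """Write to the database; only drop the cache key if invalidate is True."""
    key = f"user:{user_id}"
    db[key] = dict(profile)
    if invalidate:
        cache.pop(key, None)


first = read_profile(42)["name"]                            # populates the cache
write_profile(42, {"name": "New Name"}, invalidate=False)   # the forgotten webhook
stale = read_profile(42)["name"]                            # still the old value
write_profile(42, {"name": "Newer Name"}, invalidate=True)
fresh = read_profile(42)["name"]                            # reads through again
```

Nothing in the write path fails when the invalidation is skipped, which is why the bug stayed invisible until users noticed it.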

What I tried before turning it off

I didn’t give up immediately. I tried the standard fixes:

Shorter TTLs. Dropped from 5 minutes to 30 seconds. This reduced stale data but increased cache misses so much that the average latency got worse.

Cache-aside pattern. Instead of writing through the cache, I’d invalidate on write and let reads repopulate. This helped with staleness but added complexity — now I had to handle cache misses gracefully and deal with thundering herd problems on popular keys.

Write-through caching. Every write updates both the database and the cache simultaneously. This eliminated staleness but doubled the write latency — now every write had to wait for Redis to confirm.

Pre-computed cache keys. I tried to predict what queries would be popular and pre-warm those keys. This worked for maybe 20 percent of traffic. The other 80 percent was too dynamic to predict.

Each fix solved one problem and created two new ones. The caching layer went from 200 lines of simple Redis get/set code to 800 lines of invalidation logic, fallback handling, and cache warming strategies.
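To give a sense of where the extra lines come from, here is a minimal sketch of cache-aside reads with thundering-herd protection. It is not the code I shipped: `SingleFlightCache` is a hypothetical in-process class, and a dict stands in for Redis. The idea is that concurrent misses on the same key wait on one lock so only the first one runs the loader.

```python
import asyncio


class SingleFlightCache:
    """Cache-aside with one loader call per key, even under concurrent misses."""

    def __init__(self):
        self._store = {}
        self._locks = {}

    async def get_or_load(self, key, loader):
        if key in self._store:
            return self._store[key]
        lock = self._locks.setdefault(key, asyncio.Lock())
        async with lock:
            # Re-check: another waiter may have filled the key while we queued.
            if key in self._store:
                return self._store[key]
            value = await loader()
            self._store[key] = value
            return value

    def invalidate(self, key):
        """Called on write so the next read repopulates the key."""
        self._store.pop(key, None)


async def demo():
    cache = SingleFlightCache()
    calls = 0

    async def slow_loader():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)   # simulated database query
        return {"id": 1}

    readers = (cache.get_or_load("user:1", slow_loader) for _ in range(50))
    results = await asyncio.gather(*readers)
    return calls, results


load_count, results = asyncio.run(demo())
```

Even this small version needs a second lookup inside the lock and a per-key lock table, and it only protects a single process. Doing the same across several API workers needs a distributed lock, which is more code again.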

The endpoints that actually benefit

After three days of debugging, I found the 25 percent of endpoints where caching made sense:

1. Static configuration data — feature flags, app settings, version info. These change maybe once a day and get read on every page load.

2. Aggregated analytics — dashboard charts, summary statistics. These are expensive to compute (multiple joins, large date ranges) but change infrequently.

3. External API responses — data I fetch from third parties with rate limits. Caching these saves both latency and API quota.

For these three categories, caching delivered latency improvements of at least 10x. A dashboard query that took 800ms dropped to 15ms with a 1-minute cache.

But these endpoints were only 25 percent of my traffic. The other 75 percent was paying the cache tax with no benefit.

What I did instead

I turned off the global caching layer. Then I added caching back in three specific places:

1. Application-level caching for static data. Feature flags and config data now get cached in memory at application startup. No Redis round-trip. No serialization overhead. Just a local variable lookup.

2. Query-level caching for expensive aggregations. Instead of caching at the API layer, I cache at the query layer. The aggregation query checks a cache key before running. If the key exists and is fresh, it returns the cached result. If not, it runs the query and stores the result. This is explicit, opt-in caching — I only cache queries I know are expensive and stable.

3. HTTP-level caching for external consumers. I added proper HTTP cache headers (ETag, Last-Modified, Cache-Control) so API consumers can cache responses on their end. This pushes the caching responsibility to the clients that actually benefit from it, instead of me trying to guess their use cases.
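For the query-level piece, an opt-in decorator might look roughly like this. It is a sketch under assumptions: `ttl_cache` is a hypothetical name, the store is an in-process dict rather than Redis, and the dashboard function returns a canned result instead of running a real aggregation.

```python
import asyncio
import functools
import time


def ttl_cache(expire, key_func):
    """Cache an async function's result per key for `expire` seconds."""
    def decorator(fn):
        store = {}  # key -> (expires_at, value)

        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            key = key_func(*args, **kwargs)
            now = time.monotonic()
            entry = store.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]            # fresh hit: skip the query
            value = await fn(*args, **kwargs)
            store[key] = (now + expire, value)
            return value

        return wrapper
    return decorator


query_runs = {"count": 0}  # counts how often the "expensive" query really runs


@ttl_cache(expire=60, key_func=lambda user_id: f"dashboard:{user_id}")
async def get_user_dashboard(user_id):
    query_runs["count"] += 1
    await asyncio.sleep(0)  # placeholder for the real aggregation query
    return {"user_id": user_id, "total_orders": 3}


async def demo():
    first = await get_user_dashboard(7)
    second = await get_user_dashboard(7)   # served from the cache
    other = await get_user_dashboard(8)    # different key, runs the query
    return first, second, other


first, second, other = asyncio.run(demo())
```

Only functions that carry the decorator pay any cache cost, which is the whole point of making it opt-in.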

The result: average latency dropped from 19.8ms back to 12ms. Code complexity dropped from 800 lines to 150 lines. And I still get caching benefits on the 25 percent of endpoints where it actually helps.

The lesson I should have learned earlier

Caching is not a performance optimization you add at the end. It’s a workload assumption you make at the beginning.

If your workload is read-heavy with stable data, caching is magic. If your workload is write-heavy or highly dynamic, caching is a tax you pay on every request.

The cache hit rate metric is dangerous. A 60 percent hit rate sounds good until you realize you’re optimizing for the wrong 60 percent.

Now when someone asks me about API caching, I ask three questions first:

1. What percentage of your endpoints are read-heavy?
2. How often does the underlying data change?
3. What’s the actual latency of an uncached request?

If the answers are “less than 50 percent”, “more than once per minute”, and “under 50 milliseconds” — don’t add caching. Optimize your database queries instead. Add an index. Fix your N+1 queries. Reduce your payload size.

Those optimizations help every request. Caching only helps some.
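The three-question check fits in a few lines. This is a sketch of the rule of thumb exactly as stated, with a hypothetical function name. The change rate in the example call is an assumed value, since I only quoted my read share and uncached latency above.

```python
def should_skip_caching(read_heavy_pct, changes_per_minute, uncached_latency_ms):
    """True when all three answers point away from adding a cache."""
    return (
        read_heavy_pct < 50
        and changes_per_minute > 1
        and uncached_latency_ms < 50
    )


# My API: 25 percent stable reads, 12ms uncached. The change rate is assumed.
my_api = should_skip_caching(
    read_heavy_pct=25, changes_per_minute=2, uncached_latency_ms=12
)

# A hypothetical catalog service: mostly reads, rare changes, slow queries.
catalog = should_skip_caching(
    read_heavy_pct=90, changes_per_minute=0.001, uncached_latency_ms=400
)
```

It is a heuristic, not a proof. When the answers are mixed, measure before deciding.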

The config changes: what I actually removed

Here’s the Redis caching code I deleted:

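# Before: global caching middleware
# (FastAPI/Starlette app; `redis` is an async Redis client)
from fastapi import Response

@app.middleware("http")
async def cache_middleware(request, call_next):
    cache_key = f"api:{request.url.path}:{request.query_params}"

    # Check cache on every request, writes included (15ms overhead)
    cached = await redis.get(cache_key)
    if cached is not None:
        # Middleware must return a Response object, not a parsed dict
        return Response(content=cached, media_type="application/json")

    # Process request
    response = await call_next(request)

    # call_next returns a streaming response, so collect the body first
    body = b"".join([chunk async for chunk in response.body_iterator])

    # Store in cache for 5 minutes
    await redis.setex(cache_key, 300, body)
    return Response(
        content=body,
        status_code=response.status_code,
        headers=dict(response.headers),
    )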

Here’s what replaced it:

# After: explicit opt-in caching for specific queries
@cache(expire=60, key_func=lambda user_id: f"dashboard:{user_id}")
async def get_user_dashboard(user_id):
    # Expensive aggregation query
    return await db.execute(DASHBOARD_QUERY, {"user_id": user_id})

# In-memory config cache (no Redis)
CONFIG_CACHE = {}

async def load_config():
    global CONFIG_CACHE
    CONFIG_CACHE = await db.execute("SELECT * FROM config")

# HTTP cache headers for external consumers
@app.get("/api/users/{id}")
async def get_user(id: int):
    user = await db.get_user(id)
    return JSONResponse(
        content=user,
        headers={
            "ETag": f'"{user.updated_at}"',
            "Cache-Control": "max-age=60, stale-while-revalidate=300"
        }
    )

Across the whole codebase, the replacement is about 80 percent less code (800 lines down to 150). It’s also about 40 percent faster on average (19.8ms down to 12ms) because most requests skip the cache layer entirely.
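One caveat on the HTTP headers: an ETag only saves work if the server also answers conditional requests. Below is a framework-free sketch of that check. `make_etag` and `conditional_response` are hypothetical helpers, header names are assumed to be already normalized, and the comparison is simplified to a single exact match (real `If-None-Match` values can be lists, `*`, or weak validators).

```python
def make_etag(updated_at):
    """Quote a version marker (here a timestamp string) as an ETag value."""
    return f'"{updated_at}"'


def conditional_response(request_headers, etag, body):
    """Return (status, headers, body), answering 304 when the client's copy matches."""
    headers = {
        "ETag": etag,
        "Cache-Control": "max-age=60, stale-while-revalidate=300",
    }
    if request_headers.get("If-None-Match") == etag:
        return 304, headers, b""     # client copy is current: send no body
    return 200, headers, body


etag = make_etag("2024-01-01T00:00:00Z")
payload = b'{"id": 1}'

# First request: the client has nothing cached yet.
status_first, headers_first, body_first = conditional_response({}, etag, payload)

# Revalidation: the client sends back the ETag it received.
status_second, _, body_second = conditional_response(
    {"If-None-Match": headers_first["ETag"]}, etag, payload
)
```

The 304 path still has to know the current version marker, so it pays off most when that marker is cheaper to fetch than the full payload.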

When I would add caching again

I’m not anti-caching. I’m anti-blind-caching.

I would add a global caching layer again if:

– My read/write ratio was 10:1 or higher
– My data changed less than once per hour
– My uncached latency was over 100ms

None of those were true for my API. So caching was the wrong tool.

The hardest part of this fix was admitting that the 60 percent cache hit rate I was proud of was actually a warning sign, not a success metric. It told me that 40 percent of requests were paying the cache tax with zero benefit.

Now my dashboards show p95 latency, not cache hit rate. And my API is faster for it.

