My homelab AI agent setup was costing $42/month in API calls alone — until I switched to local quantized models.
That number came straight from my OpenRouter billing dashboard for April 2026: 350,000 tokens across Claude 3 Haiku and Mistral Small, mostly from my personal agent that checks GitHub notifications, drafts tweets, and summarizes my daily reading. At $0.00012 per 1K tokens for Haiku and $0.00006 for Mistral, the math added up fast. I’d told myself local LLMs weren’t ready for prime time — too slow, too finicky, too much VRAM — until I hit a psychological wall: paying for something I could run myself felt like renting a bicycle when I owned a garage full of parts.
I decided to fix it. Over three weekends, I rebuilt my agent pipeline around Ollama, quantized Llama 3 models, and deliberate GPU time-slicing. The result? My monthly LLM API spend dropped to $0. My agent still handles the same tasks: sometimes faster, sometimes slower, but consistently useful. Along the way I learned concrete lessons about optimizing AI agent performance with structured prompts, quantization tradeoffs, containerized GPU sharing, and why “local first” doesn’t mean “local only.”
Here’s exactly what I did, what I measured, and what broke along the way.