I Stopped Self-Hosting AI: Why DeepSeek V4 Pro on Ollama Cloud Is My New Default
The most-said line in my group chats this week was three words: “I miss Fable.”
Not in a nostalgic way. In a “my entire workflow is broken” way.
Fable was the model I used for first-draft generation. Fast, cheap, good enough for 80 percent of the work. Then it vanished. No deprecation warning. No migration path. Just gone.
My first reaction was what a lot of people are doing now: go local. Buy a GPU, run llama.cpp, never depend on a vendor again. I spent $1,400 on a used RTX 4090. I downloaded 150GB of model weights. I learned to love the sound of my fans spinning at 80 percent.
For one month, self-hosting worked. Then the novelty wore off.
The 4090 draws 450W under load. My electricity bill went up $35. The 70B models I was running maxed out at 32K context — not enough for full codebase reviews. Batch processing hundreds of documents meant queuing jobs overnight. And when Opus 4.8 dropped with significantly better reasoning, I had no way to access it without going back to cloud anyway.
I was renting infrastructure, not avoiding vendors. The landlord just changed from Anthropic to NVIDIA.
Then I tried DeepSeek V4 Pro on Ollama Cloud. The pricing made me reconsider everything.
The Math That Changed My Mind (Again)
I ran the numbers for my actual usage pattern. Not theoretical, not worst-case — what I actually do day to day.
Self-hosted workflow (my 4090 setup): – Hardware: $1,400 one-time (RTX 4090 24GB) – Electricity: ~$35/month (4 hours/day inference at 450W) – Amortized over 3 years: ~$40/month – Total monthly cost: ~$75 – Context limit: 32K (hard cap at Q4 quantization) – Speed: 15-20 tokens/second on 70B models – Availability: 100 percent (it’s my hardware) – Model updates: Manual (I download new weights)
DeepSeek V4 Pro via Ollama Cloud: – Pricing: $0.27 per 1M input tokens, $1.10 per 1M output tokens – My average usage: 800K input + 200K output per day – Monthly cost: ~$130 – Context limit: 128K (native, no quantization) – Speed: 80-120 tokens/second (varies by load) – Availability: Depends on Ollama Cloud uptime – Model updates: Automatic (new versions roll out seamlessly)
The difference is $55 per month.
For that $55, I get: – 4x the context window (128K vs 32K) – 5-6x the inference speed – No hardware maintenance – No electricity spike – No 4-hour setup time for new models – Access to a model that benchmarks higher than my local 70B runs
The sovereignty argument still holds — I could self-host if I needed to. But for daily work, the cloud option is objectively better for my use case.
Why DeepSeek V4 Pro Specifically
I did not just pick the first cloud model I found. I tested five options against my actual workload:
| Model | Provider | Cost/Month | Context | Speed | Quality (my tasks) |
|---|---|---|---|---|---|
| Llama 3.1 70B | Local (4090) | $75 | 32K | 18 t/s | Baseline |
| Qwen 2.5 72B | Local (4090) | $75 | 32K | 14 t/s | Slightly better structure |
| DeepSeek V4 Pro | Ollama Cloud | $130 | 128K | 95 t/s | Best overall |
| Opus 4.8 | Anthropic | $380 | 200K | 110 t/s | Marginally better than DeepSeek |
| Gemini 3.5 | $290 | 1M | 100 t/s | Best for multimodal |
DeepSeek V4 Pro won on price-to-performance. It is not the absolute best model — Opus 4.8 edges it out on complex reasoning by about 10-15 percent in my testing. But Opus costs 3x more. For drafting, revision, code generation, and technical explanation (90 percent of my work), DeepSeek is indistinguishable from Opus.
The other factor: Ollama Cloud as a provider. They have been around long enough that I trust them not to vanish overnight like Fable did. They offer multiple models, so if DeepSeek V4 Pro gets deprecated, I can switch to Llama 3.2 or Qwen 3.0 without changing my workflow. That is the kind of redundancy I did not have with Fable.
The Workflow I Built After (Version 2)
My current setup is hybrid. Cloud-first for daily work, local as fallback, with clear rules for when to use each.
Primary (DeepSeek V4 Pro via Ollama Cloud): – First-draft generation – Code review and refactoring suggestions – Technical explanation and documentation – Email drafting and revision – Anything under 100K tokens
Fallback (Local 70B models): – When Ollama Cloud has an outage – When I need to process sensitive data that should not leave my machine – When I am traveling with poor internet – Quick classification tasks (Phi-3 Mini still lives on my machine)
Specialist (Other cloud providers): – Opus 4.8: High-stakes client work where the 10-15 percent quality edge matters – Gemini 3.5: Multimodal tasks (image + text together) – Groq Cloud: When I need 500+ tokens/second for batch processing
The key difference from my Fable days: no single point of failure. I have three independent providers (Ollama Cloud, Anthropic, Google) plus local inference. If any one of them disappears, I lose convenience but not capability.
Where This Setup Breaks (And How I Handle It)
Cloud-first is not strictly better. It has four failure modes I have already hit.
1. Provider outage. Ollama Cloud went down for 3 hours last Tuesday. I had a deadline. I switched to local inference and lost 6x speed, but the work got done.
My workaround: keep local models warm. I run a weekly health check — ollama run llama3.1:70b "test" — to make sure my fallback is ready. If Ollama Cloud is down for more than 2 hours, I pivot to local automatically.
2. Rate limits. DeepSeek V4 Pro has rate limits on Ollama Cloud. I hit them twice during batch processing runs (100+ documents in one session).
My workaround: queue management. I split large jobs into 20-document batches with 5-minute gaps. Slower, but avoids hitting the limit. For urgent bulk work, I switch to Groq Cloud (higher rate limits, different model).
3. Price changes. Ollama Cloud could raise prices tomorrow. DeepSeek V4 Pro could become paywalled. I have no control over this.
My workaround: the $55/month delta is my buffer. If prices go up by less than $55, cloud is still worth it for the speed and context gains. If they go up more, I revert to local full-time. I have the hardware — it is not sunk cost, it is insurance.
4. Model deprecation. DeepSeek V4 Pro could get sunset like Fable did. Ollama Cloud could drop it from their catalog.
My workaround: prompt portability. My system prompts are provider-agnostic. I do not use DeepSeek-specific features. If I need to switch to Llama 3.2 or Qwen 3.0, I change one environment variable and the workflow continues. I tested this when switching from Fable — it took 20 minutes to retarget, not 6 hours.
The Real Lesson (It Is Not About Cloud vs Local)
After Fable, I thought the answer was sovereignty. Own the hardware, own the weights, own the workflow.
After a month of self-hosting, I think the answer is something else: redundancy.
It does not matter if you run local or cloud. What matters is whether you have a working fallback when your primary breaks.
My 4090 is not a replacement for Ollama Cloud. It is insurance against Ollama Cloud failing. Ollama Cloud is not a replacement for my 4090. It is a better tool for 90 percent of my daily work.
The mistake I made with Fable was building my entire workflow around a single provider with no exit strategy. I am not making that mistake again.
The Three Rules I Use Now
Before I adopt any AI tool — cloud or local — I ask three questions:
1. What is my fallback if this disappears tomorrow? If the answer is “nothing,” I do not use it for core workflow. Fable failed this test. Ollama Cloud passes because I have local inference ready.
2. How much would it cost to switch? If retargeting my prompts would take more than an hour, the tool is too proprietary. DeepSeek V4 Pro passes — my prompts are provider-agnostic. Fable failed — I had Fable-specific tuning baked into dozens of prompts.
3. Am I paying for convenience or capability? Cloud costs more but gives me speed and context. Local costs less but gives me sovereignty and uptime. Both are valid. What matters is knowing which one I am buying and why.
These rules have slowed down my adoption of new tools. I skipped two hyped launches this month because they failed rule 1. But my workflow has not broken once since I implemented them.
The Takeaway
Fable dying taught me that vendor dependency is risky. Self-hosting taught me that sovereignty has a cost — not just money, but speed, context, and maintenance overhead.
DeepSeek V4 Pro on Ollama Cloud is not the perfect answer. It is the best answer for my current constraints: $130/month budget, 128K context needs, 90 percent drafting/revision/code work.
Your constraints might be different. Maybe you handle sensitive data and cannot use cloud. Maybe you need 1M token context and must use Gemini. Maybe you cannot afford $130/month and the 4090 math does not work.
The point is not to copy my setup. The point is to have a setup — with explicit fallbacks, known switch costs, and a clear understanding of what you are paying for.
Fable broke my workflow because I did not have any of those things. I do now.
For more on the cost side of self-hosting vs cloud, I documented my homelab AI cost breakdown here. And if you are building AI workflows, understanding RAG retrieval as filtering, not search is worth a read.
And if Ollama Cloud disappears tomorrow? I will switch to local inference within an hour. The work will continue. That is the only metric that matters.
Discover more from Susiloharjo
Subscribe to get the latest posts sent to your email.