I really wanted self-hosting to work. The idea of running DeepSeek on my own GPUs sounded amazing: no API bills, no rate limits, complete control. So I rented two A100s and tried it for a month. Here's the real math.
API vs Self-Host: The Numbers
| Volume (tokens/day) | API Cost (V4 Flash) | Self-Host GPU Cost | Winner |
|---|---|---|---|
| 100K | $0.75/mo | $400/mo | API by 533x |
| 1M | $7.50/mo | $500/mo | API by 67x |
| 10M | $75/mo | $800/mo | API by 11x |
| 100M | $750/mo | $1,500/mo | API by 2x |
| 1B | $7,500/mo | $5,000/mo | Self-host by 1.5x |
| 10B | $75,000/mo | $12,000/mo | Self-host by 6x |
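To make the table reproducible, here is a minimal sketch of the arithmetic behind it. It assumes a 30-day month and a flat blended price of $0.25 per million tokens. That price is not quoted anywhere in this post; it is simply what the API column implies (100K tokens/day is 3M tokens/month for $0.75). The self-host figures are copied from the table as-is.

```python
# Assumptions (not stated in the post): a 30-day month and a flat blended
# API price of $0.25 per million tokens, inferred from the table's API column.
PRICE_PER_MILLION_TOKENS = 0.25
DAYS_PER_MONTH = 30

# Self-host GPU cost in $/month, copied from the table (tokens/day -> cost).
SELF_HOST_MONTHLY = {
    100_000: 400,
    1_000_000: 500,
    10_000_000: 800,
    100_000_000: 1_500,
    1_000_000_000: 5_000,
    10_000_000_000: 12_000,
}


def api_monthly_cost(tokens_per_day: int) -> float:
    """Monthly API bill in dollars for a steady daily token volume."""
    return tokens_per_day * DAYS_PER_MONTH / 1_000_000 * PRICE_PER_MILLION_TOKENS


def compare(tokens_per_day: int) -> tuple[str, float]:
    """Return the cheaper option and how many times cheaper it is."""
    api = api_monthly_cost(tokens_per_day)
    gpu = SELF_HOST_MONTHLY[tokens_per_day]
    if api <= gpu:
        return "API", gpu / api
    return "Self-host", api / gpu


for volume in SELF_HOST_MONTHLY:
    winner, ratio = compare(volume)
    print(f"{volume:>14,} tokens/day: ${api_monthly_cost(volume):>9,.2f} API "
          f"vs ${SELF_HOST_MONTHLY[volume]:>6,} GPU -> {winner} by {ratio:.1f}x")
```

Swap in your own price and GPU quotes; the ratios move quickly once either number changes.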
The Hidden Costs of Self-Hosting
GPU idle time: your model isn't serving requests 24/7. At my volume, the GPU was idle 80% of the time, but I was paying for 100%.
Engineering time: I spent roughly 40 hours setting up vLLM, Nginx, monitoring, and autoscaling. At a developer rate of $100/hour, that's $4,000 in setup costs.
Maintenance: CUDA updates, security patches, model version upgrades. Budget 5-10 hours/month.
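The three hidden costs above can be folded into rough monthly numbers. This is a sketch, not a full cost model: the hours and the $100/hour rate come from the figures above, while spreading the setup cost over 12 months is my own assumption, as are the function names.

```python
HOURLY_RATE = 100            # developer rate, $/hour (from the post)
SETUP_HOURS = 40             # one-time setup effort (from the post)
MAINTENANCE_HOURS = (5, 10)  # monthly maintenance range (from the post)
AMORTIZATION_MONTHS = 12     # assumption: spread setup cost over one year


def monthly_overhead(maintenance_hours: float) -> float:
    """Engineering overhead in $/month: amortized setup plus maintenance."""
    amortized_setup = SETUP_HOURS * HOURLY_RATE / AMORTIZATION_MONTHS
    return amortized_setup + maintenance_hours * HOURLY_RATE


def idle_multiplier(idle_fraction: float) -> float:
    """How much more each busy GPU hour costs once idle time is paid for."""
    if not 0.0 <= idle_fraction < 1.0:
        raise ValueError("idle_fraction must be in [0, 1)")
    return 1.0 / (1.0 - idle_fraction)


low, high = (monthly_overhead(h) for h in MAINTENANCE_HOURS)
print(f"Engineering overhead: ${low:,.0f} to ${high:,.0f} per month")
print(f"Busy-hour cost multiplier at 80% idle: {idle_multiplier(0.8):.1f}x")
```

Under these assumptions the engineering time alone adds roughly $833 to $1,333 per month on top of the GPU bill, and an 80% idle GPU makes every useful hour about five times as expensive as its sticker price.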
When APIs Win
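Until you approach 1 billion tokens per day (roughly 100K daily active users on a chat-heavy app, or about 10K tokens per user per day), the API is cheaper. In the table the crossover falls somewhere between the 100M and 1B rows on GPU cost alone, and the hidden costs above push it higher. The code is simpler too: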
# API: 3 lines
from openai import OpenAI
client = OpenAI(api_key="ga_...", base_url="https://global-apis.com/v1")
resp = client.chat.completions.create(model="deepseek-ai/DeepSeek-V4-Flash", messages=[...])
# Done. No GPU management, no scaling, no idle time.
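For a rough idea of where the crossover sits, here is a sketch that interpolates between the 100M and 1B rows of the table. Linear interpolation of the self-host cost is an assumption on my part, since real GPU pricing moves in steps, and `extra_monthly` is a hypothetical parameter for whatever engineering overhead you want to include.

```python
# API cost per 1M tokens/day, in $/month (every table row scales at this rate).
API_COST_PER_MILLION_DAILY = 7.50

# Two table rows: (million tokens/day, self-host $/month).
LOW = (100.0, 1_500.0)
HIGH = (1_000.0, 5_000.0)


def self_host_cost(volume_m: float) -> float:
    """Self-host $/month, linearly interpolated between LOW and HIGH (assumption)."""
    (v0, c0), (v1, c1) = LOW, HIGH
    return c0 + (c1 - c0) * (volume_m - v0) / (v1 - v0)


def break_even(extra_monthly: float = 0.0) -> float:
    """Volume in million tokens/day where API cost equals self-host cost plus overhead."""
    (v0, c0), (v1, c1) = LOW, HIGH
    slope = (c1 - c0) / (v1 - v0)
    # Solve 7.5 * v = c0 + slope * (v - v0) + extra_monthly for v.
    return (c0 - slope * v0 + extra_monthly) / (API_COST_PER_MILLION_DAILY - slope)


for overhead in (0.0, 1_000.0):
    print(f"Overhead ${overhead:,.0f}/mo -> break-even near "
          f"{break_even(overhead):,.0f}M tokens/day")
```

The result is only meaningful while it stays between the two rows it interpolates, and it is only as good as the GPU quotes fed into it.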
More details in my full comparison. All API access via Global API.