<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Inference Lab]]></title><description><![CDATA[Field notes on LLM inference and agentic AI. Runnable code in every post.]]></description><link>https://letters.rohankataria.com</link><image><url>https://substackcdn.com/image/fetch/$s_!evmN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9b0053e-e450-4f32-98ec-68b6241459e6_400x400.png</url><title>The Inference Lab</title><link>https://letters.rohankataria.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 16:34:54 GMT</lastBuildDate><atom:link href="https://letters.rohankataria.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Inference Lab]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[runbook@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[runbook@substack.com]]></itunes:email><itunes:name><![CDATA[Rohan Kataria]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rohan Kataria]]></itunes:author><googleplay:owner><![CDATA[runbook@substack.com]]></googleplay:owner><googleplay:email><![CDATA[runbook@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rohan Kataria]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[H100s Are a Waste of Money (Until You Fix This One Bottleneck)]]></title><description><![CDATA[Most teams don&#8217;t have an H100 problem.]]></description><link>https://letters.rohankataria.com/p/h100s-are-a-waste-of-money-until</link><guid isPermaLink="false">https://letters.rohankataria.com/p/h100s-are-a-waste-of-money-until</guid><dc:creator><![CDATA[Rohan Kataria]]></dc:creator><pubDate>Fri, 12 Dec 2025 13:28:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!evmN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9b0053e-e450-4f32-98ec-68b6241459e6_400x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most teams don&#8217;t have an H100 problem. They have a scheduling and scaling problem.</p><p>This week, we ran an experiment on a vLLM cluster with KEDA autoscaling. The setup looked fine on paper: H100 GPU instances, proper pod QoS, KEDA configured with target CPU at 70%. But something was eating the margin.</p>
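<p>For orientation, &#8220;fine on paper&#8221; looks roughly like the KEDA ScaledObject below: scale the vLLM deployment on CPU utilization with a 70% target and the default 300s cooldown. This is a sketch, not the spec from this cluster; the resource names and the replica ceiling are placeholder assumptions.</p><pre><code># Rough sketch of the starting point; names and the max replica count are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-cpu-scaler        # hypothetical name
  namespace: inference         # hypothetical namespace
spec:
  scaleTargetRef:
    name: vllm                 # assumed vLLM Deployment name
  minReplicaCount: 8           # one vLLM replica per node, as described below
  maxReplicaCount: 16          # assumed ceiling; not stated in the post
  cooldownPeriod: 300          # KEDA default
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"            # the 70% CPU target described above
</code></pre><p>Nothing in a spec like this looks obviously wrong, which is why the bottleneck took tracing to find.</p>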
<h3>The Infrastructure Question</h3><p>Can KEDA keep DeepSeek from thrashing H100s when traffic spikes?</p><h3>The Environment</h3><ul><li>vLLM on AKS with 8 H100 nodes</li><li>KEDA-managed HPA scaling pods on CPU utilization</li><li>DeepSeek-7B inference, batch size 8</li><li>Baseline cost: $890/day</li></ul><h3>What Actually Happened</h3><p>After 90 minutes of steady load (500 req/min), utilization flatlined at 82% and the autoscaler barely reacted. Meanwhile, tail latency exploded from 120ms to 950ms.</p><p>Looking at the traces:</p><ul><li>Pod CPU was spiking to 85%, triggering scale-up</li><li>But KEDA&#8217;s cooldown window (300s) was delaying the scale-up response</li><li>By the time new pods came online, the queue had already built up</li><li>Those new pods came up cold, taking traffic before they were warmed up</li></ul><h3>The Bottleneck</h3><p>KEDA&#8217;s default cooldown was masking a deeper issue: our initial pod density was far too low. We had 8 vLLM replicas across 8 nodes, which meant zero redundancy during scale-up events.</p><h3>The Fix (3 Changes)</h3><ol><li>Reduced the KEDA cooldown from 300s to 45s. This keeps the feedback loop tight when scaling up.</li><li>Changed the scaling trigger from CPU utilization to a custom metric: tokens per second per pod, targeting 12,500 tokens/s. CPU lies. Throughput doesn&#8217;t.</li><li>Increased the minimum replica count from 8 to 12, giving headroom before KEDA even kicks in.</li></ol><h3>The Results</h3><p>Same traffic, after these changes:</p><ul><li>Tail latency: 950ms &#8594; 140ms</li><li>Node count stayed the same (8)</li><li>Cost dropped from $890/day to $867/day (less thrashing)</li><li>KEDA scale events dropped from 6 to 2 per 10 minutes</li></ul><h3>The Real Lesson</h3><p>H100s aren&#8217;t the problem. Bad scaling logic is. You can own the fastest GPU on earth and still trash your latency by letting KEDA guess about capacity.</p><h3>Inside the Gated Section</h3><p>If you want the exact KEDA spec, YAML, and Grafana queries used in this test, they&#8217;re locked below. We also included:</p><ul><li>The HPA scaling thresholds that worked</li><li>The pod QoS settings that prevented node-level contention</li><li>The custom metrics setup (KEDA + Prometheus)</li><li>The mistakes we made (and the cost of each)</li></ul><p>Steal this config. Adapt it. Break it in your own lab. Reply with your traces if you try it; future experiments will incorporate what works.</p>
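<p>Until you grab the gated version, here is a minimal sketch of the direction the fix takes, expressed as a KEDA ScaledObject with a Prometheus trigger. The deployment name, namespace, Prometheus address, polling interval, replica ceiling, and vLLM metric name are illustrative assumptions, not the spec from this cluster; only the 45s cooldown, the 12-replica floor, and the 12,500 tokens/s target mirror the three changes above.</p><pre><code># Hedged sketch only: adapt names, the Prometheus address, and the metric to your stack.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-throughput-scaler    # hypothetical name
  namespace: inference            # hypothetical namespace
spec:
  scaleTargetRef:
    name: vllm                    # assumed vLLM Deployment name
  minReplicaCount: 12             # headroom before scaling even starts (change 3)
  maxReplicaCount: 24             # assumed ceiling; not stated in the post
  cooldownPeriod: 45              # tighter feedback loop than the 300s default (change 1)
  pollingInterval: 15             # assumed; KEDA defaults to 30s
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
        # Total generation throughput across the fleet; verify the metric name
        # your vLLM version actually exposes.
        query: sum(rate(vllm:generation_tokens_total[1m]))
        threshold: "12500"        # with the default AverageValue semantics this
                                  # works out to ~12,500 tokens/s per pod (change 2)
</code></pre><p>The useful property of the Prometheus trigger is that KEDA exposes the query result as an external metric and the HPA sizes the deployment to roughly the query value divided by the threshold, so the threshold reads as a per-pod throughput budget rather than a CPU percentage that says nothing about token generation.</p>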
]]></content:encoded></item></channel></rss>