<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Inference Lab]]></title><description><![CDATA[Field notes on LLM inference and agentic AI. Runnable code in every post.]]></description><link>https://letters.rohankataria.com</link><image><url>https://substackcdn.com/image/fetch/$s_!evmN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9b0053e-e450-4f32-98ec-68b6241459e6_400x400.png</url><title>The Inference Lab</title><link>https://letters.rohankataria.com</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 16:34:54 GMT</lastBuildDate><atom:link href="https://letters.rohankataria.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[The Inference Lab]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[runbook@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[runbook@substack.com]]></itunes:email><itunes:name><![CDATA[Rohan Kataria]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rohan Kataria]]></itunes:author><googleplay:owner><![CDATA[runbook@substack.com]]></googleplay:owner><googleplay:email><![CDATA[runbook@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rohan Kataria]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[H100s Are a Waste of Money (Until You Fix This One Bottleneck)]]></title><description><![CDATA[Most teams don&#8217;t have an H100 problem.]]></description><link>https://letters.rohankataria.com/p/h100s-are-a-waste-of-money-until</link><guid isPermaLink="false">https://letters.rohankataria.com/p/h100s-are-a-waste-of-money-until</guid><dc:creator><![CDATA[Rohan Kataria]]></dc:creator><pubDate>Fri, 12 Dec 2025 13:28:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!evmN!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9b0053e-e450-4f32-98ec-68b6241459e6_400x400.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most teams don&#8217;t have an H100 problem. They have a scheduling and scaling problem.</p><p>This week, we ran an experiment on a vLLM cluster with KEDA autoscaling. The setup looked fine on paper: H100 GPU instances, proper pod QoS, KEDA configured with target CPU at 70%. But something was eating the margin.</p>
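<p>For orientation, &#8220;fine on paper&#8221; looks roughly like the KEDA ScaledObject below: scale the vLLM deployment on CPU utilization with a 70% target and the default 300s cooldown. This is a sketch, not the spec from this cluster; the resource names and the replica ceiling are placeholder assumptions.</p><pre><code># Rough sketch of the starting point; names and the max replica count are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-cpu-scaler        # hypothetical name
  namespace: inference         # hypothetical namespace
spec:
  scaleTargetRef:
    name: vllm                 # assumed vLLM Deployment name
  minReplicaCount: 8           # one vLLM replica per node, as described below
  maxReplicaCount: 16          # assumed ceiling; not stated in the post
  cooldownPeriod: 300          # KEDA default
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"            # the 70% CPU target described above
</code></pre><p>Nothing in a spec like this looks obviously wrong, which is why the bottleneck took tracing to find.</p>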
<h3>The Infrastructure Question</h3><p>Can KEDA keep DeepSeek from thrashing H100s when traffic spikes?</p><h3>The Environment</h3><ul><li>vLLM on AKS with 8 H100 nodes</li><li>KEDA-managed HPA scaling pods on CPU utilization</li><li>DeepSeek-7B inference, batch size 8</li><li>Baseline cost: $890/day</li></ul><h3>What Actually Happened</h3><p>After 90 minutes of steady load (500 req/min), utilization flatlined at 82% and the autoscaler barely reacted. Meanwhile, tail latency exploded from 120ms to 950ms.</p><p>Looking at the traces:</p><ul><li>Pod CPU was spiking to 85%, triggering scale-up</li><li>But KEDA&#8217;s cooldown window (300s) was delaying the scale-up response</li><li>By the time new pods came online, the queue had already built up</li><li>Those new pods came up cold, taking traffic before they were warmed up</li></ul><h3>The Bottleneck</h3><p>KEDA&#8217;s default cooldown was masking a deeper issue: our initial pod density was far too low. We had 8 vLLM replicas across 8 nodes, which meant zero redundancy during scale-up events.</p><h3>The Fix (3 Changes)</h3><ol><li>Reduced the KEDA cooldown from 300s to 45s. This keeps the feedback loop tight when scaling up.</li><li>Changed the scaling trigger from CPU utilization to a custom metric: tokens per second per pod, targeting 12,500 tokens/s. CPU lies. Throughput doesn&#8217;t.</li><li>Increased the minimum replica count from 8 to 12, giving headroom before KEDA even kicks in.</li></ol><h3>The Results</h3><p>Same traffic, after these changes:</p><ul><li>Tail latency: 950ms &#8594; 140ms</li><li>Node count stayed the same (8)</li><li>Cost dropped from $890/day to $867/day (less thrashing)</li><li>KEDA scale events dropped from 6 to 2 per 10 minutes</li></ul><h3>The Real Lesson</h3><p>H100s aren&#8217;t the problem. Bad scaling logic is. You can own the fastest GPU on earth and still trash your latency by letting KEDA guess about capacity.</p><h3>Inside the Gated Section</h3><p>If you want the exact KEDA spec, YAML, and Grafana queries used in this test, they&#8217;re locked below. We also included:</p><ul><li>The HPA scaling thresholds that worked</li><li>The pod QoS settings that prevented node-level contention</li><li>The custom metrics setup (KEDA + Prometheus)</li><li>The mistakes we made (and the cost of each)</li></ul><p>Steal this config. Adapt it. Break it in your own lab. Reply with your traces if you try it; future experiments will incorporate what works.</p>
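<p>Until you grab the gated version, here is a minimal sketch of the direction the fix takes, expressed as a KEDA ScaledObject with a Prometheus trigger. The deployment name, namespace, Prometheus address, polling interval, replica ceiling, and vLLM metric name are illustrative assumptions, not the spec from this cluster; only the 45s cooldown, the 12-replica floor, and the 12,500 tokens/s target mirror the three changes above.</p><pre><code># Hedged sketch only: adapt names, the Prometheus address, and the metric to your stack.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-throughput-scaler    # hypothetical name
  namespace: inference            # hypothetical namespace
spec:
  scaleTargetRef:
    name: vllm                    # assumed vLLM Deployment name
  minReplicaCount: 12             # headroom before scaling even starts (change 3)
  maxReplicaCount: 24             # assumed ceiling; not stated in the post
  cooldownPeriod: 45              # tighter feedback loop than the 300s default (change 1)
  pollingInterval: 15             # assumed; KEDA defaults to 30s
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
        # Total generation throughput across the fleet; verify the metric name
        # your vLLM version actually exposes.
        query: sum(rate(vllm:generation_tokens_total[1m]))
        threshold: "12500"        # with the default AverageValue semantics this
                                  # works out to ~12,500 tokens/s per pod (change 2)
</code></pre><p>The useful property of the Prometheus trigger is that KEDA exposes the query result as an external metric and the HPA sizes the deployment to roughly the query value divided by the threshold, so the threshold reads as a per-pod throughput budget rather than a CPU percentage that says nothing about token generation.</p>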
]]></content:encoded></item></channel></rss>