I run LLM inference experiments and write down what I find.

Not benchmarks. Not “intro to transformers.” Actual field notes from deploying models in environments that weren’t designed for them—air-gapped clusters, on-prem GPUs, systems that can’t reach the internet.

Most AI deployment content assumes you’re on AWS with root access and a generous cloud bill. I write for the other case.

What shows up here:

Reproducible experiments with real configs. Numbers from actually running things, not estimates from a README. Failure modes I hit in the wild and spent too long debugging. Every post has code you can run on your own cluster.

Topics I keep coming back to: KV cache behavior at scale, vLLM and SGLang on non-standard hardware, speculative decoding tradeoffs, multi-agent coordination overhead, what happens to inference latency when your network isn’t reliable.
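As a taste of the back-of-the-envelope sizing those KV-cache posts involve, here is a minimal sketch of the standard KV cache memory formula. The model dimensions below are illustrative assumptions (roughly Llama-2-7B-shaped), not numbers from any specific post:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """Rough KV cache footprint: one key and one value vector stored
    per token, per layer, per KV head. dtype_bytes=2 assumes fp16/bf16."""
    # Factor of 2 covers keys plus values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class config: 32 layers, 32 KV heads, head_dim 128,
# 4k context, batch of 8 concurrent sequences.
gb = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=4096, batch=8) / 1e9
print(f"{gb:.1f} GB")  # → 17.2 GB
```

The point of exercises like this: the cache grows linearly with both batch size and sequence length, which is why it dominates memory planning long before model weights do on constrained hardware.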

Who subscribes:

Engineers who own inference budgets or latency SLAs. People deploying to environments with real constraints—hardware, network, regulatory. Practitioners who want the implementation, not the overview.

If you’re the person on the team who actually knows what a KV cache is and why it matters, you’re probably in the right place.

2–3 posts per week. Runnable code in every post. No vendor drama.

Subscribe to The Inference Lab

Field notes on LLM inference and agentic AI. Runnable code in every post.