
2 posts tagged with "nvidia"


TTFT-Driven Autoscaling for Disaggregated LLM Inference with NVIDIA Dynamo on AKS

16 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt
Mohamad Al Jazaery
Principal Solution Engineer, Azure Global Black Belt

Most inference autoscalers react to CPU or GPU utilization. But for large language models, the metric that actually matters to users is Time To First Token (TTFT): how long they wait before the response starts streaming. A GPU can be 60% utilized and still deliver 30-second TTFTs under a burst of long-context requests.

In this post, we'll show how to wire NVIDIA Dynamo's disaggregated inference together with KEDA on AKS so that the system autoscales the decode-worker fleet directly on TTFT p99, using Azure Managed Prometheus as the metric source and AKS-managed GPU drivers so there is no NVIDIA GPU Operator to maintain.
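To make the scaling signal concrete, here is a minimal sketch of the kind of PromQL query a KEDA Prometheus trigger could evaluate against Azure Managed Prometheus. The metric name `dynamo_ttft_seconds_bucket` and the workspace endpoint are placeholders, not values taken from the post; the full walkthrough is in the article itself.

```python
# Illustrative sketch: read a TTFT p99 estimate from Azure Managed Prometheus,
# i.e. the same PromQL a KEDA prometheus trigger could be configured to watch.
# Assumptions: the Dynamo frontend exports a latency histogram named
# `dynamo_ttft_seconds_bucket`, and PROM_QUERY_ENDPOINT points at your
# Azure Monitor workspace's Prometheus-compatible query endpoint.
import requests
from azure.identity import DefaultAzureCredential

PROM_QUERY_ENDPOINT = "https://<your-workspace>.<region>.prometheus.monitor.azure.com"
TTFT_P99_QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(dynamo_ttft_seconds_bucket[5m])) by (le))"
)


def ttft_p99_seconds() -> float:
    """Return the current TTFT p99 in seconds, as the autoscaler would see it."""
    # Azure Managed Prometheus requires an Entra ID token for queries.
    token = DefaultAzureCredential().get_token(
        "https://prometheus.monitor.azure.com/.default"
    )
    resp = requests.get(
        f"{PROM_QUERY_ENDPOINT}/api/v1/query",
        params={"query": TTFT_P99_QUERY},
        headers={"Authorization": f"Bearer {token.token}"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    print(f"TTFT p99: {ttft_p99_seconds():.2f}s")
```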

NVIDIA Dynamo on AKS: Disaggregated LLM Inference with H100 GPUs

15 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt
Mohamad Al Jazaery
Principal Solution Engineer, Azure Global Black Belt

You've got your AKS cluster, your GPU quota is approved, and you're ready to serve large language models. But picking the right inference stack — vLLM, TensorRT-LLM, SGLang, disaggregated vs. unified — can cost you days before your first token lands.

That's the gap NVIDIA Dynamo fills.