
TTFT-Driven Autoscaling for Disaggregated LLM Inference with NVIDIA Dynamo on AKS

· 16 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt
Mohamad Al Jazaery
Principal Solution Engineer, Azure Global Black Belt

Most inference autoscalers react to CPU or GPU utilization. But for large language models the metric that actually matters to users is Time To First Token (TTFT) — how long they wait before the response starts streaming. A GPU can be 60% utilized and still be delivering 30-second TTFT under a burst of long-context requests.

In this post I'll show how to wire NVIDIA Dynamo disaggregated inference together with KEDA on AKS so that the system autoscales the decode worker fleet directly on TTFT p99 — using Azure Managed Prometheus as the metric source and AKS-managed GPU drivers so there is no NVIDIA GPU Operator to maintain.
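
To make the idea of scaling on TTFT p99 concrete, here is a minimal sketch of the arithmetic involved. It is not the post's actual KEDA configuration: the function names, the 2-second target, and the replica bounds are all hypothetical. The scaling step uses roughly the proportional rule the Kubernetes Horizontal Pod Autoscaler applies (KEDA drives an HPA under the hood), and the percentile is computed over raw samples with the nearest-rank method rather than from Prometheus histogram buckets.

```python
import math


def p99(samples):
    """Nearest-rank 99th percentile of a non-empty list of TTFT samples (seconds)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.99 * n) computed with integers to avoid float rounding surprises
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def desired_replicas(current, observed_ttft, target_ttft, min_replicas=1, max_replicas=8):
    """Proportional scaling: grow the fleet by the ratio of observed to target TTFT.

    target_ttft, min_replicas and max_replicas are illustrative assumptions.
    """
    raw = math.ceil(current * observed_ttft / target_ttft)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical burst: most requests start streaming quickly, a few wait seconds.
ttft_samples = [0.2] * 98 + [4.0, 6.0]
observed = p99(ttft_samples)
replicas = desired_replicas(current=2, observed_ttft=observed, target_ttft=2.0)
print(observed, replicas)
```

In this made-up burst the tail latency, not the average, drives the decision: the mean TTFT stays low, yet the p99 is twice the assumed target, so the sketch doubles the decode worker count.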

NVIDIA Dynamo on AKS: Disaggregated LLM Inference with H100 GPUs

· 15 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt
Mohamad Al Jazaery
Principal Solution Engineer, Azure Global Black Belt

You've got your AKS cluster, your GPU quota is approved, and you're ready to serve large language models. But picking the right inference stack — vLLM, TensorRT-LLM, SGLang, disaggregated vs. unified — can cost you days before your first token lands.

That's the gap NVIDIA Dynamo fills.

Continuous Profiling on AKS with Pyroscope, Blob Storage, and Managed Grafana

· 20 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt
Post Updates

2026-05-20 — Updated based on lessons learned from a live deployment:

  • Removed hardcoded pyroscope.image.tag from values-azure.yaml to prevent chart/image version mismatches when the chart is upgraded
  • Added pyroscope.extraLabels with azure.workload.identity/use: "true" to propagate the label to all pod templates (the chart uses extraLabels, not podLabels)
  • Pinned --version 2.0.1 in the helm upgrade --install command
  • Added a Troubleshooting callout documenting the two most common crash patterns and their fixes

You deploy your workloads on AKS and collect metrics with Prometheus and logs with Loki. But when latency spikes hit, you stare at dashboards knowing something is slow without knowing where in your code the time is being spent.

That's the gap continuous profiling fills.

ARO Storage Accounts: Under the Hood

· 8 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt

You create an Azure Red Hat OpenShift cluster, and minutes later, you notice something interesting in the managed resource group: two storage accounts with cryptic names like cluster1a2b3c4d5e and imageregistry1a2b3c4d5e.

What are they for? Why two? And what happens if you accidentally delete one?

Tools of the Trade: Working with Multiple Clusters

· 5 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt

Welcome to "Tools of the Trade", a series where we share the tools and workflows that help us work more effectively. In this first post, I'll show you how I manage multiple AKS clusters without losing track of which cluster I'm working on. If you've ever accidentally deployed to the wrong cluster, this one's for you.

Deploying Azure Red Hat OpenShift with Managed Identities

· 6 min read
Diego Casati
Principal Cloud Architect, Azure Global Black Belt

When deploying Azure Red Hat OpenShift (ARO) clusters, managing authentication and authorization for various cluster components traditionally relies on service principals or other credential-based approaches. This introduces operational overhead and potential security risks related to credential rotation and management.

Getting started with Anyscale running on Azure

· 12 min read
Steve Griffith
Principal Cloud Architect, Azure Global Black Belt

In this walkthrough, we'll set up a basic AKS cluster to get you up and running quickly with the Anyscale platform, using AKS as the compute backend. We'll run this cluster in our own Azure Virtual Network and connect it to an Azure Blob Storage account on that VNet. Finally, we'll run the basic Anyscale 'Hello World' demo on that compute.

When Infrastructure Scales But Understanding Doesn't

· 7 min read
Ray Kao
Principal Cloud Architect, Azure Global Black Belt
Diego Casati
Principal Cloud Architect, Azure Global Black Belt

We all know this, even if we don't like to admit it: modern infrastructure can scale infinitely, but human understanding doesn't.

We've all seen it happen—organizations going from managing dozens of servers to thousands of containers, from deploying weekly to deploying hundreds of times per day, from serving thousands of users to millions. The technology handled the scale beautifully. The humans? Not so much.

This is the first industry issue that platform engineering should be addressing: how do we manage infrastructure complexity that has outgrown not just individual cognitive capacity, but our collective ability to communicate and transfer knowledge as teams?