The True Workhorses of AI
February 23, 2026
Models and model training get all the hype in AI; it is truly amazing that humanity can now produce models with trillions of parameters that encode so much of human knowledge. However, there is an unsung hero in moving the world forward: inference engines. When you interact with ChatGPT, Claude, Gemini, or Grok, you are interacting with an interconnected set of chips in a data center that store many copies of these trained models and generate not only your response but also the responses of the hundreds of thousands or millions of other users who may be interacting with that model at the same time.
What needs to happen when you make a request? At the most basic level, your input needs to be tokenized, those tokens need to be fed through the many layers of the trained model, and output tokens need to be generated auto-regressively (one token at a time) until the model decides that it's done.
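The loop can be sketched as follows; the tokenizer and "model" below are toy stand-ins (real engines run a full transformer forward pass and a learned tokenizer), and the end-of-sequence convention is illustrative:

```python
EOS = -1  # hypothetical end-of-sequence token id

def tokenize(text: str) -> list[int]:
    # Toy tokenizer: one token per character (real systems use BPE or similar)
    return [ord(c) for c in text]

def toy_model(tokens: list[int]) -> int:
    # Stand-in for a transformer forward pass over the full context;
    # emits "previous token + 1" until the context reaches 8 tokens
    return tokens[-1] + 1 if len(tokens) < 8 else EOS

def generate(prompt: str, max_new_tokens: int = 16) -> list[int]:
    tokens = tokenize(prompt)
    generated = []
    for _ in range(max_new_tokens):
        nxt = toy_model(tokens)
        if nxt == EOS:        # the model "decides that it's done"
            break
        tokens.append(nxt)    # auto-regression: feed the new token back in
        generated.append(nxt)
    return generated
```

Note that every generated token depends on all the tokens before it; this sequential dependency is what makes the decode phase discussed later so different from prefill.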
The largest models will just barely fit on a single rack. As an example, NVIDIA's latest racks, the GB300 NVL72 racks, pack 36 Grace Blackwell Ultra Superchips, each containing two Blackwell Ultra GPUs with 288 GB of HBM3e apiece—72 GPUs total, for nearly 21 TB of GPU memory; moreover, they are connected by a proprietary NVIDIA technology, NVLink, which allows these GPUs to effectively act as one single large GPU. Taking Grok 5 as an example of a model at the absolute frontier: Elon has said it will be a 6T-parameter model, which, stored in FP16, is 12 TB of memory, so it could fit on one of these "single-GPU-like" racks. With the 300K+ GPUs xAI claims to have in their data center, and assuming 80% of those are used for inference, they'll have roughly 3,000 copies of Grok 5 running in parallel across the cluster, serving requests (the exact number depends on how many GB200s vs. GB300s they have)!
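The back-of-the-envelope numbers above can be checked in a few lines (assuming 1 TB = 10^12 bytes and that each model copy occupies a full 72-GPU rack):

```python
PARAMS = 6e12                # rumored Grok 5 parameter count
BYTES_PER_PARAM = 2          # FP16

weights_tb = PARAMS * BYTES_PER_PARAM / 1e12   # 12.0 TB of weights
rack_hbm_tb = 72 * 288 / 1000                  # ~20.7 TB per GB300 NVL72 rack
assert weights_tb < rack_hbm_tb                # fits on one rack

gpus_for_inference = 300_000 * 0.8             # 80% of the cluster
copies = gpus_for_inference / 72               # ~3,333 parallel copies
```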
Typically, when interacting with ChatGPT, for example, you ask a question, and under the hood OpenAI will process not just your question but also the system prompt for ChatGPT. So two things are processed: 1) the system prompt and 2) your question. After ChatGPT produces its response, if you ask another question, it will feed back 1) the system prompt, 2) your question, 3) its response, and 4) your new question. You can see already that 1) and 2) are being "processed twice"; this naturally leads to the question of whether we can reduce this redundancy.
It turns out that in the transformer layers there are three intermediate quantities produced: keys, queries, and values. Each of these is produced through matrix multiplications. The bulk of "attention" involves the queries for each token looking at the keys and values of all past tokens. Thus, for the 1), 2), 3), and 4) example above, we can cache the keys and values for 1) and 2), so that when a user types a new message, they are simply read from GPU memory instead of being recomputed. This cache is known as the KV cache.
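Here's a minimal single-head sketch of why keys and values can be cached while queries cannot: during decode, only the new token's query is needed, and it attends over all cached keys and values. The dimensions and random weight matrices are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # One new token's query attends over all cached keys/values
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# "Prefill": compute K/V for the prompt once and keep them (the KV cache)
prompt = rng.standard_normal((5, d))   # 5 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# "Decode": for each new token, only its own q/k/v must be computed;
# the prompt's K and V are read from the cache rather than recomputed
new_tok = rng.standard_normal(d)
K_cache = np.vstack([K_cache, new_tok @ Wk])
V_cache = np.vstack([V_cache, new_tok @ Wv])
out = attend(new_tok @ Wq, K_cache, V_cache)
```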
Much of doing inference well involves managing this KV cache: if some portion of the GPU's memory is used for the model weights, some other portion must be used for storing the KV cache and using it for computation.
Which parts of the KV cache you keep on the GPUs versus offload to the CPU becomes important. First-order algorithms, such as keeping the most recently used "KV cache blocks" (more on this shortly) on the GPU, are a good starting point.
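A minimal sketch of such a first-order policy, using an ordered dict as an LRU structure; the "gpu" and "cpu" dicts are stand-ins for real device memory, and the block contents are placeholder strings:

```python
from collections import OrderedDict

class KVBlockManager:
    """Keep the most recently used KV blocks on the GPU; offload the rest."""

    def __init__(self, gpu_capacity_blocks: int):
        self.capacity = gpu_capacity_blocks
        self.gpu = OrderedDict()  # block_id -> block data, in LRU order
        self.cpu = {}             # offloaded blocks

    def access(self, block_id, data=None):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)        # mark as most recently used
        else:
            block = self.cpu.pop(block_id, data)  # fetch back, or insert new
            if len(self.gpu) >= self.capacity:    # evict the LRU block to CPU
                old_id, old_block = self.gpu.popitem(last=False)
                self.cpu[old_id] = old_block
            self.gpu[block_id] = block
        return self.gpu[block_id]

mgr = KVBlockManager(gpu_capacity_blocks=2)
mgr.access("a", "block A")
mgr.access("b", "block B")
mgr.access("c", "block C")  # GPU is full: "a" is offloaded to CPU
```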
Naively, when a GPU is computing KV cache for a sequence, it might pre-allocate the amount of GPU memory it thinks it will need, keeping the entire KV cache for the sequence contiguous in a single block. This can lead either to under-allocation, forcing the expensive "allocate an even larger chunk of GPU memory, copy the data over, and free the old memory" operation, which kills latency, or to over-allocation, which wastes memory and kills throughput.
Borrowing an idea from operating systems, storing the KV cache for individual sequences in small blocks, which can then be moved into and out of GPU memory individually, avoids these memory allocation issues and results in higher GPU memory utilization and thus higher throughput. This is a technique known as "PagedAttention", introduced by the team behind vLLM. The original PagedAttention paper showed that if the KV cache is stored the naive way, GPU memory utilization can be as low as 20% of what it could be! Paging not only improves throughput; it also reduces Time to First Token.
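The core allocation idea can be sketched in a few lines; the block size, names, and bookkeeping here are illustrative, not vLLM's actual implementation:

```python
BLOCK_SIZE = 16  # tokens per KV block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # seq_id -> list of block ids

    def append_token(self, seq_id: int, pos: int):
        table = self.tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:              # current block is full (or first token)
            table.append(self.free.pop())      # allocate exactly one more block
        # a real engine would now write this token's K/V into block
        # table[-1] at offset pos % BLOCK_SIZE

    def free_sequence(self, seq_id: int):
        self.free.extend(self.tables.pop(seq_id))  # blocks become reusable

cache = PagedKVCache(num_blocks=64)
for pos in range(40):                  # a 40-token sequence
    cache.append_token(seq_id=0, pos=pos)
# 40 tokens need ceil(40 / 16) = 3 blocks: no reallocation, no copying
```

Because memory grows one small block at a time, there is never a large contiguous reservation to outgrow or to leave half-empty.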
Breaking the KV cache into blocks, and storing those blocks in clever data structures, enables further optimizations. Suppose another user interacts with ChatGPT; if the inference engine as a whole stores a mapping over the KV cache of sequences that have already been computed, then a new request that shares a prefix with one of those sequences can reuse the already-computed KV cache and produce its answer faster. This technique is called RadixAttention, introduced in SGLang, which maintains an LRU cache of the KV cache for all requests in a radix tree. When a new request comes in, we can quickly look up in this tree whether the leading tokens of the request correspond to a chunk of KV cache that has already been computed.
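A simplified version of this lookup, using a plain trie over token ids rather than a true radix tree (and ignoring eviction), might look like:

```python
class PrefixNode:
    def __init__(self):
        self.children = {}  # token id -> PrefixNode

class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        # Record that KV cache now exists for every prefix of `tokens`
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())

    def cached_prefix_len(self, tokens) -> int:
        # How many leading tokens of a new request already have KV cache?
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, n = node.children[t], n + 1
        return n

tree = PrefixTree()
tree.insert([1, 2, 3, 4, 5])                   # e.g. system prompt + question
hit = tree.cached_prefix_len([1, 2, 3, 9, 9])  # new request shares [1, 2, 3]
```

Only the tokens past the matched prefix need prefill; a radix tree compresses runs of single-child nodes, but the lookup semantics are the same.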
Speaking of multiple users: you are likely aware that neural networks are trained in batches. Given that the bulk of neural network computation is matrix multiplication, and GPUs love matrix multiplications, batches are a great way to get more throughput out of your system, since computation time per sequence decreases as the number of sequences grows. We can apply the same idea to inference: when a new request comes in, instead of processing it immediately, we can wait a (hopefully) small amount of time to batch it with other sequences, so that the GPU can process them all at once.
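Sketched below with invented thresholds; a real scheduler would block on a condition variable rather than spin, and modern engines use continuous batching rather than this simple wait-and-collect:

```python
import time

MAX_BATCH = 8
MAX_WAIT_S = 0.005  # wait at most ~5 ms before running a partial batch

def form_batch(queue, now=time.monotonic):
    batch, deadline = [], now() + MAX_WAIT_S
    while len(batch) < MAX_BATCH and now() < deadline:
        if queue:
            batch.append(queue.pop(0))  # take the oldest waiting request
    return batch

pending = list(range(20))    # stand-ins for queued requests
batch = form_batch(pending)  # queue is deep, so the batch fills immediately
```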
This brings up a natural tradeoff: using small batch sizes means that for individual requests that come in, we can generate tokens very quickly. However, while an individual GPU is processing a request for a single user or a small number of users, it is forgoing the opportunity to serve many more users at once by waiting for a larger batch.
There is thus a fundamental tradeoff between operating inference engines for high interactivity (primarily "tokens per second per user", though there's a critical second component of interactivity discussed below) by using small batch sizes, and operating them for high throughput (tokens per second per GPU) by using large ones.
When GPU buyers, whether the major clouds like AWS, Azure, GCP, and OCI, or large companies like Meta or xAI, evaluate different chips for inference (NVIDIA GPUs, AMD GPUs, or custom silicon), the decision over a 3-to-5-year lifecycle ultimately boils down to a single, uncompromising metric: Cost per unit of "goodput". Everything else flows from this.
Goodput is tokens produced while meeting target levels of both components of interactivity: 1) tokens per second, which is what users feel most acutely and is a function of decode, and 2) time to first token, which is the time that the prefill stage takes. You can think of goodput as "throughput given that you are meeting your interactivity targets". Buyers want to know this goodput divided by Total Cost of Ownership (TCO), which incorporates both the initial CapEx of the chips and networking fabric and the OpEx of power draw, cooling, and maintenance. This metric can "cut either way": an extremely powerful rack of chips that consumes a ton of power, fails often, and requires highly trained specialists to bring it back online when it does fail may have a TCO high enough that it isn't worth it; on the other hand, a cheap set of chips whose GPUs aren't linked together with advanced networking such as NVLink/InfiniBand, and which therefore can't churn out as many tokens per GPU or power an inference engine that efficiently does the disaggregated prefill/decode routing we'll discuss shortly, may not work either.
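A toy comparison makes the "cut either way" point concrete; every number below is invented for illustration, not a vendor figure:

```python
def goodput_per_dollar(tokens_per_s, within_slo_frac, capex, annual_opex, years=4):
    goodput = tokens_per_s * within_slo_frac  # tokens/s meeting interactivity targets
    tco = capex + annual_opex * years         # CapEx plus lifetime OpEx
    return goodput / tco

# Rack A: powerful but power-hungry and failure-prone (misses SLOs more often)
rack_a = goodput_per_dollar(1_000_000, 0.85, capex=3_000_000, annual_opex=500_000)
# Rack B: slower chips, but cheap to buy and to run
rack_b = goodput_per_dollar(600_000, 0.95, capex=1_500_000, annual_opex=200_000)
# With these (invented) numbers, the cheaper rack wins on cost per goodput
```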
Now that we've covered the economics, let's go back to covering the software that makes inference engines work. As you may be able to extract from above, there are two "phases" of doing inference on a given request, and these phases are known as prefill and decode.
Prefill involves computing the KV cache for a request, given its tokens. Decode involves generating output tokens one at a time, repeatedly reading the KV cache from HBM on the GPU into the Streaming Multiprocessors where the computation happens.
These have very different characteristics: prefill processes all of a request's tokens at once, so it is dominated by large matrix multiplications and is compute-bound; decode produces one token at a time, re-reading the entire KV cache for each new token, so it is bound by memory bandwidth.
The fact that prefill and decode have such different characteristics leads to the idea of using separate GPU pools. If the inference engine starts getting more requests requiring extensive prefill, so that time to first token increases, the engine can shift resources from the decode pool to the prefill pool. If it starts getting requests requiring extensive decode, so that "time per output token" or "inter-token latency" increases, it can similarly shift resources from prefill to decode.
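A sketch of such a pool-shifting controller; the SLO thresholds and the one-GPU-at-a-time step are invented for illustration:

```python
def rebalance(prefill_gpus, decode_gpus, ttft_ms, itl_ms,
              ttft_slo_ms=500, itl_slo_ms=50):
    """Shift one GPU toward whichever pool is violating its latency target."""
    if ttft_ms > ttft_slo_ms and decode_gpus > 1:
        return prefill_gpus + 1, decode_gpus - 1  # prefill is the bottleneck
    if itl_ms > itl_slo_ms and prefill_gpus > 1:
        return prefill_gpus - 1, decode_gpus + 1  # decode is the bottleneck
    return prefill_gpus, decode_gpus              # both targets are being met
```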
Compared to doing both the prefill and decode stages on one GPU or rack of GPUs, having separate prefill and decode pools (which is known as "disaggregated serving") does introduce one additional step: the need to transfer the KV cache from the prefill stage to the decode stage once it has been computed.
NVIDIA's Dynamo inference framework includes NIXL, a library purpose-built for fast KV cache transfer between disaggregated prefill and decode pools. NIXL leverages GPUDirect RDMA over high-speed datacenter fabrics (like InfiniBand), allowing decode-pool GPUs to read KV cache directly from prefill-pool GPU memory with minimal latency, significantly mitigating the overhead of this extra transfer step. These technologies widen the gap between inference engines that use disaggregated serving and those that don't.
The different compute requirements for the prefill and decode stages lead to another way we can "configure" the prefill and decode pools differently for optimal performance: we can use several kinds of parallelism for prefill, and, with very recent innovations in NVIDIA's racks, we can even apply some parallelism to decode to get more speedups there.
Because all tokens are computed in parallel during prefill, we can use multiple kinds of parallelism: tensor parallelism (splitting each layer's weight matrices across GPUs), pipeline parallelism (splitting the model's layers into stages across GPUs), and, for Mixture-of-Experts models, expert parallelism (distributing the experts across GPUs).
Prefill is "compute-bound" and generally likes GPUs configured for parallelism because we can overlap computation with communication. This means that, if we use tensor parallelism for example, while we are sending activations from GPU to GPU (the "communication" step), we can be doing other computations relevant to the prefill stage. In practice, optimal prefill configurations use multiple parallelism strategies simultaneously: for a GPT 1.8T MoE model on 64 GPUs, NVIDIA found the best throughput at human reading speed using "TP2EP16PP2", meaning tensor parallelism of degree 2, expert parallelism of degree 16, and pipeline parallelism of degree 2 (2 × 16 × 2 = 64 GPUs).
By contrast, decode is "memory-bound" and generally doesn't like parallelism: since it is a long series of small computations, we cannot overlap computation with communication in the same way. However, recent advances in GPU-to-GPU communication, especially the NVLink fabric connecting the 72 GPUs of a Blackwell rack, have made "wide expert parallelism" viable during decode: by putting fewer experts on each GPU, it may make decode for individual sequences slightly slower (lower interactivity), but it allows for much greater throughput. Specifically, with DeepSeek R1, NVIDIA found that moving from "EP8" to "EP32" resulted in a 1.8x increase in throughput at the same level of interactivity. A recent SemiAnalysis X post reinforces that many leading chip makers and model makers are realizing the benefits of wide EP.
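One intuition for why wide EP helps throughput, with an invented per-expert footprint (DeepSeek-V3/R1-style models route among 256 experts): fewer experts per GPU leaves more HBM free for KV cache, which permits larger decode batches:

```python
NUM_EXPERTS = 256   # routed experts in a DeepSeek-V3/R1-style model
HBM_GB = 288        # per-GPU HBM on a Blackwell Ultra GPU
EXPERT_GB = 0.4     # invented per-expert weight footprint

def kv_headroom_gb(ep_degree: int) -> float:
    # With expert parallelism of degree `ep_degree`, each GPU holds
    # NUM_EXPERTS / ep_degree experts; the remaining HBM is headroom
    # for KV cache (ignoring activations and other overheads)
    experts_per_gpu = NUM_EXPERTS // ep_degree
    return HBM_GB - experts_per_gpu * EXPERT_GB

# EP8 -> 32 experts/GPU; EP32 -> 8 experts/GPU: more room for KV cache
```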
That's it for this blog post. In my next post, I'll cover two more techniques critical to inference engines, both of which work even without disaggregated serving: quantization and speculative decoding.