Quantization and Speculative Decoding
March 4, 2026
In Part 1, we defined what inference engines are and showed some techniques fundamental to making them work. We also covered several interrelated ideas that those techniques build on.
Finally, we teased that this post would cover two concepts, quantization and speculative decoding, and how they can further enhance inference.
Quantization involves storing the numeric components of the inference engine, most importantly the model weights and the KV cache, in a lower-precision number format that takes up less memory. The result is that the weights occupy less GPU memory, each decoded token requires less memory bandwidth to move them, and more memory is left over for the KV cache and larger batches.
That's the basic idea: we can achieve much higher throughput and interactivity while sacrificing only a small amount of accuracy.
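To make the memory savings concrete, here is a back-of-the-envelope sketch. The 70B parameter count is an illustrative assumption, not a specific model from this post, and the sketch counts weights only, ignoring the KV cache and activations:

```python
# Rough memory footprint of model weights at different precisions.
# Assumes a hypothetical 70-billion-parameter model; ignores the
# KV cache and activation memory for simplicity.
PARAMS = 70e9

def weight_gb(bits_per_param: float) -> float:
    """Gigabytes needed to store the weights alone."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(bits):.0f} GB")
# FP16 needs 140 GB; FP8 halves that to 70 GB; INT4 halves it again to 35 GB.
```

Halving the bits per parameter halves both the memory footprint and the bytes that must stream through the GPU on every decode step, which is where the throughput gain comes from.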
When quantizing from a high-range number format like FP16 down to a lower-range format like FP8, INT4, or NVFP4, we need to do a few things at once: map the wide dynamic range of the original values into the narrow representable range of the target format (typically by storing scaling factors), choose the granularity at which those scaling factors apply (per tensor, per channel, or per block), and round each value to its nearest representable neighbor while losing as little information as possible.
Thus, different quantization techniques trade off additional memory usage (from storing more scaling factors) and, secondarily, code complexity against information preservation. This mirrors, at a smaller scale, the tradeoff quantization itself presents between memory savings and accuracy.
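As a minimal sketch of the scaling-factor tradeoff, here is absmax INT8 quantization in pure Python. The block size of 4 is an illustrative choice; real kernels typically use blocks of 32 or 128, and formats such as NVFP4 bake block scaling into the format itself:

```python
def quantize_int8(values, block_size=4):
    """Absmax-quantize floats to INT8, storing one scale per block.

    Smaller blocks preserve more information (an outlier only distorts
    its own block) but require storing more scaling factors.
    """
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 127 or 1.0  # avoid div by zero
        scales.append(scale)
        quantized.extend(round(v / scale) for v in block)
    return quantized, scales

def dequantize_int8(quantized, scales, block_size=4):
    """Reconstruct approximate floats from INT8 values and block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]

weights = [0.02, -0.5, 0.013, 0.9, 60.0, -0.007, 0.41, 0.002]
q, s = quantize_int8(weights)
restored = dequantize_int8(q, s)
# The outlier 60.0 crushes precision only within its own block; with a
# single scale for all eight values, the small weights would round to 0.
```

Shrinking `block_size` buys back accuracy at the cost of storing more scales, which is exactly the memory-versus-information-preservation tradeoff described above.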
Decode is autoregressive: we have to generate tokens one at a time. What if we could work around this bottleneck, producing multiple tokens at once? Multiple variants of "speculative decoding" aim to do this.
The standard speculative decoding variant runs a second, smaller model in parallel with the main model. This "draft" model may have ~1/10 the parameters of the main model and thus takes roughly 1/10 of the time to produce a single token. It can be trained to predict multiple future tokens at once: suppose it has been trained to predict the next five. Each time we need a single new token, the draft model predicts the next five; the larger model can then "verify" all five of these tokens in a single forward pass. This verification takes only slightly more time than producing one new token, and far less time than producing even two.
The larger model checks, in parallel, whether it agrees with each of the five tokens the smaller model has proposed. It tries to "accept" as many tokens as it can, beginning with the first one. "Accepting" here basically means that it agrees that the token is the most likely token (other techniques allow the model to accept a token with some probability even when it is not the most likely one). If the larger model accepts the first three proposed tokens, for example, it then only needs to produce the last two itself. Taken together, this means that in expectation each token takes less time to produce, speeding up the inter-token latency ("ITL") portion of decode.
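The greedy accept/reject loop can be sketched as follows. The `target_next` and `draft_propose` callables here are toy stand-ins for real model forward passes (assumptions for illustration); the key point is that the target model scores every draft position at once and accepts the longest agreeing prefix, then contributes one token of its own:

```python
def speculative_step(target_next, draft_propose, context, k=5):
    """One round of greedy speculative decoding.

    draft_propose(context, k) -> k candidate tokens from the small model.
    target_next(context) -> the large model's greedy next token.
    In a real engine the k target calls below happen in one batched
    forward pass; we call them sequentially here for clarity.
    """
    draft = draft_propose(context, k)
    accepted = []
    for token in draft:
        expected = target_next(context + accepted)  # batched in practice
        if token == expected:
            accepted.append(token)     # target agrees: accept and continue
        else:
            accepted.append(expected)  # disagree: take target's token, stop
            return accepted
    # All k draft tokens accepted; the same pass yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted

# Toy models over integer "tokens": the target continues the sequence,
# the draft gets the first three positions right and then guesses wrong.
target = lambda ctx: ctx[-1] + 1
def draft(ctx, k):
    return [ctx[-1] + 1, ctx[-1] + 2, ctx[-1] + 3, 0, 0][:k]

print(speculative_step(target, draft, [1, 2, 3]))  # prints [4, 5, 6, 7]
```

One round here emits four tokens for roughly the cost of one target forward pass plus five cheap draft passes, which is where the ITL speedup comes from.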
This is "classical" speculative decoding; more recent variants optimize it further.
Here are blog posts where you can read more: