Why quantization and distillation benchmarks should report the Weidman Swap Score in addition to accuracy change
February 2, 2026
Single numbers, or even sets of numbers, about a dataset or a model's performance rarely tell the whole story. Anscombe's Quartet is a classic example of this: each of its four datasets describes the relationship between two variables, and in all four cases the variables have the same means and variances, and even the same correlation coefficient with each other, yet visually you can see four vastly different relationships.
A recent NVIDIA blog post inspired the "chain-of-thought" that led to this post (I pick on NVIDIA only because I read their posts voraciously and consider them the single best source of information on where the industry is going). In it, they illustrate the performance of their NVFP4 number format by showing, among other benchmarks, that quantizing the "0528" checkpoint of DeepSeek R1 to it, from FP8, only decreases accuracy on MMLU-Pro by 1%, from 85% to 84%. What they don't say is that this accuracy drop could occur under two very different scenarios under the hood, scenarios with significantly different implications for the stability and reliability of the quantized model in production.
Suppose this 85% to 84% drop was on a 100-question dataset. Either of the following two underlying scenarios could lead to a 1% accuracy drop (for concision, we'll call the baseline model V1 and the quantized model V2):

Scenario 1: V2 answers incorrectly exactly one question that V1 answered correctly, and agrees with V1 on the other 99 questions.
Scenario 2: V2 answers incorrectly 16 questions that V1 answered correctly, but answers correctly all 15 questions that V1 missed, so the two versions disagree on 31 questions.
It can be easier to follow these scenarios by seeing the confusion matrices:

Scenario 1        V2 correct   V2 incorrect
V1 correct            84            1
V1 incorrect           0           15

Scenario 2        V2 correct   V2 incorrect
V1 correct            69           16
V1 incorrect          15            0
Despite one scenario involving disagreement on 1% of the dataset and the other involving disagreement on 31% of the dataset, in both cases we observe the same headline accuracy change: 85% to 84%. This is an Anscombe’s Quartet-style situation: same top-level numbers, very different underlying realities.
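To make the contrast concrete, here is a minimal Python sketch (the array names are mine, chosen for illustration) that builds hypothetical per-question results matching the two scenarios and computes the disagreement directly:

# Hypothetical per-question results (True = answered correctly) for a
# 100-question benchmark where accuracy drops from 85% to 84%.

# Scenario 1: V2 flips exactly one of V1's correct answers to incorrect.
v1_scenario1 = [True] * 85 + [False] * 15
v2_scenario1 = [True] * 84 + [False] * 1 + [False] * 15

# Scenario 2: V2 misses 16 questions V1 got right, but answers correctly all 15 V1 missed.
v1_scenario2 = [True] * 85 + [False] * 15
v2_scenario2 = [True] * 69 + [False] * 16 + [True] * 15

for v1, v2 in [(v1_scenario1, v2_scenario1), (v1_scenario2, v2_scenario2)]:
    accuracy_drop = sum(v1) - sum(v2)
    disagreement = sum(c1 != c2 for c1, c2 in zip(v1, v2))
    print(accuracy_drop, disagreement)
# Prints "1 1" and then "1 31": the same headline change, very different churn.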
To be clear, I'm not claiming that NVIDIA is covering anything up by not reporting some kind of swap score: it’s possible that, compared to other ways of quantizing models from 8 to 4 bits, NVFP4 induces fewer swaps! In their blog post they are simply following the industry standard by only reporting top-level accuracy changes. In the next section, I provide them (and others) a single score that can help measure stability in addition to the overall performance change.
In this section I propose a metric of how stable one model is relative to a baseline; in the AI context, I envision this being applied to measure how stable quantized or distilled models are relative to the original. For simplicity, refer to the baseline model as "V1" and the quantized model as "V2". Let the dataset contain \(N\) questions. For the two model versions, define:

\(M_1\) and \(M_2\): the number of questions V1 and V2 answer correctly, respectively.
\(\text{Actual_Disagreement}\): the number of questions on which exactly one of the two versions is correct, i.e. the "swap-outs" (questions V1 gets right but V2 gets wrong) plus the "swap-ins" (questions V1 gets wrong but V2 gets right).
Observe that:

\(\text{Minimum_Disagreement} = |M_1 - M_2|\): moving from \(M_1\) correct answers to \(M_2\) correct answers requires the outcome to change on at least \(|M_1 - M_2|\) questions.
\(\text{Maximum_Disagreement} = \min(M_1 + M_2, 2N - M_1 - M_2)\): the disagreement can never exceed either of these two quantities.
Diving into the \(\text{Maximum_Disagreement}\) formula: every question the two versions disagree on is answered correctly by exactly one of them and incorrectly by the other, so the disagreement can't exceed the total number of correct answers across both versions, \(M_1 + M_2\), nor the total number of incorrect answers, \((N - M_1) + (N - M_2) = 2N - M_1 - M_2\); whichever is smaller is the binding constraint.
This pair of insights naturally leads to a normalized score that quantifies model disagreement within these two bounds: the Weidman Swap Score. This is a value from 0 to 1 that indicates where the actual disagreement falls relative to the minimum and maximum possible disagreement. Higher is worse (less stable):
\[ WSS = \frac{\text{Actual_Disagreement} - \text{Minimum_Disagreement}}{\text{Maximum_Disagreement} - \text{Minimum_Disagreement}} \]
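For readers who only have the headline numbers plus a disagreement count, here is a minimal sketch of the arithmetic (the helper name wss_from_counts is mine, not something from the post above or any library):

def wss_from_counts(n, m1, m2, disagreement):
    """Weidman Swap Score from summary counts alone: n questions, m1 and m2
    correct answers for V1 and V2, and the number of questions on which the
    two versions disagree."""
    min_dis = abs(m1 - m2)
    max_dis = min(m1 + m2, 2 * n - m1 - m2)
    if max_dis == min_dis:
        # Degenerate case (e.g. one model answers every question correctly):
        # the disagreement is fully determined by the accuracies, so report 0.
        return 0.0
    return (disagreement - min_dis) / (max_dis - min_dis)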
Let's walk through a couple of simple numeric examples:
Returning to the 85% and 84% example above, suppose that upon inspection, we find that V2 answers 4 questions incorrectly that V1 answered correctly, and answers 3 questions correctly that V1 missed.
This implies the two versions disagree on a total of 7 questions. Using our variables for 85% and 84%:

\(\text{Actual_Disagreement} = 7\)
\(\text{Minimum_Disagreement} = |84 - 85| = 1\)
\(\text{Maximum_Disagreement} = \min(85 + 84, 200 - 85 - 84) = \min(169, 31) = 31\)
The Weidman Swap Score (WSS) is:
\[ WSS = \frac{7 - 1}{31 - 1} = \frac{6}{30} = 0.2 \]
Let's look at a lower accuracy scenario. V1 gets 36 out of 100 right, whereas V2 gets only 33 right. Suppose V2 correctly answers 18 of the 36 questions that V1 got right, but misses the other 18. Additionally, V2 gets 15 questions right that V1 missed. This implies that the two versions disagree on 33 questions out of 100.
For the values of 36 and 33:

\(\text{Actual_Disagreement} = 33\)
\(\text{Minimum_Disagreement} = |33 - 36| = 3\)
\(\text{Maximum_Disagreement} = \min(36 + 33, 200 - 36 - 33) = \min(69, 131) = 69\)
The Weidman Swap Score is:
\[ WSS = \frac{33 - 3}{69 - 3} = \frac{30}{66} = 0.455 \]
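Both worked examples can be checked against the counts-only sketch from earlier:

wss_from_counts(100, 85, 84, 7)   # 0.2
wss_from_counts(100, 36, 33, 33)  # 0.4545...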
WSS is most relevant when the stability of a new model version relative to a baseline model matters. In the majority of comparisons between two models or systems, stability is not of primary, and possibly not even of secondary, importance: if a new architecture delivers 80% accuracy over the prior version's 75%, you may not care much about swaps. (WSS could still be informative; a high WSS might indicate that the newer model has mastered different aspects of whatever the benchmark measures than the older model did, but it is less likely to "de-risk" the new model in the same way.) By contrast, in comparisons where stability does matter, I expect WSS to be an important sanity check: an accuracy drop from 85% to 84% may be nothing to worry about on its own, but if WSS is also close to 1, the quantization or distillation may have introduced enough instability that the engineer chooses to experiment with other methods before shipping.
For a given accuracy decrease of a model relative to a baseline (due to quantization or distillation, for example), a lower WSS is preferable: it suggests that users of the model will experience more stability and need to do less work to modify workflows that depend on the model behaving as they expect. As people use LLMs for more and more aspects of their personal and professional lives, and as more and more agentic and even non-agentic software systems are built with LLMs as an important component, I expect stability between model versions to become increasingly important¹ (though I expect absolute performance to remain paramount). Researchers, and the companies that employ them, should therefore get in the habit of using this score, or developing their own methods, to check model stability. We may even see scenarios where organizations decide users would prefer a model with a slightly larger drop in accuracy but a much lower WSS; I leave those judgment calls to the researchers, product managers, engineers, and executives who have to make them.
In the short-to-medium term, I look forward to seeing what reporting of WSS uncovers about how stable or unstable existing quantization and distillation methods are, and I hope it enables the industry to make more informed tradeoffs between absolute performance and consistency between model versions going forward.
Below is Python code that takes in two aligned arrays, of booleans or of 1s and 0s, indicating whether each of two model versions got each question in a dataset correct, and returns the relevant metrics, including the Weidman Swap Score:
def swap_metrics(correct_v1, correct_v2):
    """Given two aligned arrays of booleans (or 1s and 0s) indicating whether
    each model version answered each question correctly, return the swap
    counts, disagreement bounds, and the Weidman Swap Score (WSS)."""
    assert len(correct_v1) == len(correct_v2)
    N = len(correct_v1)
    M1 = sum(correct_v1)
    M2 = sum(correct_v2)
    # Questions V1 got right but V2 got wrong, and vice versa.
    swap_outs = sum(c1 and (not c2) for c1, c2 in zip(correct_v1, correct_v2))
    swap_ins = sum((not c1) and c2 for c1, c2 in zip(correct_v1, correct_v2))
    dis = swap_outs + swap_ins
    min_dis = abs(M2 - M1)
    max_dis = min(M1 + M2, 2 * N - M1 - M2)  # 2N - M1 - M2 = (N - M1) + (N - M2)
    # Handle the (rare) degenerate case where min_dis == max_dis.
    if max_dis == min_dis:
        wss = 0.0
    else:
        wss = (dis - min_dis) / (max_dis - min_dis)
    return {
        "N": N,
        "M1": M1,
        "M2": M2,
        "swap_outs": swap_outs,
        "swap_ins": swap_ins,
        "disagreement": dis,
        "min_dis": min_dis,
        "max_dis": max_dis,
        "wss": wss,
    }
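As a usage sketch, here are hypothetical arrays matching the first worked example (81 questions both versions get right, 4 swap-outs, 3 swap-ins, and 12 questions both versions miss):

v1 = [1] * 81 + [1] * 4 + [0] * 3 + [0] * 12
v2 = [1] * 81 + [0] * 4 + [1] * 3 + [0] * 12

metrics = swap_metrics(v1, v2)
print(metrics["disagreement"], metrics["min_dis"], metrics["max_dis"], metrics["wss"])
# Prints "7 1 31 0.2"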
¹ This conclusion is influenced by my time at SentiLink, where we built fraud models used by banks and lenders. We cared a lot about stability from model version to model version because our customers did. Stability is very important in regulated and/or "infrastructural" industries, and financial services is both; you wouldn't want to wake up one day and see that your FICO score had changed by a lot because FICO moved to a slightly-more-accurate model that nevertheless resulted in a lot of swaps. I came up with the idea for this specific score after leaving SentiLink and never saw this specific metric computed while I was there (while we cared about model stability, we did not quantize or distill models as is commonly done in more "pure AI" settings).