Second in my series of posts about random, mostly technical topics that strike my fancy.
Calibration metrics are crucial, but deceptively tricky. I've now encountered this problem in two different jobs, and been burnt by the naive solution below, so I've come to believe this question is extremely common. In 2022, some colleagues and I turned it into a paper (my coauthors deserve much more credit than I do).
Calibration
Consider this common task: we have a binary outcome variable ($y \in \{0,1\}$) that we want to model. We use a logistic regression (or really any “classifier”) to predict it as $p = \hat{f}(x)$. Any model that generates a prediction of the probability that $y = 1$ will do.
An important desirable trait is that the model is calibrated: whenever we predict a probability $p$, the average $y$ value among those predictions will actually be $p$. Mathematically, $p = E[y \mid x]$.
Your team, like mine, will probably attempt to measure this in the following way.
First, let’s create a calibration plot:
- Any single $y_i$ will only be 0 or 1, so calibration requires averaging across many points.
- So we bin the $p_i$ into, let's say, 10 bins, and calculate the average observed $\bar{y}_b$ in each bin.
- We plot $\bar{y}_b$ versus $\bar{p}_b$. (We need to average $p_i$ too since they will vary within a bin, but they’ll usually be close, obviously.)
```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

np.random.seed(215)

def simulate_calibrated_data(n_obs: int):
    p = np.random.rand(n_obs)
    y = (np.random.rand(n_obs) < p).astype(int)
    return pd.DataFrame.from_dict({"p": p, "y": y})

def calibration_bins(df):
    bins = np.linspace(0, 1, 11)
    df["bin"] = pd.cut(df.p, bins=bins)
    return df.groupby("bin", observed=False).apply(
        lambda df: pd.Series(
            {
                "p": df.p.mean(),
                "y": df.y.mean(),
                "cnt": df.y.shape[0],
            }
        ),
    ).reset_index()

df0 = simulate_calibrated_data(1000)
df0_binned = calibration_bins(df0)

plt.scatter(df0_binned.p, df0_binned.y)
plt.plot((0, 1), (0, 1), c="grey")
plt.xlabel("p")
plt.ylabel("y")
plt.title("Binned Calibration (Calibrated Data)")
plt.show()
```
The calibration above looks basically good (the points are close to the $y=p$ line). There’s some noise; we are using finite data, after all.
To get a better view, we can zoom in by plotting the residual $\bar{y}_b - \bar{p}_b$ instead:
```python
plt.scatter(df0_binned.p, df0_binned.y - df0_binned.p)
plt.axhline(0, color="grey")
plt.xlabel("p")
plt.ylabel("y - p")
plt.title("Binned Residuals (Calibrated Data)")
plt.show()
```
This plot is often good enough; it helps us understand the size of the errors (really, an upper bound, since the plot includes noise) and where they might occur. In this example, the binned residuals reach about 0.05. Is that big (evidence of a miscalibrated $p$)? Or small (just sampling noise)?
Once our team gets used to these plots, we’ll want to scale it. We might have hundreds of such models, and need a summary metric to tell us how bad each model’s line is.
The obvious choice, which once sucked up months of my life, is to take the absolute value of this residual within each bin, and average them over all $B$ bins. We called it “Empirical Calibration Error”. (You could also square residuals, or take the maximum. All of these approaches have the same issue.)
$$ ECE = \frac{1}{B}\sum_{b=1}^{B} |\bar{y}_b - \bar{p}_b| $$

At first, this seems great. It's a simple number that intuitively captures how far the residuals are from 0.
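This formula translates directly into code. A minimal sketch, reusing the binned frame produced by `calibration_bins` above (the helper name `ece` is mine, not from the post's codebase):

```python
import pandas as pd

def ece(df_binned: pd.DataFrame) -> float:
    # Mean absolute binned residual: average of |y_bar_b - p_bar_b| over bins.
    return float((df_binned.y - df_binned.p).abs().mean())
```

Called as `ece(df0_binned)`, it gives a single summary number per model, which is exactly what makes it so tempting to scale.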
But once you look at enough models, you realize that its size is driven almost entirely by the amount of data you have. Look at the plot above: even for a perfectly calibrated model, sampling variance scatters the residuals around zero, and taking absolute values turns every one of those fluctuations into positive error.
Formally:
Suppose we have a perfectly calibrated model, and enough data in each bin that the residual $\bar{r}_b = \bar{y}_b - \bar{p}_b$ satisfies the Central Limit Theorem. Then $\bar{r}_b \sim N(0, \bar{p}_b(1 - \bar{p}_b)/N_b)$.*
(* This variance isn't exactly correct if the bin is wide enough that $p_b$ is very different for different points in the bin, but it is good enough here.)
But the absolute value of $\bar{r}_b$ will now have a Half-Normal distribution, which has non-zero mean $\sqrt{2\bar{p}_b(1 - \bar{p}_b)/\pi N_b}$. The sampling noise of the residual turns into upward bias in our metric. In our paper we called this the "Noise Floor".
It gets worse. With only one bin, you could just look up the Half-Normal distribution and correct for it. But our $ECE$ is not the $|\bar{r}_b|$ of a single bin; it is the average of these $B$ Half-Normal absolute residuals across all bins, and that average's distribution is pathological.
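The noise floor is easy to see in simulation. A quick sketch (mine, not from the paper; `ece_for_n` is a hypothetical helper): compute the binned ECE of perfectly calibrated data at increasing sample sizes. The model is never wrong, yet the "error" is positive, shrinking roughly like $1/\sqrt{N}$:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def ece_for_n(n_obs: int, n_bins: int = 10) -> float:
    # Perfectly calibrated data: y ~ Bernoulli(p) with p uniform on [0, 1).
    p = rng.random(n_obs)
    y = (rng.random(n_obs) < p).astype(int)
    df = pd.DataFrame({"p": p, "y": y})
    df["bin"] = pd.cut(df.p, bins=np.linspace(0, 1, n_bins + 1))
    binned = df.groupby("bin", observed=False).agg(p=("p", "mean"), y=("y", "mean"))
    # The "ECE" of a model that is never wrong -- pure noise floor.
    return float((binned.y - binned.p).abs().mean())

for n in (1_000, 10_000, 100_000):
    print(f"N={n:>7}: noise-floor ECE ~ {ece_for_n(n):.4f}")
```

A 100x increase in data shrinks the metric by roughly 10x, with no change to the (perfect) model. Any threshold you set on ECE is really a threshold on sample size.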
Kuiper metrics
If the intuitive ECE doesn’t work, what should we do instead? Let’s consider a slightly different approach. Looking at the binned residual plot above, there’s more information than treating each bin as independent. If we saw four bins in a row with positive residuals, that’s more compelling evidence of miscalibration than four bins with alternating positive and negative residuals of the same size. Miscalibration happens over a range, and consecutive bins with residuals of the same sign suggest a large range of miscalibration.
We can formalize this by taking the cumulative sum of the residuals. If there is a range where the residuals are all positive, this sum will increase; if there's a range where they are all negative, it will decrease. In a well-calibrated model, the residuals will tend to cancel out, and the cumulative sum will stay around zero.
Let’s generate some data from a miscalibrated model (in the real world we don’t know if the model is miscalibrated, but in simulations we can play God).
```python
def simulate_uncalibrated_data(n_obs: int):
    p = np.random.rand(n_obs)
    # The true probability p_cal is pulled toward 0.5, so the model's
    # predictions p are overconfident at the extremes.
    p_cal = 0.5 + (p - 0.5)**3 / (0.5**2)
    y = (np.random.rand(n_obs) < p_cal).astype(int)
    return pd.DataFrame.from_dict({"p": p, "y": y})
```
In the plots below, notice that we get consecutive bins all with positive residuals, then all with negative residuals. Miscalibration at work.
```python
df1 = simulate_uncalibrated_data(1000)
df1_binned = calibration_bins(df1)

plt.scatter(df1_binned.p, df1_binned.y)
plt.plot((0, 1), (0, 1), c="grey")
plt.title("Binned Calibration Plot (Miscalibrated Data)")
plt.xlabel("p")
plt.ylabel("y")
plt.show()
```
```python
plt.scatter(df1_binned.p, df1_binned.y - df1_binned.p)
plt.hlines(y=0, xmin=0, xmax=1, colors="grey")
plt.title("Binned Residual Plot (Miscalibrated Data)")
plt.xlabel("p")
plt.ylabel("y - p")
plt.show()
```
And here is the cumulative sum of the residuals. It increases in the range where the residuals are all positive, then decreases where they are all negative.
```python
def kuiper_cumdiff(p, y):
    # Sort by predicted probability, then accumulate signed residuals.
    order = np.argsort(p)
    n = p.shape[0]
    p = np.array(p)[order]
    y = np.array(y)[order]
    cum = np.cumsum(y - p) / n
    return p, y, cum

def kuiper_range(p, y):
    # Max minus min of the cumulative residual, plus where each occurs.
    p, y, cum = kuiper_cumdiff(p, y)
    return max(cum) - min(cum), p[np.argmax(cum)], p[np.argmin(cum)]

p, y, cum = kuiper_cumdiff(df1.p, df1.y)
print(f"{kuiper_range(df1.p, df1.y)=}")
plt.plot(p, cum)
plt.xlabel("p")
plt.ylabel("cumulative residual")
plt.title("Kuiper Cumulative Residual (Miscalibrated Data)")
plt.show()
```
kuiper_range(df1.p, df1.y)=(0.06795538765722418, 0.4827660574210261, 0.9459057791135457)
This plot is much less intuitive than our binned residuals, but provides a great metric of “goodness”. Locate the range between its highest value and lowest value. This range is maximally miscalibrated; the residuals in between are either disproportionately positive, driving the sum up, or disproportionately negative, driving it down. In our case, the maximum range is from $p=0.48$ to $p=0.95$, and the difference between its maximum and minimum is 0.068.
(Notice that since we’re dealing with cumulative sums, we don’t even need bins anymore! We do this point by point, sorting our entire dataset by $p$.)
For reference, here is that same cumulative residual plot for our calibrated data:
```python
p, y, cum = kuiper_cumdiff(df0.p, df0.y)
print(f"{kuiper_range(df0.p, df0.y)=}")
plt.plot(p, cum)
plt.xlabel("p")
plt.ylabel("cumulative residual")
plt.title("Kuiper Cumulative Residual (Calibrated Data)")
plt.show()
```
kuiper_range(df0.p, df0.y)=(0.012436758579207228, 0.2632094311061657, 0.635074514126753)
The maximum minus minimum is just 0.012.
You might think that the difference between these plots is obvious, but let’s develop a statistical test.
Here’s where we get lucky: it turns out that this difference between the cumsum’s maximum and minimum was deeply studied in the 1960s, for an entirely different purpose.
Mathematician Nicolaas Kuiper came across this exact problem when trying to test whether a data sample came from a given CDF. (Kolmogorov and Smirnov developed a more famous, slightly different statistic in the 1930s, too.) Kuiper did all of our work for us: he showed that under the null hypothesis that the residuals are centered around zero, our cumulative sum can be analyzed using a "Brownian Bridge" distribution*. You don't need to worry about exactly what that is; just be assured that we have formulas.
That’s enough to calculate a p-value. It turns out that this metric is highly powered: it will detect miscalibration effectively. And it avoids the ECE’s noise floor because it is a cumulative sum of signed residuals, which tend to cancel out under the null.
(* To get exactly a Brownian Bridge, we also need to remove any linear trend in the residuals, forcing the overall sum to be zero. This is like conditioning on the overall $E[y] = \bar{p}$, so that we are only worried about regions of the curve having errors.)
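Concretely, the de-trending in that footnote amounts to subtracting the straight line through the cumulative sum's endpoints, so the path starts and ends at zero. A sketch under that reading (the helper name `detrended_bridge` is mine):

```python
import numpy as np

def detrended_bridge(p, y):
    # Cumulative residual, minus the straight line through its endpoints,
    # so the path starts and ends at zero (the "bridge" condition).
    order = np.argsort(p)
    n = len(p)
    cum = np.cumsum(np.asarray(y)[order] - np.asarray(p)[order]) / n
    t = np.arange(1, n + 1) / n   # position along the sorted sample
    return cum - t * cum[-1]      # forces the endpoint back to zero
```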
First, though, let’s do a placebo test.
We want to know if our sample data could have come from a calibrated model. One easy way is to simulate its range when the data is calibrated:
- Take the predicted probabilities as given.
- Generate new $y$ data from them that is calibrated.
- Calculate the Kuiper range.
- Repeat this 1,000 times, and see how our real number compares to these 1,000.
```python
def simulate_placebo(p):
    # Re-draw y from the model's own predictions, so p is calibrated
    # by construction.
    y_sim = (np.random.rand(p.shape[0]) < p).astype(int)
    return kuiper_cumdiff(p, y_sim)
```
Here are the range plots for four simulated placebo sets.
```python
for i in range(4):
    p2, y2, c2 = simulate_placebo(df1.p)
    print(f"{kuiper_range(p2, y2)[0]=}")
    plt.plot(p2, c2)
    plt.axhline(0, color="grey")
    plt.title(f"Kuiper Cumulative Residual (Calibrated Placebo Sample {i})")
    plt.show()
```
kuiper_range(p2, y2)[0]=0.02326980681095479
kuiper_range(p2, y2)[0]=0.029304565644689338
kuiper_range(p2, y2)[0]=0.015385675397962914
kuiper_range(p2, y2)[0]=0.01443003709368676
None of the ranges are close to our 0.068. Not looking good for the null. Let's do it 1,000 times.
```python
nsim = 1000
stats = []
for _ in range(nsim):
    pi, yi, _ci = simulate_placebo(df1.p)
    stats.append(kuiper_range(pi, yi)[0])
```
```python
print(f"{max(stats)=}")
plt.hist(stats)
plt.vlines(x=kuiper_range(df1.p, df1.y)[0], ymin=0, ymax=nsim / 3, colors="red")
plt.show()
```
max(stats)=0.04858836706533195
Our observed value is way above the range of our placebo tests. We can confidently reject the null that the distribution is calibrated.
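The eyeball comparison can be turned into a number: the Monte Carlo p-value is the fraction of placebo draws at least as extreme as the observed statistic. A sketch (the helper name `placebo_p_value` is mine; the +1 correction is a standard convention that keeps the estimate from being exactly zero):

```python
import numpy as np

def placebo_p_value(observed: float, placebo_stats) -> float:
    # Fraction of placebo draws at least as extreme as the observed
    # statistic; the +1 in numerator and denominator keeps the Monte
    # Carlo estimate from ever being exactly zero.
    placebo_stats = np.asarray(placebo_stats)
    return float((1 + np.sum(placebo_stats >= observed)) / (1 + len(placebo_stats)))
```

With the `stats` list above and our observed Kuiper range of 0.068, no placebo draw is as large, so this would come out to $1/1001 \approx 0.001$.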
The mapie library
In the time since we wrote our paper, Cordier et al. have implemented a package, mapie, that does this for you! Here's the Kuiper statistic. (It doesn't match our "Kuiper Range" because Kuiper's proper statistic requires de-trending, as mentioned above, and scaling by $\frac{1}{N}\sqrt{\sum p_i (1 - p_i)}$.)
```python
from mapie.metrics.calibration import kuiper_statistic, kuiper_p_value

f"{kuiper_statistic(df1.y, df1.p)=}"
```
'kuiper_statistic(df1.y, df1.p)=5.283848188729132'
```python
f"{kuiper_p_value(df1.y, df1.p)=}"
```
'kuiper_p_value(df1.y, df1.p)=5.05992391319765e-07'
The mapie p-value agrees: we can reject the null that our distribution is calibrated.
Notice that, happily, it fails to reject the null for the calibrated data we generated:
```python
f"{kuiper_p_value(df0.y, df0.p)=}"
```
'kuiper_p_value(df0.y, df0.p)=0.954826452774466'