Measuring (Mis)Calibration

Second in my series of posts about random, mostly technical topics that strike my fancy.

Calibration metrics are crucial, but deceptively tricky. I’ve now encountered this problem in two different jobs, and been burnt by the naive solution below, so I’ve come to believe this question is extremely common. In 2022, some colleagues and I turned it into a paper (my coauthors deserve much more credit than I do).

Calibration

Consider this common task: we have a binary outcome variable ($y \in \{0,1\}$) that we want to model. We use a logistic regression (or really any “classifier”) to predict it as $p = \hat{f}(x)$. Any model that generates a prediction of the probability that $y = 1$ will do.

An important desirable trait is that the model is calibrated: when we predict a probability $p$, the average $y$ value will actually be $p$. Mathematically, $p = E[y | x]$.

Your team, like mine, will probably attempt to measure this in the following way.

First, let’s create a calibration plot:

  • Any single $y_i$ will only be 0 or 1, so calibration requires averaging across many points.
  • So we bin the $p_i$ into, let’s say, 10 bins, and calculate the average observed $\bar{y}_b$ in each bin.
  • We plot $\bar{y}_b$ versus $\bar{p}_b$. (We need to average $p_i$ too since they will vary within a bin, but they’ll usually be close, obviously.)
View code
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

np.random.seed(215)

def simulate_calibrated_data(n_obs: int):
    p = np.random.rand(n_obs)
    y = (np.random.rand(n_obs) < p).astype(int)
    return pd.DataFrame.from_dict({"p":p, "y":y})

def calibration_bins(df):
    bins = np.linspace(0, 1, 11)
    df["bin"] = pd.cut(df.p, bins=bins)
    return df.groupby("bin", observed=False).apply(
        lambda df: pd.Series(
            {
                "p": df.p.mean(),
                "y": df.y.mean(),
                "cnt": df.y.shape[0],
            }
        ),
    ).reset_index()

df0 = simulate_calibrated_data(1000)
df0_binned = calibration_bins(df0)

plt.scatter(df0_binned.p, df0_binned.y)
plt.plot((0,1), (0,1), c="grey")
plt.xlabel("p")
plt.ylabel("y")
plt.title("Binned Calibration (Calibrated Data)")
plt.show()

The calibration above looks basically good (the points are close to the $y=p$ line). There’s some noise; we are using finite data, after all.

To get a better view, we can zoom in by plotting the residual $\bar{y}_b - \bar{p}_b$ instead:

View code
plt.scatter(df0_binned.p, df0_binned.y - df0_binned.p)
plt.xlabel("p")
plt.ylabel("y - p")
plt.axhline(0, c="grey")
plt.title("Binned Residuals (Calibrated Data)")
plt.show()

This plot is often good enough; it helps us understand the size of the errors (really, an upper bound, since this includes noise) and where they might be. For this example, the binned residuals reach a size of about 0.05. Is that big (and evidence of a miscalibrated $p$)? Is that small (and just due to sample noise)?

Once our team gets used to these plots, we’ll want to scale this up. We might have hundreds of such models, and we’ll need a summary metric to tell us how bad each model’s calibration is.

The obvious choice, which once sucked up months of my life, is to take the absolute value of this residual within each bin, and average them over all $B$ bins. We called it “Empirical Calibration Error”. (You could also square residuals, or take the maximum. All of these approaches have the same issue.)

$$ ECE = \frac{1}{B}\sum_{b=1}^{B} |\bar{y}_b - \bar{p}_b| $$
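In code, the metric is a one-liner against the output of `calibration_bins` above (a minimal sketch; `empirical_calibration_error` is my name for it, not the paper’s API):

```python
import numpy as np

def empirical_calibration_error(p_bar, y_bar):
    # Average absolute gap between binned mean prediction and mean outcome
    p_bar = np.asarray(p_bar, dtype=float)
    y_bar = np.asarray(y_bar, dtype=float)
    return np.mean(np.abs(y_bar - p_bar))
```

For our binned frame, this would be `empirical_calibration_error(df0_binned.p, df0_binned.y)`.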

At first, this seems great. It’s a simple number that intuitively captures how far the residuals are from 0.

But once you look at enough models, you realize that its size is driven almost entirely by the amount of data you have. Look at the plot above: sampling variance makes the residuals scatter around zero, and that scatter turns into positive absolute error.

Formally:

Suppose we have a perfectly calibrated model, and enough data in each bin that the residual $\bar{r}_b = \bar{y}_b - \bar{p}_b$ satisfies the Central Limit Theorem. Then $\bar{r}_b \sim N(0, \bar{p}_b (1 - \bar{p}_b) / N_b)$.*

(* This variance isn’t exactly correct if the bin is wide enough that $p_b$ varies a lot within the bin, but it’s good enough here.)

But the absolute value of $\bar{r}_b$ will now have a Half Normal distribution, which has non-zero mean $\sqrt{2\bar{p}_b(1 - \bar{p}_b) / (\pi N_b)}$. The sampling noise of the residual turns into upward bias in our metric. In our paper we called this the “Noise Floor”.
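To see the Noise Floor concretely, here’s a sketch that computes the Half Normal mean for a single bin (the function name is mine). With roughly 100 points per bin and $\bar{p}_b$ near 0.5, the floor is already about 0.04, the same scale as the residuals we saw in the calibrated plot above:

```python
import numpy as np

def noise_floor(p_bar, n_b):
    # Expected |y_bar - p_bar| in a bin of size n_b under perfect
    # calibration: the mean of a Half Normal whose underlying variance
    # is p_bar * (1 - p_bar) / n_b
    return np.sqrt(2 * p_bar * (1 - p_bar) / (np.pi * n_b))
```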

It gets worse. With only one bin, you could just look up the Half Normal distribution and use it. But our $ECE$ metric is not the $|\bar{r}_b|$ for a single bin, but the average of these $B$ Half Normal absolute residuals across all bins. The distribution is pathological.

Kuiper metrics

If the intuitive ECE doesn’t work, what should we do instead? Let’s consider a slightly different approach. Looking at the binned residual plot above, there’s more information than treating each bin as independent. If we saw four bins in a row with positive residuals, that’s more compelling evidence of miscalibration than four bins with alternating positive and negative residuals of the same size. Miscalibration happens over a range, and consecutive bins with residuals of the same sign suggest a large range of miscalibration.

We can formalize this by taking the cumulative sum of the residuals. If there is a range where the residuals are all positive, this sum will increase; if there’s a range where they are negative, it will decrease. In a well-calibrated model, the residuals will tend to cancel out, and the cumulative sum stays around zero.

Let’s generate some data from a miscalibrated model (in the real world we don’t know if the model is miscalibrated, but in simulations we can play God).

View code
def simulate_uncalibrated_data(n_obs: int):
    p = np.random.rand(n_obs)
    p_cal = 0.5 + (p - 0.5)**3 / (0.5**2)
    y = (np.random.rand(n_obs) < p_cal).astype(int)
    return pd.DataFrame.from_dict({"p":p, "y":y})

View code
df1 = simulate_uncalibrated_data(1000)
df1_binned = calibration_bins(df1)

plt.scatter(df1_binned.p, df1_binned.y)
plt.plot((0,1), (0,1), c="grey")
plt.title("Binned Calibration Plot (Miscalibrated Data)")
plt.xlabel("p")
plt.ylabel("y")
plt.show()
View code
plt.scatter(df1_binned.p, df1_binned.y - df1_binned.p)
plt.hlines(y=0, xmin=0, xmax=1, colors="grey")
plt.title("Binned Residual Plot (Miscalibrated Data)")
plt.xlabel("p")
plt.ylabel("y - p")
plt.show()

Notice that we have consecutive bins all with positive residuals, then all with negative residuals. Miscalibration at work.

And here is the cumulative sum of the residuals. It increases in the range where the residuals are all positive, then decreases where they are all negative.

View code
def kuiper_cumdiff(p, y):
    order = np.argsort(p)
    n = p.shape[0]
    p = np.array(p)[order]
    y = np.array(y)[order]
    cum = np.cumsum(y - p) / n
    return p, y, cum

def kuiper_range(p, y):
    p, y, cum = kuiper_cumdiff(p, y)
    return max(cum) - min(cum), p[np.argmax(cum)], p[np.argmin(cum)]

p, y, cum = kuiper_cumdiff(df1.p, df1.y)
print(f"{kuiper_range(df1.p, df1.y)=}")
plt.plot(p, cum)
plt.xlabel("p")
plt.ylabel("cumulative residual")
plt.title("Kuiper Cumulative Residual (Miscalibrated Data)")
plt.show()
kuiper_range(df1.p, df1.y)=(0.06795538765722418, 0.4827660574210261, 0.9459057791135457)

This plot is much less intuitive than our binned residuals, but provides a great metric of “goodness”. Locate the range between its highest value and lowest value. This range is maximally miscalibrated; the residuals in between are either disproportionately positive, driving the sum up, or disproportionately negative, driving it down. In our case, the maximum range is from $p=0.48$ to $p=0.95$, and the difference between its maximum and minimum is 0.068.

(Notice that since we’re dealing with cumulative sums, we don’t even need bins anymore! We do this point by point, sorting our entire dataset by $p$.)

For reference, here is that same cumulative residual plot for our calibrated data:

View code
p, y, cum = kuiper_cumdiff(df0.p, df0.y)
print(f"{kuiper_range(df0.p, df0.y)=}")
plt.plot(p, cum)
plt.xlabel("p")
plt.ylabel("cumulative residual")
plt.title("Kuiper Cumulative Residual (Calibrated Data)")
plt.show()
kuiper_range(df0.p, df0.y)=(0.012436758579207228, 0.2632094311061657, 0.635074514126753)

The maximum minus minimum is just 0.012.

You might think that the difference between these plots is obvious, but let’s develop a statistical test.

Here’s where we get lucky: it turns out that this difference between the cumsum’s maximum and minimum was deeply studied in the 1960s, for an entirely different purpose.

Mathematician Nicolaas Kuiper came across this exact problem when trying to test whether a data sample came from a given CDF. (Kolmogorov and Smirnov developed a more famous, slightly different metric back in the 1930s.) Kuiper did all of our work for us: he showed that under the null hypothesis that the residuals are centered around zero, our cumulative sum can be analyzed using a “Brownian Bridge” distribution*. You don’t need to worry about exactly what that is; just be assured that we have formulas.

That’s enough to calculate a p-value. It turns out that this metric is highly powered: it’ll find miscalibration effectively. It avoids the ECE’s noise floor because it’s a cumulative sum of signed residuals, which tend to cancel out under the null.

(* To get exactly a Brownian Bridge, we also need to remove any linear trend in the residuals, forcing the overall sum to be zero. This is like saying we condition on the overall $E[y] = \bar{p}$, and we are only worried that regions of the curve may have errors.)
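That de-trending is easy to write down: subtract the straight line connecting the cumulative sum’s start to its end, so the path is pinned to zero at both ends, the Brownian Bridge shape (a sketch; `detrended_cumsum` is my name for it):

```python
import numpy as np

def detrended_cumsum(p, y):
    # Cumulative residual, sorted by p, with the linear trend removed so
    # the path ends at zero: the Brownian Bridge shape
    order = np.argsort(p)
    resid = np.asarray(y, dtype=float)[order] - np.asarray(p, dtype=float)[order]
    cum = np.cumsum(resid) / len(resid)
    t = np.arange(1, len(resid) + 1) / len(resid)
    return cum - t * cum[-1]
```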

First, though, let’s do a placebo test.

We want to know if our sample data could have come from a calibrated model. One easy way is to simulate its range when the data is calibrated:

  • Take the predicted probabilities as given.
  • Generate new $y$ data from them that is calibrated.
  • Calculate the Kuiper range.
  • Repeat this 1,000 times, and see how our real number compares to these 1,000.
View code
def simulate_placebo(p):
    y_sim = (np.random.rand(p.shape[0]) < p).astype(int)
    return kuiper_cumdiff(p, y_sim)

Here are the range plots for four simulated placebo sets.

View code
for i in range(4):
    p2, y2, c2 = simulate_placebo(df1.p)
    print(f"{kuiper_range(p2, y2)[0]=}")
    plt.plot(p2, c2)
    plt.axhline(0, color="grey")
    plt.title(f"Kuiper Cumulative Residual (Calibrated Placebo Sample {i})")
    plt.show()
kuiper_range(p2, y2)[0]=0.02326980681095479
kuiper_range(p2, y2)[0]=0.029304565644689338
kuiper_range(p2, y2)[0]=0.015385675397962914
kuiper_range(p2, y2)[0]=0.01443003709368676

None of the ranges are close to our 0.068. Not looking good. Let’s do it 1,000 times.

View code
nsim = 1000
stats = []
for _ in range(nsim):
    pi, yi, _ci = simulate_placebo(df1.p)
    stats.append(kuiper_range(pi, yi)[0])
View code
print(f"{max(stats)=}")
plt.hist(stats)
plt.vlines(x=kuiper_range(df1.p, df1.y)[0], ymin=0, ymax=nsim/3, colors="red")
plt.show()
max(stats)=0.04858836706533195

Our observed value is way above the range of our placebo tests. We can confidently reject the null that the distribution is calibrated.

The mapie library

In the time since we wrote our paper, Cordier et al. have implemented a package, mapie, that does this for you! Here’s the Kuiper statistic (it doesn’t match our “Kuiper Range” because to get Kuiper’s proper statistic you need to de-trend, as mentioned above, and scale by $\frac{1}{N}\sqrt{\sum_i p_i (1 - p_i)}$).
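As a rough check on that relationship, we can apply the scaling to our own cumulative-sum range (a sketch; the function name is mine, and it skips the de-trending step, so it only approximates the proper statistic):

```python
import numpy as np

def approx_kuiper_statistic(p, y):
    # Kuiper range of the cumulative residual, divided by the scale
    # (1/N) * sqrt(sum p_i (1 - p_i)); no de-trending, so approximate
    order = np.argsort(p)
    p = np.asarray(p, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    n = len(p)
    cum = np.cumsum(y - p) / n
    kuiper_range = cum.max() - cum.min()
    return kuiper_range / (np.sqrt(np.sum(p * (1 - p))) / n)
```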

View code
from mapie.metrics.calibration import kuiper_statistic, kuiper_p_value
f"{kuiper_statistic(df1.y, df1.p)=}"
'kuiper_statistic(df1.y, df1.p)=5.283848188729132'
View code
f"{kuiper_p_value(df1.y, df1.p)=}"
'kuiper_p_value(df1.y, df1.p)=5.05992391319765e-07'

The mapie p-value agrees: we can reject the null that our distribution is calibrated.

Notice that, happily, it fails to reject the null for the calibrated data we generated:

View code
f"{kuiper_p_value(df0.y, df0.p)=}"
'kuiper_p_value(df0.y, df0.p)=0.954826452774466'

Gaussian Processes are just Multivariate Normals

… or this Wikipedia section is all you need to know.

I’m trying something different today. There are a handful of topics I’ve wanted to practice writing about. And I do have a blog. So this begins a series of posts about non-political, probably-technical topics I’ve wanted to write up.

Gaussian Processes are a robust, flexible way to fit models to data. But it’s hard to figure out what they are! Tutorials almost all begin with a definition of a random process (a probability measure over functions…?), and you immediately get lost in the weeds.

It took me several years, and a few great textbooks, to build intuition for what a Gaussian Process actually is, and I think there’s a much better way to introduce it, at the expense of some of the rigor.

Let’s discuss GPs the way I imagine they were actually discovered.

Fitting curvy lines through points

Suppose we have some data, defined as pairs of $x$ and $y$. $x$ can be a vector. (So can $y$, but we’ll pretend it’s scalar in this post.) In the example below, the data is missing for $x=7$, and that’s the point we want to predict.

View code
import numpy as np
from matplotlib import pyplot as plt
from collections.abc import Callable

np.random.seed(215)
x = np.array([0, 1, 2, 3, 4, 5, 6, 8, 9])
y = np.sin(x)
plt.scatter(x, y)
plt.vlines(x=7, ymin=-1, ymax=1, colors="lightblue", linestyles="dashed")
plt.xlabel("X")
plt.ylabel("Y")
pass

Experienced statistical modelers will have a number of techniques for modeling this curve and predicting missing data. They include:

  • Regression using powers of $x$: $y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + …$
  • Loess Regression
  • Splines

To invent Gaussian Processes, instead, let’s start from first principles.

  • We can view this entire set of nine datapoints as one nine-dimensional random vector.
  • The points that are closer to each other on the $x$-axis will have more similar $y$-values.

This leads to the key insight for a GP.

A simple GP

Let’s model the data as a giant multivariate normal random variable with a covariance matrix where things that are closer in $x$ have more positive covariance:

$$ \vec{y} \sim MVN(\mu=0, \Sigma(\vec{x})) $$

where I’ve added the vector notation to emphasize these are all the $y$s, instead of a single one. We’ll use a prior mean of 0. This doesn’t mean we think the data themselves are zero, but that before we ever saw the data we would expect them to be centered around zero. If you have a real prior about the values, you can add it, but it rarely matters in practice.

We need a covariance matrix $\Sigma$. Let’s make one up. We’ll create a kernel function, which takes in two $x$ values and spits out a covariance. We’ll arbitrarily use

$$ \Sigma[x_i, x_j] = \exp\{-((x_i - x_j) /2)^2\} $$

This is a squared exponential kernel (it’s often called a Gaussian Kernel, but I’ll use Gauss’s name to refer to only one thing in this post). Notice that on the diagonal the value will be $\exp\{0\} = 1$, and the covariance between two points decays to zero as the distance grows. According to this kernel, the covariance falls to half once the distance reaches about 1.66. Is that right? Who knows!
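To make the decay concrete, here is the kernel as a plain scalar function (same formula as above, with the denominator exposed as a bandwidth parameter; the function name is mine):

```python
import numpy as np

def squared_exponential(x_i, x_j, bandwidth=2.0):
    # exp{-((x_i - x_j) / bandwidth)^2}: 1 on the diagonal, decaying to 0
    return np.exp(-(((x_i - x_j) / bandwidth) ** 2))
```

With bandwidth 2, the covariance crosses 0.5 at a distance of $2\sqrt{\ln 2} \approx 1.66$.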

With this setup, it becomes easy to predict $y$ at $x=7$:

  1. Add the correct row and column to the covariance matrix for $x = 7$.
  2. Use the conditional normal equations to find the distribution of $y$.

To use the equations, let’s concatenate the data pairs $x_d, y_d$ with the value we want to predict: $y_p$ at $x_p$:

\begin{align*} y &= \begin{pmatrix} y_d \\ y_p \end{pmatrix} \\ x &= \begin{pmatrix} x_d \\ x_p \end{pmatrix} \end{align*}

to give \begin{align*} y &\sim MVN(\mu, \Sigma) \\ \mu &= \begin{pmatrix} \mu_d \\ \mu_p \end{pmatrix} = \begin{pmatrix} \vec{0} \\ 0 \end{pmatrix} \\ \Sigma &= \begin{pmatrix} \Sigma_{dd} & \Sigma_{dp} \\ \Sigma_{pd} & \Sigma_{pp} \end{pmatrix} \\ \end{align*}

The conditional normal equations tell us that conditional on $y_d$, $y_p$ is normally distributed with

$$ y_p | y_d \sim N(\mu_{post}, \Sigma_{post}) $$

where

\begin{align*} & \mu_{post} = \mu_p + \Sigma_{pd} \Sigma_{dd}^{-1} (y_d - \mu_d) \\ & \Sigma_{post} = \Sigma_{pp} - \Sigma_{pd} \Sigma_{dd}^{-1} \Sigma_{dp} \end{align*}

(Aside: I love these equations, and it’s worth spending ten minutes building intuition here. They just translate information from the observed dimensions to the desired dimension, where the scaling and amount of information translated depends on the covariance.)

And we’re ready! We just plug in the values from our example data:

View code
def kernel_square_exponential(x_0, x_1, bandwidth=2):
    return np.exp(-np.power((x_0[:, np.newaxis] - x_1[np.newaxis, :]) / bandwidth, 2))

def conditional_normal_mu_sigma(
    mu_d: np.ndarray,
    x_d: np.ndarray,
    y_d: np.ndarray,
    mu_p: np.ndarray,
    x_p: np.ndarray,
    kernel: Callable[[np.ndarray, np.ndarray], np.ndarray],
) -> tuple[np.ndarray, np.ndarray]:
    sigma_dd = kernel(x_d, x_d)
    sigma_pd = kernel(x_p, x_d)
    sigma_pp = kernel(x_p, x_p)
    sigma_dd_inv = np.linalg.inv(sigma_dd)
    
    mu_post = mu_p + sigma_pd @ sigma_dd_inv @ (y_d - mu_d)
    sigma_post = sigma_pp - sigma_pd @ sigma_dd_inv @ sigma_pd.T
    return mu_post, sigma_post

x_p = np.array([7])
mu_post, sigma_post = conditional_normal_mu_sigma(
    mu_d=np.zeros_like(x),
    x_d=x,
    y_d=y,
    mu_p=np.zeros_like(x_p),
    x_p=x_p,
    kernel=kernel_square_exponential,
)

print(f"{mu_post=}, {sigma_post=}")

plt.scatter(x, y)
plt.scatter(x_p, mu_post, c="r")
plt.vlines(
    x=7,
    # use a 95% posterior predictive interval
    ymin=mu_post - 1.96*np.sqrt(sigma_post),
    ymax=mu_post + 1.96*np.sqrt(sigma_post),
    colors="r",
)
plt.xlabel("X")
plt.ylabel("Y")
pass
mu_post=array([0.68350561]), sigma_post=array([[0.01330855]])

In fact, we could use this function to produce predictions at any $x$, giving us a full line of expected values.

View code
x_p = np.arange(-1,12,0.5)
x_p = x_p[~np.isin(x_p, x)]

mu_post, sigma_post = conditional_normal_mu_sigma(
    mu_d=np.zeros_like(x),
    x_d=x,
    y_d=y,
    x_p=x_p,
    mu_p=np.zeros_like(x_p),
    kernel=kernel_square_exponential,
)

plt.scatter(x, y)
plt.scatter(x_p, mu_post, c="r")
plt.vlines(
    x=x_p,
    ymin=mu_post - 1.96*np.sqrt(np.diag(sigma_post)),
    ymax=mu_post + 1.96*np.sqrt(np.diag(sigma_post)),
    colors="r",
)
plt.xlabel("X")
plt.ylabel("Y")
pass

Notice that, compellingly, the posterior uncertainty is much smaller for interpolated points within the range of our data than for extrapolated points outside of it. And as we get far away, the posterior distribution converges to our prior distribution, $N(0, 1)$.

That’s it! We’ve built a simple Gaussian Process. The key points are:

  • The data is modeled as a giant multivariate normal vector, together with the points to predict.
  • The covariance is parametrized based on some kernel function such that “closer” points in $x$ have more positive covariance.
  • The conditional normal equations do the rest.

Let’s discuss some finer points.

First, notice that we don’t have a parametrized line. Statisticians who are used to equations of the form $y= \alpha + \beta x$, or splines, or some other “parametrized” model where you have a functional form whose coefficients you are trying to estimate, will find this unsettling. Among other things, the model doesn’t inherently model increasing- or decreasingness. There’s no slope. Instead, if one $y$ is low and the next $y$ is high, the points in between are expected to be increasing just because of proximity. Predicting a new point requires using every observed datapoint every time, and handling the possibly-growing $\Sigma_{dd}$ and $\Sigma_{pd}$ matrices.

Second, relatedly, the only way $x$ matters is via the kernel function. This makes the method extend really nicely to multi-dimensional $x$, or even more complicated structures (maybe some elements in $x$ are categorical? and the kernel function is 1 if they match, 0 if they don’t?). As long as you can provide a kernel function that spits out a covariance between any two $\vec{x_i}, \vec{x_j}$, you’re in business.

Third, this method doesn’t work with repeated $x$ values. If you had two observations at $x=5$ with different $y$, the math would break: the kernel would say they are perfectly correlated, and $\Sigma_{dd}$ becomes singular. Implicitly, we’re assuming that $y$ is exactly observed at each $x$. But that’s easy to fix by allowing for noise in the observation of $y$, which just becomes an additional $\sigma_\epsilon^2$ added to the diagonal of $\Sigma_{dd}$.
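Here’s a minimal sketch of that fix: the same conditional-normal computation as before, but with an assumed observation-noise variance (`noise_var`, my naming) added to the diagonal of $\Sigma_{dd}$:

```python
import numpy as np

def conditional_normal_with_noise(mu_d, x_d, y_d, mu_p, x_p, kernel, noise_var=0.1):
    # Identical to conditional_normal_mu_sigma above, except for the
    # noise_var * I term, which keeps Sigma_dd invertible even when
    # x_d contains repeated values.
    sigma_dd = kernel(x_d, x_d) + noise_var * np.eye(len(x_d))
    sigma_pd = kernel(x_p, x_d)
    sigma_pp = kernel(x_p, x_p)
    sigma_dd_inv = np.linalg.inv(sigma_dd)

    mu_post = mu_p + sigma_pd @ sigma_dd_inv @ (y_d - mu_d)
    sigma_post = sigma_pp - sigma_pd @ sigma_dd_inv @ sigma_pd.T
    return mu_post, sigma_post
```

Now two observations at $x=5$ just average out, shrunk slightly toward the prior mean.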

Fourth, what’s up with this whole “Process” thing? That deserves its own section.

A “Process”?

What makes this a Process is that it’s defined over infinitely many $x$s. We only ever have to deal with a finite number of points (our data points plus the points we want to predict) and that’s what makes it tractable, but the methodology is capable of producing values at any $x$. Rigorously defining this random distribution over infinite-dimensional vectors (or “lines”) requires defining a stochastic process.

If you consider this methodology before you’ve seen any data, it defines a prior over types of lines, with the kernel function determining the wiggliness of the probable lines. The output of your GP will be a compromise of the data itself with the wiggliness the prior expects. Here are some plots of lines sampled from the data-less prior for different denominators (or “bandwidths”) in our kernel function:

View code
n_sim = 5

x_sim = np.arange(0, 10, 0.1)

for bandwidth in (0.5, 1, 2, 4):
    sigma_sim = kernel_square_exponential(x_sim, x_sim, bandwidth=bandwidth)
    y_sim = np.random.multivariate_normal(np.zeros_like(x_sim), sigma_sim, n_sim)

    plt.plot(x_sim, y_sim.T)
    plt.title(f"{bandwidth=}")
    plt.show()    

This inspires a totally different way to talk about the GP, which is the formal way they are usually taught. Suppose we write our infinitely many pairs of $y$, $x$ as $y = f(x)$. The values of $x$ are fixed but $y$ is random, which means $f$ itself is random! This random function is our “Process”. $f$ is a magical random function that is defined purely by the property that it makes the above methodology correct: for any finite set of $x$, $f(x)$ will have a multivariate normal distribution $N(\mu, \Sigma(x))$. I want to emphasize that I’m certain this definition of a “Process” did not emerge from first principles, but instead was developed by people (Andrey Kolmogorov and Norbert Wiener?) who wanted to try the intuitive thing above, and reverse engineered the definition that would make it valid. (Which honestly is probably how all definitions are developed.)

You never have to actually care about $f$ itself, since we only ever deal with it on a finite set of observed and predicted points, but it makes this well-defined. (If you want $y$ to have noise, just say $y = f(x) + \epsilon$, which turns into our friend $\sigma_\epsilon^2$ on the diagonal of $\Sigma$.)

Extensions of our simple GP

The Gaussian Process above isn’t ready for the real world. There are a number of extensions that have been built to make this workable. Those include:

  1. Estimating a kernel function. Instead of choosing a fixed kernel function, we usually want to allow for a parametrized family of kernel functions (allowing for e.g. different distance decays) and let the model learn the optimal values. In this world, it helps to consider a two-step kernel function: first calculate a distance $d(x_i, x_j)$, and then use a parametrized decay $k(d; \theta)$, where we learn the $\theta$.
  2. Reducing computational complexity. Our GP algorithm required using every data point to make a new prediction, which obviously grows unwieldy. Sparse GP algorithms keep track of a smaller number of “inducing points”, and make new predictions by interpolating between them (among other tricks).
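For point 1, the standard way to “learn the $\theta$” is to maximize the GP’s log marginal likelihood. Here’s a minimal grid-search sketch for the bandwidth of our squared exponential kernel (the function names and the `noise_var` default are mine, not from any library):

```python
import numpy as np

def log_marginal_likelihood(x, y, bandwidth, noise_var=0.01):
    # log p(y | x, bandwidth) for a zero-mean GP with a squared
    # exponential kernel plus observation noise on the diagonal
    d = x[:, None] - x[None, :]
    K = np.exp(-((d / bandwidth) ** 2)) + noise_var * np.eye(len(x))
    sign, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)
    return -0.5 * (y @ alpha + logdet + len(x) * np.log(2 * np.pi))

def best_bandwidth(x, y, grid=(0.5, 1.0, 2.0, 4.0)):
    # Keep the bandwidth with the highest marginal likelihood
    return max(grid, key=lambda b: log_marginal_likelihood(x, y, b))
```

Real implementations optimize continuously (with gradients) rather than over a grid, but the scoring function is the same.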