A Brief on Brier Scores

Not all predictions are created equal, even if, in categorical terms, the predictions suggest the same outcome: “X will (or won’t) happen.” Say that I estimate that there’s a 60% chance that 100 million COVID-19 vaccines will be administered in the US during the first 100 days of Biden’s presidency, but my friend estimates that there’s a 90% chance of that outcome. If it comes to pass—100 million arms get 100 million pokes (not each, thankfully)—do our respective predictions get equal credit for accuracy? We both thought that it was more likely than not that 100 million doses would be administered. But for future bets about biomedicine, my friend may well be a better source of predictions: In this hopeful hypothetical, there was a fairly thin gap between my friend’s near-certain prediction of 90% and the eventual outcome; the gap between my prediction and the outcome was far larger. Neither of us perfectly foresaw what would happen (no one said, “There’s a 100% chance of 100 million shots”); both of our predictions were mostly right and a bit wrong—but my friend’s prediction was a lot more mostly-right and a lot less a-bit-wrong than mine.

The accuracy of a set of probabilistic predictions (also called “forecasts”) can be formally measured with a Brier score. The math is fairly intuitive. First, consider a set of events with binary outcomes (“X will or will not happen”; “Y will or will not happen”). If an event comes to pass (“X did happen”), it’s assigned a value of 1. If an event doesn’t occur (“Y didn’t happen”), it’s assigned a value of 0. Given probabilistic predictions for those events (“.92 probability of X”; “.34 probability of Y”), the Brier score is just the mean of squared differences between those predictions and their corresponding event scores (1s and 0s). That is, it’s the mean squared error:

\[\text{Brier score} = \frac{1}{N} \sum_{t\,=\,1}^N (f_t - o_t)^2\]

  • \(N\) is the number of events (and, accordingly, predictions) under consideration
  • \(t\) indexes the events/predictions from 1 to \(N\) (the first event, the second event, etc.)
  • \(f_{t}\) is the forecast (a probability from 0 to 1) for the tth event
  • \(o_{t}\) is the outcome (0 or 1) of the tth event

Larger differences between probabilistic forecasts and event outcomes reflect more error in predictions, so a lower Brier score indicates greater accuracy. And because all squared errors will lie between 0 and 1 (the maximum squared error is \(1^2 = 1\)), Brier scores calculated with the above formula are bound between 0 and 1: A Brier score of 0 reflects perfect accuracy (i.e., there is no difference between event scores—0s and 1s—and someone’s probabilistic predictions for those events), and a Brier score of 1 reflects perfect inaccuracy (i.e., someone assigns probabilities of 0 to events that occur and probabilities of 1 to events that do not occur).

Consider this example: On Monday, a person makes 10 probabilistic predictions about whether 10 different stocks will finish the week trading higher than their value at the beginning of the week. Below, I simulate (a) 10 stocks, (b) a prediction for each one about whether it will finish the week trading higher than its Monday-morning value, and (c) 10 binary outcomes, 1s and 0s, reflecting whether each stock finished the week trading above (= 1) or not-above (i.e., equal to or lower than) its starting value (= 0).

ticker_maker <- function() {
  paste(sample(toupper(letters), sample(3:4, 1), replace = TRUE), collapse = '')
dat <- data.frame(stocks = replicate(10, ticker_maker()),
                  predictions = round(runif(n = 10, min = 0, max = 1), digits = 2),
                  finish_higher = sample(0:1, size = 10, replace = TRUE))

   stocks predictions finish_higher
1     DOL        0.28             0
2    KHCP        0.73             1
3    THDM        0.89             1
4     NHS        0.54             1
5    SWGL        0.83             0
6    NUZJ        0.60             0
7     BGB        0.54             0
8    ULAD        0.09             0
9     ESL        0.33             1
10   RODX        0.93             1

To calculate this speculator’s Brier score, we follow the formula above:

sum((dat$predictions - dat$finish_higher)^2) / nrow(dat)

[1] 0.21774

0.218. One could do worse, but the predictions were far from perfect. Remember: A Brier score of 0 means perfect accuracy, and a Brier score of 1 means perfect inaccuracy. To further help with the interpretation of scores, consider that a perpetual fence-sitter—someone who assigns a probability of 0.5 to every event—would wind up with a Brier score of 0.25.

fence_sitter <- data.frame(predictions = rep(0.5, 100),
                           outcomes = sample(0:1, 100, replace = TRUE))
print(head(fence_sitter, n = 4), row.names = FALSE)

 predictions outcomes
         0.5        0
         0.5        0
         0.5        1
         0.5        1

sum((fence_sitter$predictions - fence_sitter$outcomes)^2) / nrow(fence_sitter)

[1] 0.25

(For a fence-sitter, every squared error will be \(0.5^2 = .25\), and the average of a bunch of 0.25s is just 0.25.)

If we had two stock speculators instead of one in the example above, determining whose predictions were more accurate would be simple: Who has the lower Brier score at the end of the week? We can, though, add nuance to a comparison of Brier scores by calculating Brier skill scores, which reflect the relative performance of one Brier score (representing a forecaster/set of forecasts; call this \(BS_{f}\)) over another Brier score (the “reference” Brier score; call this \(BS_{ref}\)):

\[\text{Brier skill score} = 1 - \frac{BS_f}{BS_{ref}}\]

If \(BS_{f}\) is identical to \(BS_{ref}\), then \(\frac{BS_f}{BS_{ref}} = 1\), and the Brier skill score is therefore 0, reflecting no relative improvement of \(BS_{f}\) over \(BS_{ref}\).

Consider, however, what happens if \(BS_{f}\) is perfect (0) and \(BS_{ref}\) is imperfect (not 0). In this case, when \(BS_{f}\) has perfect performance relative to \(BS_{ref}\), the Brier skill score is \((1 - \frac{0}{\text{some value (0, 1]}}) = 1\).

One final case remains: When \(BS_{f}\) is worse than \(BS_{ref}\). Consider what happens when \(BS_{f}\) is some value, say 0.5, and \(BS_{ref}\) is better (< 0.5). In this case, as \(BS_{ref}\) creeps toward zero, the value of \(\frac{BS_f}{BS_{ref}}\) skyrockets (e.g., \(\frac{0.5}{0.49} = 1.02\); \(\frac{0.5}{0.01} = 50\)). That term is subtracted from one, and thus the Brier skill score is negative and approaches \(-\infty\) as \(BS_{f}\)’s performance declines relative to \(BS_{ref}\).

The Brier skill score, then, has a lower limit of \(-\infty\) and an upper limit of 1. Negative values reflect worse relative performance of \(BS_{f}\) to \(BS_{ref}\); 0 reflects identical performance, and 1 represents perfect relative performance.

Multi-category predictions

Many probabilistic predictions are about events with binary outcomes: “Will Washington, D.C., become a state in 2021?” “Will another Cleveland sports team win a national championship before 2030?” A forecaster needs to issue only one likelihood estimate for each question. But that’s hardly true in all cases: “Between the Pfizer, Moderna, Oxford-AstraZeneca, and Johnson & Johnson COVID-19 vaccines, which one will show the highest effectiveness in clinical trials?” A probabilistic prediction in this case needs four elements, one number for each vaccine. (And, fortunately, we now have data to answer that particular question.)

With a slight adjustment, the Brier score formula introduced above, which is suitable for predictions of events with binary outcomes, can handle predictions of events with multi-category outcomes. Instead of taking the squared error once for each event under consideration and averaging those, we first calculate the sum of squared errors for each event based on the predictions issued for all of that event’s possible outcomes, and we then take the average of those sums.

\[\text{Multi-category Brier score} = \frac{1}{N} \sum_{t\,=\,1}^N \sum_{c\,=1\,}^R (f_{tc} - o_{tc})^2\]

  • \(N\) is the number of events under consideration
  • \(t\) indexes the events from 1 to \(N\) (the first event, the second event, etc.)
  • \(R\) is the number of possible outcomes for each event (these are also called outcome “classes” or “categories”)
  • \(c\) indexes the possible outcomes for each event from 1 to \(R\)
  • \(f_{tc}\) is the forecast (a probability) for the cth possible outcome of the tth event
  • \(o_{tc}\) is the realized outcome (0 or 1) of the cth possible outcome of tth event

As Glenn Brier, the score’s originator, put it in his 1950 Monthly Weather Review article outlining the formula, “The \(R\) classes are chosen to be mutually exclusive and exhaustive….” This means that for a given event, the \(R\) outcomes considered in the calculation of the Brier score should cover all possible outcomes of the event. (And, accordingly, the probabilities issued for a given event’s \(R\) possible outcomes should sum to 1.) One and only one of the \(R\) events will be assigned an outcome value of 1; the others will receive outcome values of 0.

Say that someone is making probabilistic predictions about the outcomes of a series of war games. The war games are set up so that there are three possible resolutions for each one: military victory, military defeat, or peace. I simulate below (a) predictions for each possible outcome for each of the 10 war games, and (b) realized outcomes for all 10 games.

mil_victory_pred <- vector(length = 10)
mil_defeat_pred <- vector(length = 10)
peace_pred <- vector(length = 10)
mil_victory <- vector(length = 10)
mil_defeat <- vector(length = 10)
peace <- vector(length = 10)

for (i in 1:10) {
  mil_victory_pred[i] <- round(runif(1, min = 0, max = 1), digits = 2)
  mil_defeat_pred[i] <- round((1 - mil_victory_pred[i]) /
                                sample(seq(from = 1.5, to = 3, by = 1), size = 1),
                              digits = 2)
  peace_pred[i] <- (1 - mil_victory_pred[i]) - mil_defeat_pred[i]
  mil_victory[i] <- sample(c(0, 0, 1), size = 1)
  mil_defeat[i] <- ifelse(mil_victory[i] == 1, 0, round(runif(1, min = 0, max = 1)))
  peace[i] <- ifelse(mil_victory[i] == 0 & mil_defeat[i] == 0, 1, 0)

games <- data.frame(game = 1:10, mil_victory_pred, mil_defeat_pred, peace_pred, mil_victory, mil_defeat, peace)
print(games, row.names = FALSE)

 game mil_victory_pred mil_defeat_pred peace_pred mil_victory mil_defeat peace
    1             0.12            0.59       0.29           1          0     0
    2             0.04            0.38       0.58           1          0     0
    3             0.07            0.37       0.56           0          1     0
    4             0.18            0.55       0.27           1          0     0
    5             0.11            0.59       0.30           0          0     1
    6             0.12            0.59       0.29           0          1     0
    7             0.76            0.10       0.14           0          0     1
    8             0.59            0.27       0.14           0          0     1
    9             0.94            0.02       0.04           0          0     1
   10             0.01            0.40       0.59           0          0     1

For each game, calculate the sum of squared errors between the prediction (0 to 1) and realization (0 or 1) of each possible outcome (military victory, military defeat, or peace). For example, for the first game, the sum of squared errors is: \((0.12 - 1)^2 + (0.59 - 0)^2 + (0.29 - 0)^2 = 1.2066\). Complete this sum-of-squared-errors process for each game, and then take the average of those values. To automate this:

sum((games$mil_victory_pred - games$mil_victory)^2 + (games$mil_defeat_pred - games$mil_defeat)^2 +
      (games$peace_pred - games$peace)^2) / nrow(games)

[1] 1.01106

This example demonstrates an important aspect of the multi-category Brier score formula: Its possible values are bound between 0 and 2, not 0 and 1 like in the formula introduced first. (I’ll call that first formula the “binary-events-only Brier score.”) As before, lower scores reflect more accuracy, and higher scores reflect less. Consider which set of predictions would generate maximum error for a given war game using the multi-category Brier score formula: A predicted probability of 1 for an outcome that doesn’t occur (say, military victory), a predicted probability of 0 for the outcome that does occur (say, peace), and a predicted probability of 0 for any other outcome(s) (in this example, military defeat). The sum of squared errors for the single war game in question would be:

\[(1 - 0)^2_{military\,victory} + (0 - 0)^2_{military\,defeat} + (0 - 1)^2_{peace} = 2 \]

If this tragically poor pattern of prediction held for all the other war games (assigning a probability of 1 to an outcome that doesn’t occur and a probability of 0 to an outcome that does), then the average of the sums of squared errors—the multi-category Brier score—would be the average of a bunch of 2s, namely: 2.

This formulation of the Brier score—which is the one originally described by Brier in 1950—also works for evaluating predictions about binary events, not just multi-category events. But compared to the simpler formula discussed initially, it complicates the calculation unnecessarily when one is only dealing with predictions about binary events, and it changes the scoring scale from [0, 1] to [0, 2]. Consider this example:

A weather forecaster makes a prediction for two upcoming days about whether it will snow. The forecaster says there’s a 75% chance of snow on the first day and a 92% chance of snow on the second. Normally, we think about these predictions as being about a single outcome—snow, yes/no—on each day. We can also, though, think about these predictions as being about two possible outcomes on each day: snow and not-snow. Conceptualizing things in these terms, the forecaster’s prediction of a 75% chance of snow for the first day implies a 25% chance of not-snow, and their 92% chance prediction for the second day implies an 8% chance of not-snow.

Say that it snows on both days. To calculate the forecaster’s Brier score using the simpler, binary-events-only formula, we calculate the mean squared error based on each day’s prediction about the outcome snow:

\[\frac{(.75 - 1)^2_{day\,1\,snow} + (.92 - 1)^2_{day\,2\,snow}}{2} = 0.03445 \]

To calculate the forecaster’s Brier score using the multi-category formula, we calculate the average of the events’ sums of squared errors, which are based on predictions about the outcomes snow and not-snow for each day:

\[\frac{[(.75 - 1)^2_{day\, 1\,snow} + (.25 - 0)^2_{day\,1\,not\,snow}] + [(.92 - 1)^2_{day\, 2\,snow} + (.08 - 0)^2_{day\,2\,not\,snow}]}{2} = 0.0689 \]

The Brier score obtained using the multi-category formula (which uses implied “not-snow” probabilities to represent multiple possible outcome categories for a binary event) is twice the value obtained using the binary-events-only formula. This relationship will always hold if you calculate Brier scores for a set of predictions about binary events using both formulas:

In the binary-events-only formula, the error associated with a prediction about the binary event \(s\) is the square of the difference between the predicted probability of \(s\) and the outcome of \(s\): \((f_s - o_s)^2\)

In the multi-category formula, the error associated with a prediction about the binary event \(s\) is the square of the difference between the predicted probability of \(s\) and the outcome of \(s\) plus the square of the difference between the implied probability of \(not\,s\) and the outcome of \(not\,s\), which is 0 if \(o_s\) is 1 and 1 if \(o_s\) is 0: \((f_s - o_s)^2 + [(1 - f_s) - o_{not\,s}]^2\)

The two squared errors that get summed together when using the multi-category formula to evaluate predictions of binary events are equal to each other: For example, if \(f_s = .6\) and \(o_s = 1\), then \((1 - f_s) = .4\) and \(o_{not\,s} = 0\), and \((.6 - 1)^2 = (.4 - 0)^2\). This means that the prediction error calculated for a binary event using the multi-category Brier score formula (which is a sum of squared errors) is twice as large as the prediction error calculated for that same event using the binary-events-only formula (which is a lone squared error).1

If we’re calculating the Brier score for just one event, we stop there—and thus the multi-category Brier score is two times the binary-events-only Brier score. If there are multiple events under consideration, we sum all the errors for the events (and the errors are twice as large when using the multi-category formula) and then divide by the number of events, \(N\). \(\frac{2x}{N}\) is twice as large as \(\frac{x}{N}\). The Brier score calculated with the multi-category formula for a set of \(N\) binary events is therefore twice as large as the Brier score calculated with the binary-events-only formula. (The range of the multi-category Brier score formula, [0, 2], is twice that of the binary-events-only Brier score formula, [0, 1], so the scores reflect the same forecasting skill in both cases, just on different scales. For the sake of simplicity, if you’re scoring predictions about exclusively binary events, the simpler score introduced first in this article is likely preferable.)

The Brier score is not alone as a method for assessing the accuracy of probabilistic predictions.2 But it’s a quality method: It provides an intuitive, simple-to-calculate way of evaluating forecasts, and the relative performance of forecasters can be easily compared with the Brier skill score. It’s a useful tool for anyone looking to get a clear sense of when predictions are on the money and when they’ve gone awry.

Useful links and additional notes:

  • You can read Glenn W. Brier’s delightfully concise 1950 Monthly Weather Review article describing the score here.
  • There are a number of packages/functions available in R that calculate Brier scores. For example, DescTools::BrierScore() works for binary events:

DescTools::BrierScore(resp = dat$finish_higher, pred = dat$predictions)

[1] 0.21774

  • The function BrierScore() from the mclust package can handle events with multi-category outcomes:

# BrierScore() takes as arguments a matrix of predictions and a vector of realized outcomes
# (i.e., a vector reflecting which outcome from a set of R possibilities came to pass for each event)
outcomes <- c(1, 1, 2, 1, 3, 2, 3, 3, 3, 3)

# By default, BrierScore() halves Brier scores in order to get scores on a [0, 1] scale;
# multiply the output by two to get scores on a [0, 2] scale
mclust::BrierScore(as.matrix(games[, c('mil_victory_pred', 'mil_defeat_pred', 'peace_pred')]), outcomes)*2

[1] 1.01106

  • The functions above are just a sample of those on offer for calculating Brier scores in R. You might also consider functions like scoring::brierscore() or rms::val.prob(), the latter of which calculates an assortment of other metrics in addition to Brier scores.

Jacob Goldstein-Greenwood
StatLab Associate
University of Virginia Library
February 15, 2021

  1. An algebraic demonstration of this relationship is below. The left side of the equation reflects 2 times the squared error between a forecast (\(f\)) and an outcome (I arbitrarily use 1 as the outcome here, but an example using 0 as the outcome leads to the same conclusion). The right side of the equation reflects the sum of squared errors for a binary prediction that is treated as comprising predictions about two outcomes (“\(f\) probability of ‘X,’ and \((1 - f)\) probability of ‘not-X’”). \[2(f - 1)^2 = (f - 1)^2 + [(1 - f) - 0]^2\] \[2(f^2 - 2f + 1) = (f^2 - 2f + 1) + (1 - 2f + f^2)\] \[2(f^2 - 2f + 1) = 2f^2 - 4f + 2\] \[\frac{2(f^2 - 2f + 1)}{2} = \frac{2f^2 - 4f + 2}{2}\] \[f^2 - 2f + 1 = f^2 - 2f + 1\] ↩︎
  2. The logarithmic scoring rule, for example, assigns a forecaster a score by taking the negative logarithm of their probabilistic prediction for the realized outcome of an event: \(-ln(f_{realized\,outcome})\). Say that the possible outcomes of an event are GREEN, YELLOW, and RED, and a forecaster issues probabilities for those outcomes of .55, .20, and .25, respectively. If GREEN is the true outcome, then the score is \(-ln(.55) = 0.598\). The ideal score is achieved when \(f_{realized\,outcome} = 1\) and is therefore \(ln(1) = 0\). In this formula, an \(f_{realized\,outcome}\) value between 0 and 1 will yield a positive value; thus the best score (0) is the minimum score, and more-accurate predictions yield smaller and smaller scores (e.g., \(-ln(.1) = 2.303; -ln(.9) = 0.105; -ln(1) = 0\) ). In the case of predictions for binary events—say, “70% chance of snow”—the logarithmic score is \(-ln(.7) = 0.357\) if it snows and \(-ln(1 - .7) = 1.204\) if it doesn’t. If predictions are made for several events, then the logarithmic score for all events under consideration is the sum of the events’ individual logarithmic scores: \[\sum_{t\,=\,1}^N -ln(f_{t\,realized})\] Where \(N\) is the number of events under consideration, \(t\) indexes the events from 1 to \(N\), and \(f_{t\,realized}\) is the forecast issued for the realized outcome of the tth event. Note that the logarithmic scoring rule can break if a probability of 0 is assigned to a realized outcome, as the logarithm of 0 is undefined. The Brier score does not have this issue. ↩︎

For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.