Getting Started with Bootstrap Model Validation

Let’s say we fit a logistic regression model for the purpose of predicting the probability of low infant birth weight, which is an infant weighing less than 2.5 kg. Below we fit such a model using the birthwt data set that comes with the MASS package in R. (This is an example model and not to be used as medical advice.)

We first subset the data to select four variables:

low: indicator of birth weight less than 2.5 kg (0 = no, 1 = yes)
ht: history of hypertension (0 = no, 1 = yes)
ptl: previous premature labor (0 = no, 1 = yes)
lwt: mother’s weight in pounds at last menstrual period

Then we model low as a simple additive function of the other three variables. The model says if we know the mother’s history of hypertension, whether or not she had previous premature labor, and her weight at last menstrual period, then we can use that information to estimate the probability of a low-birth-weight infant.


library(MASS)
data("birthwt")
d <- birthwt[,c("low", "ht", "ptl", "lwt")]
# convert ptl to indicator
d$ptl <- ifelse(d$ptl < 1, 0, 1)
m <- glm(low ~ ht + ptl + lwt, family = binomial, data = d)
summary(m)


  Call:
  glm(formula = low ~ ht + ptl + lwt, family = binomial, data = d)
  
  Deviance Residuals: 
      Min       1Q   Median       3Q      Max  
  -1.7420  -0.8018  -0.6895   0.9647   2.2460  
  
  Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
  (Intercept)  1.017367   0.853337   1.192  0.23317   
  ht           1.893971   0.721090   2.627  0.00863 **
  ptl          1.406770   0.428501   3.283  0.00103 **
  lwt         -0.017280   0.006787  -2.546  0.01090 * 
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 234.67  on 188  degrees of freedom
  Residual deviance: 210.12  on 185  degrees of freedom
  AIC: 218.12
  
  Number of Fisher Scoring iterations: 4

The model certainly seems like it’s doing something. The coefficients are “significant,” which means they’re reliably positive or negative. But is the model any good? Does it do a good job of estimating an appropriate probability for the event of interest?

One statistic to help us assess this is Somers’ D, which is a rank correlation between predicted probabilities and observed responses. It ranges between -1 and 1. When D = 1, the predictions are perfectly discriminating. When D = 0, the model is making random predictions. The basic idea is that a good model will return higher probabilities more often than not when the binary response is 1 versus 0.

We can calculate Somers’ D using the somers2() function in the Hmisc package. In the output, it is listed as “Dxy”. This function also lists another statistic, the c-index, which is also known as AUC or area-under-the-curve. (See this StatLab article to learn more about AUC.) A c-index value close to 1 is best. A value close to 0.5 indicates the model is guessing at random. Somers’ D and the c-index are related as follows:

\[D_{xy} = 2 \times (c - 0.5)\] To use the somers2() function, set the x argument to the predicted probabilities and the y argument to the observed responses.


library(Hmisc)
D_orig <- somers2(x = predict(m, type = "response"), y = d$low)
D_orig


            C         Dxy           n     Missing 
    0.7189048   0.4378096 189.0000000   0.0000000

The results are not bad. Somers’ D is well above 0 and the c-index is about 0.72. Our model may be good enough for clinical use.

But both indices are optimistic. We’re using the same data we used to fit the model to evaluate the model. How will our model work with new data not used in fitting the model? To answer this we use model validation.

A common approach to model validation is data-splitting. This is where we randomly split our data into training and test sets, build a model using the training data, and then see how the model performs on the test data. For our birthwt data set, this would mean holding out some portion of our data (say, 30%), building a model with the remaining data, and then evaluating how well the model works on the data held out. Hopefully the model would discriminate between low and normal birth weights with suitable probabilities.

The problem with the data-splitting approach is that it greatly reduces the sample size for model building. In designed experiments and field studies, data can be expensive and/or hard to obtain. We would prefer to not randomly “hold out” some portion of data when we run our analysis. Plus, any model validation statistic that we calculate is for the model built on a subset of the data, not all the data. In addition, a lucky split could produce an unusually good model. Cross-validation alleviates some of these problems, but again, it does not validate the full model built using all the data (Harrell, 2015, p. 112).

A more efficient approach is bootstrap model validation. The basic steps are as follows:

Resample the data with replacement.
Refit the same model to the resampled data.
Evaluate the performance of the refit model on the resampled data set (using a index such as Somers’ D).
Evaluate the performance of the refit model on the original data set.
Take the difference between steps 3 and 4.
Repeat steps 1--5 about 200 times.
Take the average of the 200 differences and subtract it from the original index estimate to get a bias-corrected estimate of how the final model will perform on future data.

Let’s demonstrate by iterating through steps 1--5 above one time. First we resample the number of rows with replacement and save the result to i. (We use set.seed(11) so the interested reader can follow along and reproduce these results.) Then we use i to subset the data when we refit the model. Next we calculate two versions of Somers’ D: one using the resampled data (which we call D_train) and another using the original data (which we call D_test). Notice we use subsetting brackets to extract just the “Dxy” value. Finally we take the difference between the two statistics and subtract it from the original estimate. Notice the “Corrected” Somers’ D is slightly lower.


set.seed(11)
i <- sample(nrow(d), size = nrow(d), replace = TRUE)
m2 <- glm(low ~ ht + ptl + lwt, family = binomial, data = d[i,])
D_train <- somers2(x = predict(m2, type = "response"), 
                   y = m2$y)["Dxy"]
D_test <- somers2(x = predict(m2, newdata = d, type = "response"), 
                  y = d$low)["Dxy"]
D_diff <- D_train - D_test
cbind("Original" = D_orig["Dxy"], 
      "Corrected" = D_orig["Dxy"] - D_diff)


       Original Corrected
  Dxy 0.4378096 0.4366723

Now let’s do this 200 times. We could create a loop to do this, but we’ll use the boot() function from boot package that comes with R. We first need to create a function that takes at least two arguments, one for the data frame and one for resampling the data frame. We name the function somersd and call the first argument data. This argument will take the data frame we use for resampling. We call the second argument index. This argument will take the resampled row numbers of the data frame that boot() will use to create the bootstrap sample. You can see the body of the function is basically our code from above, with d and i replaced with data and index. The function returns the difference between the two Somers’ D estimates. Recall the first estimate is for the performance of the bootstrapped model on the bootstrapped data, and the second estimate is for the performance of the bootstrapped model on the original data.


somersd <- function(data, index){
  m <- glm(low ~ ht + ptl + lwt, family = binomial, 
           data = data[index,])
  # train
  D_train <- somers2(predict(m, type = "response"),
                     m$y)["Dxy"]
  # test
  D_test <- somers2(predict(m, newdata = data, type = "response"), 
                     d$low)["Dxy"]
  D_train - D_test
}

With our function created, we’re ready to proceed with bootstrapping. We first load the boot package so we can access the boot() function. The first argument is the data frame we want to bootstrap. In this case, this is our original data set, d. The second argument is the statistic we want to calculate on each bootstrap sample. This is where our custom somersd() function comes in. The third argument, R, is how many bootstrap replicates we wish to perform. Finally, we save the result to an object called sd.out. We also use set.seed(222) so you can replicate the results.


library(boot)
set.seed(222)
sd.out <- boot(data = d, statistic = somersd, R = 200)

The bootstrap replicates are in a vector called t in the sd.out object. We can take the mean of that vector and subtract it from our original Somers’ D estimate to get a bias-corrected estimate. The estimate is a bit lower, reflecting the fact our model won’t perform as well on future data.


D_orig["Dxy"] - mean(sd.out$t)


        Dxy 
  0.4247608

The rms package provides the validate() function that automatically performs bootstrap model validation for Somers’ D and a number of other indices. The only catch is we need to use the modeling functions in the rms package in order to use validate(). In this case we need to use the lrm() function, which stands for “logistic regression modeling.” It uses the same formula notation as the glm() function but does not have a family argument. To use validate() with the model object, we need to also include the arguments x = TRUE, y = TRUE when fitting the model. This simply saves the original data with the model object. Below we tell validate() to run 200 bootstrap replicates for our fitted rms model.


library(rms)
m_rms <- lrm(low ~ ht + ptl + lwt, data = d, x = TRUE, y = TRUE)
set.seed(333)
v <- validate(m_rms, B = 200)
v


            index.orig training    test optimism index.corrected   n
  Dxy           0.4378   0.4496  0.4294   0.0201          0.4177 200
  R2            0.1713   0.1933  0.1557   0.0377          0.1336 200
  Intercept     0.0000   0.0000 -0.0554   0.0554         -0.0554 200
  Slope         1.0000   1.0000  0.9007   0.0993          0.9007 200
  Emax          0.0000   0.0000  0.0327   0.0327          0.0327 200
  D             0.1246   0.1434  0.1121   0.0313          0.0933 200
  U            -0.0106  -0.0106  0.0052  -0.0158          0.0052 200
  Q             0.1352   0.1540  0.1069   0.0471          0.0880 200
  B             0.1852   0.1800  0.1897  -0.0097          0.1950 200
  g             0.8943   0.9722  0.8331   0.1390          0.7553 200
  gp            0.1751   0.1807  0.1638   0.0168          0.1583 200

The first row shows Somers’ D as “Dxy”. The bootstrapped validated value is listed in the “index.corrected” column. Notice it’s similar to the value we calculated using boot(). Of course, it’s slightly different because bootstrapping is based on randomly selected data sets. The “index.orig” column shows our original estimate of Somers’ D, 0.4378. The “training” and “test” columns show the average of the two Somers’ D estimates made during bootstrapping. The “training” estimate is the average bootstrap model performance on the bootstrapped data. The “test” estimate is the average bootstrap model performance on the original unsampled data. The “optimism” column is the difference between training and test. The “index.corrected” value is calculated by subtracting “optimism” from “index.orig”.

The validate() output lists 10 more indices. See this extremely helpful blog post for a summary of what they mean and how to interpret them. The “R2” index, as you may have guessed, is a version of the traditional R-squared associated with linear models. It ranges from 0 to 1 and is interpreted as the proportion of variance explained. Clearly, values near 1 would be preferred. The “B” index is the Brier score. In its most common formulation, it ranges between 0 and 1, with a score near 0 being desirable. A score of 0.25 is associated with random guessing. See this StatLab article to learn more about Brier scores.

Based on these bootstrapped indices, we’re a little more modest about our model’s potential performance in a clinical setting. Whereas before we may have been excited about “significant” coefficients, we now see that our model’s predictive ability is decent but not great. Hopefully you now have a better understanding of how to use the bootstrap to validate a statistical model.

R session details

The analysis was done using the R Statistical language (v4.5.1; R Core Team, 2025) on Windows 11 x64, using the packages boot (v1.3.31), Hmisc (v5.2.3), rms (v8.0.0) and MASS (v7.3.65).

References

Canty, A., & Ripley, B. (2021). boot: Bootstrap R (S-Plus) functions (Version 1.3-28). https://CRAN.R-project.org/package=boot
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge University Press.
Harrell, F. E., Jr. (2022). rms: Regression modeling strategies (Version 6.3-0). https://CRAN.R-project.org/package=rms
Hosmer, D. W., & Lemeshow, S. (1989). Applied logistic regression. Wiley.
Harrell, F. E., Jr. (2015). Regression modeling strategies (2nd ed.). Springer.
Harrell, F. E., Jr. (2022). Hmisc: Harrell miscellaneous (Version 4.7-1). https://CRAN.R-project.org/package=Hmisc
R Core Team. (2022). R: A language and environment for statistical computing (Version 4.2.1). R Foundation for Statistical Computing. https://www.R-project.org/
Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.