Multiple Imputation (MI) is a method for dealing with missing data in a statistical analysis. The general idea of MI is to simulate values for missing data points using the data we have on hand, generating multiple new sets of complete data. We then run our proposed analysis on all the complete data sets and combine the results to obtain overall estimates. The end product is an analysis with proper standard errors and unbiased estimates. For a gentle introduction to MI, see the StatLab article, Getting Started with Multiple Imputation in R. In this article, we demonstrate how to implement MI for a basic longitudinal data analysis using the R package mice. We’ll do this by first simulating a complete longitudinal data set and performing the correct analysis to obtain unbiased estimates. We’ll then generate missing data using a multivariate amputation procedure, re-run the analysis on the available data, and investigate how our estimates have changed for the worse. Finally we’ll use MI to impute the missing data, perform the planned analysis, and verify that we’re obtaining unbiased results.
Before we get started, let’s load the packages we’ll use in this tutorial. Throughout the article we’ll mention which functions we use from which packages. If you don’t have a package, you can install it using the function install.packages(). For example, to install the lme4 package, run install.packages("lme4").
library(lme4)
library(emmeans)
library(tidyr)
library(dplyr)
library(ggplot2)
library(mice)
library(naniar)
library(miceadds)
library(broom.mixed)
Simulate complete data
To begin, let’s generate some data. Below we generate data for 200 imaginary subjects measured at three time points. We use the expand.grid() function to quickly create three time points for each subject. Because subjects are measured three times each, we’ll have a total of 600 observations. We use the head() function to display the first six rows.
n <- 200
id <- 1:n
time <- 1:3
d <- expand.grid(time = time, id = id)
head(d)
time id
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
Now let’s add some covariates. First we’ll add what are called level 2 variables. These are subject-level variables that stay fixed over time. We create two variables called “x” and “age”. We might think of these as baseline measurements in our imaginary study. Again we use head() to view the first six records. Notice x and age are constant over time for subjects one and two. Of course a person’s age does technically change over time, but for this simple demonstration we’ll assume the measurements are made relatively close in time and that age does not meaningfully change.
set.seed(8989) # for reproducibility
x <- rnorm(n, mean = 5, sd = 1)
age <- round(rnorm(n = n, mean = 40, sd = 10))
d2 <- data.frame(id, x, age)
d <- merge(d, d2, by = "id")
rm(d2)
head(d)
id time x age
1 1 1 4.647489 49
2 1 2 4.647489 49
3 1 3 4.647489 49
4 2 1 5.231728 45
5 2 2 5.231728 45
6 2 3 5.231728 45
Next we generate a level 1 covariate. Level 1 variables are values that change over time. In this case we generate a simple binary indicator called “z”. We might think of this as some condition that each subject either has or does not have at each time point. We see subject one has the condition at time points one and two, while subject two has the condition at time points two and three.
d$z <- sample(x = 0:1, nrow(d), replace = TRUE)
head(d)
id time x age z
1 1 1 4.647489 49 1
2 1 2 4.647489 49 1
3 1 3 4.647489 49 0
4 2 1 5.231728 45 0
5 2 2 5.231728 45 1
6 2 3 5.231728 45 1
Finally we simulate a response variable, which we’ll call “y”. Since this is longitudinal data, we will have two sources of variation: (1) between subjects, and (2) within subjects. In other words, we’ll have subject-specific noise and observation-specific noise. We can think of noise as unexplained variability that our model won’t be able to account for. Below we sample 200 values from a normal distribution with mean 0 and standard deviation 1.2 to simulate subject-specific noise and save as “u”. Then we sample 600 values from a normal distribution with mean 0 and standard deviation 0.5 to simulate observation-specific noise and save as “e”. After that we generate y as a function of z, time, age and x. The parameters for the predictors are 1.5, 0.25, 0.5 and 0.1, respectively. (These were arbitrarily chosen. Feel free to choose your own parameter values.) Notice we incorporate the subject-specific noise into the intercept using the syntax (1.2 + u[d$id]). This means when we generate y, each subject gets a different intercept according to their specific random variation. Since the id variable is numbered 1 - 200, with each id number repeated three times per person, we can use R’s indexing brackets to extract the subject-specific noise from the u vector using the subject id as the index number.
u <- rnorm(n, sd = 1.2) # between subject variation
e <- rnorm(nrow(d), sd = 0.5) # within subject variation
d$y <- (1.2 + u[d$id]) + 1.5*d$z + 0.25*d$time + 0.5*d$age +
0.1*d$x + e
head(d)
id time x age z y
1 1 1 4.647489 49 1 28.00795
2 1 2 4.647489 49 1 27.91427
3 1 3 4.647489 49 0 26.82804
4 2 1 5.231728 45 0 23.04568
5 2 2 5.231728 45 1 25.59948
6 2 3 5.231728 45 1 25.60035
We now have a complete longitudinal data set where each subject is measured at three time points. The x and age variables are constant over time, and the z and y variables change over time.
In the next section we will randomly drop values to generate missing data and demonstrate multiple imputation. But first let’s analyze the complete data set using a mixed-effect model. In real life we have to decide what kind of model we want to fit. Do we include all or some of the variables? Do we need interactions? Do we need non-linear effects? Which variables need random effects? Since we generated the data we actually know the “correct” model. Let’s fit the same model we used to generate the data and see how close we come to recovering the true parameter values. To do this we’ll use the lmer() function from the lme4 package. Notice the model we fit, y ~ z + time + age + x + (1|id), is the same model we used above to generate y. The lmer() model syntax says “model y as a function of z, time, age and x, and let the intercept be conditional on subject.” This is precisely how we generated y. This is usually called a random-intercept model.
m <- lmer(y ~ z + time + age + x + (1|id), data = d)
summary(m, corr = F)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ z + time + age + x + (1 | id)
Data: d
REML criterion at convergence: 1504.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.35342 -0.53866 -0.00413 0.53512 2.68813
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 1.421 1.1922
Residual 0.273 0.5225
Number of obs: 600, groups: id, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 1.425258 0.547072 2.605
z 1.482173 0.051958 28.526
time 0.228381 0.026125 8.742
age 0.500884 0.008149 61.466
x 0.065706 0.085357 0.770
The estimated standard deviations of the subject and observation-specific noise distributions are listed under the “Std.Dev.” column in the Random effects section. Notice they’re quite close to the “true” values we used to simulate the data. Likewise, the estimated parameters for the predictors are listed under the “Estimate” column in the Fixed effects section. They, too, are quite close to the values used to simulate the data. This is expected since we fit the same model to the data that we used to generate the data.
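As an optional side-by-side check, we could line up the estimated fixed effects against the true values from the simulation (something we can only do here because we generated the data ourselves):
# compare estimates to the true values used to simulate the data
cbind(true = c(1.2, 1.5, 0.25, 0.5, 0.1), estimate = fixef(m))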
If we assume the z variable was a predictor of interest, we might want to get the estimated outcome at time 3 for each level of z. We can get that using the emmeans() function from the emmeans package. We see the expected value of y at time 3 when z = 1 is about 24.1, which is about 1.5 higher than the expected value of y at time 3 when z = 0. (The expected difference is reported in the summary output above for z.)
emmeans(m, specs = ~z, at = list(z = 0:1, time = 3))
z emmean SE df lower.CL upper.CL
0 22.6 0.0945 269 22.5 22.8
1 24.1 0.0944 268 23.9 24.3
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Keep these results in your back pocket. We’ll refer to them throughout the article when we analyze this same data set with missing values and then again when we use multiple imputation to address the missingness.
Generate missing data
Now let’s take our complete data set and delete values. Obviously this is not something you would ever do on purpose with real data. We do it here to demonstrate how analyzing data with missing values can lead to biased (wrong) results.
Missing data is generally classified into one of three groups:
- Missing Completely at Random (MCAR). These are data that are missing purely due to chance. For example, someone loses their internet connection while completing an online survey and doesn’t get to finish.
- Missing at Random (MAR). These are data where the probability of missingness depends on data we observed. For example, older subjects may be less likely to complete a study with multiple follow-ups. Assuming we collected subjects’ ages, we have some understanding of why certain data are missing.
- Missing Not at Random (MNAR). These are data where the probability of missingness depends on data we did not observe. A classic example is subjects not answering survey questions about their income because it’s too low or too high. In this case the probability of the data being missing depends on data we did not collect.
It’s difficult to know what kind of missing data we have. The first type, MCAR, is often considered unrealistic (Van Buuren 2018). The second type, MAR, is what multiple imputation assumes. The third type, MNAR, is all but impossible to verify because missingness depends on data we didn’t collect. It’s also not possible to distinguish between MAR and MNAR from the data we have on hand (White et al. 2011). Therefore most researchers collect as much data as possible and assume their data is MAR. For a deeper but accessible dive into the concepts of missingness, we recommend section 1.2 in Van Buuren (2018).
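To make the MAR idea concrete, here is a small hand-rolled sketch, separate from the ampute() approach we use below, in which the probability that an outcome is missing depends on an observed age variable. The object names are made up for illustration.
# toy MAR illustration: older subjects are more likely to have a missing outcome
set.seed(1)
age_demo <- round(rnorm(100, mean = 40, sd = 10))
y_demo <- rnorm(100, mean = 20)
p_miss <- plogis(-3 + 0.05 * age_demo) # missingness probability rises with age
y_demo[runif(100) < p_miss] <- NA
mean(is.na(y_demo)) # proportion of y_demo values now missing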
We’re going to create data that is Missing at Random (MAR). That is, we’re going to generate missing data conditional on values in our data. The mice package provides a function specifically for this task called ampute(). The ampute() function lets us generate missingness in several variables at once and control how the values of other variables influence that missingness. This in turn allows us to study and evaluate MI techniques (Schouten et al. 2018).
Before we use ampute(), we reshape our data to wide format using the tidyr function pivot_wider(). The reason we do this is so we can ensure that level 2 data is completely deleted for a subject. For example, the variable x is constant over time for a subject. It wouldn’t make sense to just delete it at time = 2. It needs to either be present at all times, or completely missing over all times. Having our data in wide format ensures this happens. Notice after reshaping we have one row per subject. The level 1 variables z and y now have three columns each.
dw <- pivot_wider(d, names_from = time, values_from = c(z, y))
head(dw)
# A tibble: 6 × 9
id x age z_1 z_2 z_3 y_1 y_2 y_3
<int> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 4.65 49 1 1 0 28.0 27.9 26.8
2 2 5.23 45 0 1 1 23.0 25.6 25.6
3 3 4.21 19 0 1 0 10.5 12.3 10.5
4 4 5.12 42 1 0 0 26.9 25.2 25.1
5 5 4.58 47 1 1 0 27.7 26.0 26.5
6 6 5.13 34 1 0 0 21.8 20.0 19.4
Now we can run ampute() to generate missingness. Running set.seed(2525) allows you to replicate the result. The data frame with missing data is stored in the “amp” element of the dw_na object.
set.seed(2525)
dw_na <- ampute(dw)
head(dw_na$amp)
# A tibble: 6 × 9
id x age z_1 z_2 z_3 y_1 y_2 y_3
<int> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
1 NA 4.65 49 1 1 0 28.0 27.9 26.8
2 2 5.23 45 NA 1 1 23.0 25.6 25.6
3 3 4.21 19 0 1 0 10.5 12.3 10.5
4 4 5.12 42 1 0 0 26.9 25.2 25.1
5 5 4.58 47 1 1 NA 27.7 26.0 26.5
6 6 5.13 34 1 0 0 21.8 20.0 19.4
We’ve now got some missing data, but some of it is in the “id” column, where we wouldn’t expect missingness. It doesn’t make sense that we would record all that data for subject 1 but not log their subject id. Fortunately, the ampute() function provides a number of arguments that let us be more thoughtful about how we generate missing data.
The first argument we can use is the patterns argument, which allows us to specify which patterns of missing data we would like to generate. It requires a matrix. We can look at the “patterns” element of the object returned by ampute() above to see the default missing data patterns. Each row represents a pattern, and a 0 in a row means the variable in that column should be missing. We see the default is to generate patterns where one and only one variable is missing, so there are no instances of a subject missing more than one variable. This is probably unrealistic.
patterns <- dw_na$patterns
patterns
id x age z_1 z_2 z_3 y_1 y_2 y_3
1 0 1 1 1 1 1 1 1 1
2 1 0 1 1 1 1 1 1 1
3 1 1 0 1 1 1 1 1 1
4 1 1 1 0 1 1 1 1 1
5 1 1 1 1 0 1 1 1 1
6 1 1 1 1 1 0 1 1 1
7 1 1 1 1 1 1 0 1 1
8 1 1 1 1 1 1 1 0 1
9 1 1 1 1 1 1 1 1 0
Let’s define a new patterns matrix where we can specify our own patterns of missingness. First we drop all rows of the default matrix except the second row, keeping the pattern where only x is missing.
patterns <- patterns[2,]
patterns
id x age z_1 z_2 z_3 y_1 y_2 y_3
2 1 0 1 1 1 1 1 1 1
Next we use rbind() to append five rows defining five different missing data patterns. The first new pattern has x and y_3 missing. The second pattern has y_2 and y_3 missing. And so on. We also purposely set id, age, z_1 and y_1 to never have missing data.
patterns <- rbind(patterns,
c(1,0,1,1,1,1,1,1,0),
c(1,1,1,1,1,1,1,0,0),
c(1,1,1,1,1,0,1,1,0),
c(1,1,1,1,0,0,1,1,1),
c(1,0,1,1,0,0,1,0,0))
patterns
id x age z_1 z_2 z_3 y_1 y_2 y_3
2 1 0 1 1 1 1 1 1 1
21 1 0 1 1 1 1 1 1 0
3 1 1 1 1 1 1 1 0 0
4 1 1 1 1 1 0 1 1 0
5 1 1 1 1 0 0 1 1 1
6 1 0 1 1 0 0 1 0 0
Now run ampute() again with our “patterns” matrix. This time there are no missing values for the id column and we see instances of more than one missing value for subjects.
set.seed(2525)
dw_na <- ampute(dw, patterns = patterns)
head(dw_na$amp)
# A tibble: 6 × 9
id x age z_1 z_2 z_3 y_1 y_2 y_3
<int> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 NA 49 1 1 0 28.0 27.9 26.8
2 2 5.23 45 0 1 1 23.0 25.6 25.6
3 3 4.21 19 0 1 0 10.5 12.3 10.5
4 4 5.12 42 1 0 0 26.9 25.2 25.1
5 5 4.58 47 1 1 NA 27.7 26.0 NA
6 6 5.13 34 1 0 0 21.8 20.0 19.4
Another argument we can use is the weights argument. This allows us to determine which variables determine missingness and to what degree. This also requires a matrix. Let’s look at the default weights matrix that was generated for our missing data patterns.
dw_na$weights
id x age z_1 z_2 z_3 y_1 y_2 y_3
1 1 0 1 1 1 1 1 1 1
2 1 0 1 1 1 1 1 1 0
3 1 1 1 1 1 1 1 0 0
4 1 1 1 1 1 0 1 1 0
5 1 1 1 1 0 0 1 1 1
6 1 0 1 1 0 0 1 0 0
Once again each row represents a missing data pattern. In the first row we see x = 0 and all other variables equal 1. This says all other variables are given equal weight in determining the missingness of x. However, unlike in the patterns matrix above, a 0 in the weights matrix means the variable is not used in determining missingness. For example, in all the patterns above, subject id influences missingness, which doesn’t make sense for an arbitrary identifier. Setting that variable’s weight to 0 suppresses it from being used to generate missingness. It’s important to note that the weights matrix is not the same as the patterns matrix: the patterns matrix determines which variables are missing, while the weights matrix determines which variables influence missingness, and to what degree.
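To build intuition for how ampute() uses these weights, here is a rough conceptual sketch of the weighted sum score idea described in Schouten et al. (2018). This is not ampute()’s exact algorithm, which works within each pattern and allocates missingness using logistic distribution functions, and the weights below are made up for illustration.
# conceptual sketch only: cases with higher weighted sum scores are more likely
# to be amputed for a given pattern (hypothetical weights on age and y_1)
w <- c(age = 4, y_1 = 1)
scores <- as.matrix(scale(dw[, names(w)])) %*% w # weighted sum of standardized values
p_amp <- plogis(scale(scores)) # higher score, higher chance of missingness
head(data.frame(score = as.numeric(scores), prob = as.numeric(p_amp)))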
Let’s modify the weights matrix to change how variables influence the missing data patterns. In row 1, we make age 4 times as influential as the other variables in determining the missingness for pattern 1. We also set id to 0 so it’s not used in generating missingness. In row 2, we make the value of z_3 4 times as influential as the other variables in determining the missingness of pattern 2. We set id to 0 in this pattern and all other patterns as well. The weights we have chosen are arbitrary; feel free to experiment with different weights. You can also set negative weights to decrease the influence of a variable. For more details on how weights are used to generate missingness, see Schouten et al. (2018).
weights <- matrix(data = c(0, 0, 4, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 4, 1, 1, 0,
0, 1, 1, 1, 3, 5, 1, 0, 0,
0, 1, 1, 1, 1, 0, 1, 5, 0,
0, 1, 6, 1, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 4, 0, 0),
nrow = 6, byrow = TRUE)
weights
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
[1,] 0 0 4 1 1 1 1 1 1
[2,] 0 0 1 1 1 4 1 1 0
[3,] 0 1 1 1 3 5 1 0 0
[4,] 0 1 1 1 1 0 1 5 0
[5,] 0 1 6 1 0 0 1 1 1
[6,] 0 0 1 1 0 0 4 0 0
Now run ampute() again with both “patterns” and “weights” matrices. This time, instead of looking at the first 6 rows, we look at the missing data patterns using the md.pattern() function from the mice package. (We set plot=FALSE to suppress a plot the function creates. Later in the article we’ll demonstrate and discuss the plot.) The bottom right corner value is the total number of missing data values (219). The numbers in the far left column are the number of rows in the data with that missing data pattern. For example, there are 20 instances of a subject missing only x. The sums at the bottom of the matrix show the number of missing values for each variable. We can see we have 65 instances of missing y at time 3.
set.seed(2525)
dw_na <- ampute(dw, patterns = patterns, weights = weights)
md.pattern(dw_na$amp, plot = FALSE)
id age z_1 y_1 z_2 y_2 z_3 x y_3
103 1 1 1 1 1 1 1 1 1 0
20 1 1 1 1 1 1 1 0 1 1
15 1 1 1 1 1 1 1 0 0 2
17 1 1 1 1 1 1 0 1 0 2
18 1 1 1 1 1 0 1 1 0 2
12 1 1 1 1 0 1 0 1 1 2
15 1 1 1 1 0 0 0 0 0 5
0 0 0 0 27 33 44 50 65 219
One thing to notice is that the frequency of the missing data patterns is roughly equivalent across the rows (see the far left column). That is probably unrealistic. This leads us to the freq argument, which allows us to dictate the relative occurrence of the missing data patterns. Below we specify a vector of proportions that sums to 1. It says that, of all cases with missing values, each of patterns 1 - 3 should occur 10% of the time, patterns 4 and 5 should each occur 20% of the time, and pattern 6 should occur 30% of the time. Notice we also add one more argument, mech, which allows us to set the missingness mechanism. The default is “MAR”, but it can be set to “MCAR” or “MNAR”.1 Now the distribution of missing data patterns is slightly more uneven and perhaps a bit more realistic.
set.seed(2525)
dw_na <- ampute(dw, patterns = patterns,
weights = weights,
freq = c(1,1,1,2,2,3)/10,
mech = "MAR")
md.pattern(dw_na$amp, plot = FALSE)
id age z_1 y_1 y_2 z_2 x z_3 y_3
104 1 1 1 1 1 1 1 1 1 0
10 1 1 1 1 1 1 1 0 0 2
15 1 1 1 1 1 1 0 1 1 1
11 1 1 1 1 1 1 0 1 0 2
22 1 1 1 1 1 0 1 0 1 2
15 1 1 1 1 0 1 1 1 0 2
23 1 1 1 1 0 0 0 0 0 5
0 0 0 0 38 45 49 55 59 246
Another argument available with ampute() is prop, which allows you to specify the proportion of missingness. The default setting is 0.5, which we decided to leave as is. See the following vignette to learn more about the ampute() function and its other arguments: Generate missing values with ampute.
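For example, if we wanted to lower the overall proportion of missingness to 0.3, we could add prop = 0.3 to the call. (Not run here; dw_na30 is just an illustrative name.)
# Not run: same patterns and weights, but a lower overall proportion of missingness
dw_na30 <- ampute(dw, patterns = patterns, weights = weights,
                  freq = c(1,1,1,2,2,3)/10, mech = "MAR", prop = 0.3)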
Now that we have generated data with missingness, we need to reshape it back to its original “long” format in order to analyze it. This time we use the tidyr function pivot_longer().
d2 <- pivot_longer(dw_na$amp,
cols = z_1:y_3,
names_to = c(".value", "time"),
names_pattern = "(.)_(.)",
names_transform = list(time = as.integer))
head(d2)
# A tibble: 6 × 6
id x age time z y
<int> <dbl> <dbl> <int> <int> <dbl>
1 1 NA 49 1 1 28.0
2 1 NA 49 2 1 27.9
3 1 NA 49 3 0 26.8
4 2 5.23 45 1 0 23.0
5 2 5.23 45 2 1 25.6
6 2 5.23 45 3 1 25.6
If we model this data set using lmer(), any rows with missing data are excluded from the analysis. Notice the number of observations drops from 600 to 369, and that the number of subjects drops from 200 to 151.
m2 <- lmer(y ~ z + time + age + x + (1|id), data = d2)
summary(m2, corr = F)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ z + time + age + x + (1 | id)
Data: d2
REML criterion at convergence: 994.8
Scaled residuals:
Min 1Q Median 3Q Max
-2.21309 -0.47236 0.02349 0.48162 2.35304
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 1.7112 1.3081
Residual 0.2741 0.5236
Number of obs: 369, groups: id, 151
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.09534 0.69874 2.999
z 1.45412 0.06930 20.982
time 0.17979 0.03558 5.053
age 0.49902 0.00989 50.457
x -0.03710 0.11582 -0.320
If we calculate the expected values of the outcome at time = 3 for each level of z, we get estimates about a full point lower. Compare these estimates to the ones from the model fit to the complete data set:
# missing data
emmeans(m2, specs = ~z, at = list(z = 0:1, time = 3))
z emmean SE df lower.CL upper.CL
0 21.3 0.125 218 21.0 21.5
1 22.7 0.125 219 22.5 23.0
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
# complete data
emmeans(m, specs = ~z, at = list(z = 0:1, time = 3))
z emmean SE df lower.CL upper.CL
0 22.6 0.0945 269 22.5 22.8
1 24.1 0.0944 268 23.9 24.3
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Will this always happen when we generate missing data, or is this just due to chance? Let’s simulate 50 different data sets with missing data, fit a model to each data set, and look at the expected values of the outcome at time = 3. First we create a function to simulate a new set of outcomes. We keep the predictor variables fixed to mimic taking samples from a population with these characteristics.
gData <- function(){
# generate response
u <- rnorm(n, sd = 1.2)
e <- rnorm(nrow(d), sd = 0.5)
d$y <- (1.2 + u[d$id]) + 1.5*d$z + 0.25*d$time + 0.5*d$age +
0.1*d$x + e
d
}
Then we create a loop that
- generates a complete data set with new responses,
- uses ampute() to generate missing data,
- fits a model to the data with missingness,
- calculates the expected value of the outcome at time 3 for each level of z.
We run this loop 50 times, each time saving the result.
expy_na <- vector(mode = "list", length = 50)
for(i in 1:50){
# generate new outcomes
d2 <- gData()
# reshape for ampute() and create missingness
dw <- pivot_wider(d2, names_from = time, values_from = c(z, y))
dw_na <- ampute(dw, patterns = patterns,
weights = weights,
freq = c(1,1,1,2,2,3)/10,
mech = "MAR")
# reshape to long for modeling
d2 <- pivot_longer(dw_na$amp, cols = z_1:y_3,
names_to = c(".value", "time"),
names_pattern = "(.)_(.)",
names_transform = list(time = as.integer))
# fit model with missing data
m2 <- lmer(y ~ z + time + age + x + (1|id), data = d2)
# extract expected values for y at time = 3 for both levels of z
expy_na[[i]] <- summary(emmeans(m2, specs = ~z, at = list(time = 3)))$emmean
}
Once the loop finishes running, we collapse the expected values into a data frame with two columns: “level” and “expected_value”. To do this we use the base R do.call() function to apply rbind() to all elements in the expy_na object, which is a list. Then we use the dplyr function rename() to rename the columns and use the pivot_longer() function to reshape the data into long format.
expy_na <- do.call("rbind", expy_na) |>
as.data.frame() |>
rename(z0 = V1, z1 = V2) |>
pivot_longer(cols = everything(),
names_to = "level",
values_to = "expected_value")
head(expy_na)
# A tibble: 6 × 2
level expected_value
<chr> <dbl>
1 z0 21.6
2 z1 23.1
3 z0 21.6
4 z1 23.1
5 z0 21.4
6 z1 23.1
Now we plot the expected values along with the original expected values from our first model above that used complete data. The red dots represent the expected values calculated from the model fit to complete data. Notice the expected values from the models fit to incomplete data are consistently biased downward.
# expected values from model with complete data
em_out <- emmeans(m, specs = ~z, at = list(time = 3)) |>
as.data.frame()
expy_all <- data.frame(level = c("z0", "z1"),
expected_value = em_out$emmean)
ggplot(expy_na) +
aes(x = level, y = expected_value) +
geom_jitter(height = 0, width = 0.2) +
geom_point(data = expy_all, color = "red")

This is one of the key reasons we use multiple imputation: to avoid biased estimates. Now let’s see how MI helps us address this problem.
Multiple Imputation for Longitudinal Data
Let’s work with the last d2 data frame created in the loop above.
head(d2)
# A tibble: 6 × 6
id x age time z y
<int> <dbl> <dbl> <int> <int> <dbl>
1 1 NA 49 1 1 26.3
2 1 NA 49 2 NA NA
3 1 NA 49 3 NA NA
4 2 5.23 45 1 0 25.1
5 2 5.23 45 2 1 27.9
6 2 5.23 45 3 1 28.2
Before we embark on multiple imputation, it’s good practice to inspect and quantify the missingness in our data. We generated the missingness, so we already know what patterns to expect, but let’s proceed as if we collected this data in the wild. The mice package provides the md.pattern() function, which we used above with plot=FALSE. Let’s run it again, but this time with plot=TRUE, which is the default. The red squares indicate missingness. Each row is a missing data pattern. The top row shows we have 357 observations with complete data. The next row shows we have 100 observations with x missing. The last row shows we have 54 observations with y, z, and x missing. The bottom row of numbers tells us how many values are missing for each variable. For example, we have 107 instances of missing y. The bottom right corner value tells us we have 380 missing values altogether.
md.pattern(d2)

id age time y z x
357 1 1 1 1 1 1 0
100 1 1 1 1 1 0 1
36 1 1 1 1 0 1 1
24 1 1 1 0 1 1 1
11 1 1 1 0 1 0 2
18 1 1 1 0 0 1 2
54 1 1 1 0 0 0 3
0 0 0 107 108 165 380
The naniar package also provides functions for visualizing missing data. One such function is vis_miss(). This visualizes all 600 x 6 cells in the d2 data frame. The thin black rectangles indicate missingness.
vis_miss(d2)

The legend indicates only 10.6% of the data is missing. But be careful: this is the percentage of data cells that are missing. From the previous plot created with the md.pattern() function, we saw we only had 357 rows of complete data. This implies that (600 - 357)/600 = 0.405, or about 40%, of rows have missing data. The 10.6% is calculated by dividing the total number of missing cells, 380, by the total number of cells, 600 x 6 = 3600 (380/3600 ≈ 0.106).
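We can verify both figures directly with base R:
mean(is.na(d2)) # proportion of missing cells, about 0.106
mean(!complete.cases(d2)) # proportion of incomplete rows, about 0.405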
Another function from the naniar package is the gg_miss_upset() function. This creates an UpSet plot. This basically visualizes the marginal counts displayed in the md.pattern() plot.
gg_miss_upset(d2)

In the bottom of the plot we see the missing data patterns. The bar plot above it visualizes the counts of those patterns. For example, the second vertical bar shows 54 observations with y, z, and x missing. The small bar plot on the left counts the number of cells with those variables missing. For example we see there are over 150 cells with x missing.
The main point to take away from these visualizations is that we have missing data on three variables and that we have six different patterns of missingness.
Now we move on to performing multiple imputation using the mice package. Recall that MI imputes data by conditioning on the data available in our data set. For example, to impute missing x values, we might try using all other available variables to create a model to predict the missing x values. To help determine which variables we should use to impute data we can use the mice function quickpred(). This function selects predictors using simple correlation statistics.
pm <- quickpred(d2)
pm
id x age time z y
id 0 0 0 0 0 0
x 0 0 1 0 0 1
age 0 0 0 0 0 0
time 0 0 0 0 0 0
z 0 0 1 1 0 1
y 1 0 1 1 1 0
The result is a matrix with one row for each variable. The columns are the variables available to use for imputing the row variable. Looking at the second row we see the quickpred() function suggests we use the age and y variables to impute x as indicated by the entries of 1. This is not surprising since we heavily weighted age and y in two of the patterns that generated missing x values in the previous section. A 0 entry means the mice package will not use the associated column variable to impute the row variable. The id, age and time rows have all zeroes since those variables have no missing data. The diagonal should contain all zeroes since a variable will never be used to impute itself.
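By default, quickpred() uses a fairly low correlation threshold (mincor = 0.1) when deciding whether a variable is a useful predictor. If we wanted a sparser predictor matrix we could raise that threshold, though we keep the default in what follows.
# Not run: require a larger correlation before a predictor is selected
quickpred(d2, mincor = 0.2)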
Since we have longitudinal data, we need to indicate the id variable is a grouping variable. We do that with an entry of -2. This code is specific to the mice package.
pm[c("x", "z", "y"), "id"] <- -2
pm
id x age time z y
id 0 0 0 0 0 0
x -2 0 1 0 0 1
age 0 0 0 0 0 0
time 0 0 0 0 0 0
z -2 0 1 1 0 1
y -2 0 1 1 1 0
Another change we’ll make is to set the y entries to 3. This says we want to use the values of y as well as the group means of y (per subject id) to impute the row variables. This is another code specific to the mice package. For details on why we might want to add group means, see section 7.5.1 of Van Buuren (2018).
pm["x", "y"] <- 3
pm["z", "y"] <- 3
pm
id x age time z y
id 0 0 0 0 0 0
x -2 0 1 0 0 3
age 0 0 0 0 0 0
time 0 0 0 0 0 0
z -2 0 1 1 0 3
y -2 0 1 1 1 0
The next thing we need to do is specify the statistical method for each imputation model. The default method is “pmm”, which stands for predictive mean matching. The PMM method always imputes values that have actually been observed in the data. For more on PMM, see section 3.4 of Van Buuren (2018). We can see the default methods by calling the make.method() function on our data frame.
meth <- make.method(d2)
meth
id x age time z y
"" "pmm" "" "" "pmm" "pmm"
Since we have longitudinal data with two levels (between subjects and within subjects), we need to specify the two-level version of predictive mean matching, “2l.pmm”. The miceadds package provides this method. Furthermore, since the x variable is a level 2 variable that stays fixed over time, we need to specify a “2lonly” method. The mice package provides a PMM method called “2lonly.pmm”. However, we elected to use the “2lonly.norm” method, which imputes level 2 data using Bayesian regression modeling. (We chose this method because it improved the quality of the imputations, which we inspect below using densityplot().) The id, age, and time variables have no missing data, so no methods need to be specified for them.
meth["x"] <- "2lonly.norm"
meth["z"] <- "2l.pmm"
meth["y"] <- "2l.pmm"
meth
id x age time z
"" "2lonly.norm" "" "" "2l.pmm"
y
"2l.pmm"
Now we’re ready to run the imputation. The m argument specifies the number of imputations we wish to run. Five imputations is the recommended minimum. White et al. (2011) suggest the number of imputations should be at least as large as the percentage of incomplete rows. In our case that would be at least 40. We elected to go with 10 to save computational time for the sake of this tutorial. The seed argument allows the imputations to be reproducible.
imp <- mice(data = d2,
m = 10,
predictorMatrix = pm,
method = meth,
print = FALSE,
seed = 5)
The mice imputation procedure utilizes a Markov Chain Monte Carlo algorithm that draws random samples from conditional densities. The details are described in section 4.5 of Van Buuren (2018). We can diagnose the quality of the sampling by checking trace plots, which we create by simply calling plot() on our imputation object. In general, we want the lines to overlap and not display any trends. These trace plots look good. There are 10 lines, one for each imputation. The x-axis shows we drew five samples for each imputation. The fifth iteration is the final sample the imputation procedure retains. The y-axis shows that the means and standard deviations of the imputations are convergent in the sense that “the variance between different sequences is no larger than the variance within each individual sequence.” (Van Buuren 2018, section 6.5.2)
plot(imp)

We can see the means that are being plotted at iteration 5 by calling colMeans() on the variable object within the imputation object. For example, here are the 10 means for the variable y being plotted at iteration 5.
colMeans(imp$imp$y)
1 2 3 4 5 6 7 8
24.66468 24.54145 24.47889 24.51582 24.48071 24.67126 24.55300 24.58549
9 10
24.61650 24.68741
It’s important to note that these are the means for just the imputed values. If we want to see the overall mean for y, we need to extract all ten imputed data sets. We can do that with the complete() function. Then we apply the mean() function to the y columns of the imputed data sets.
imp_all_data <- complete(imp, "all")
sapply(imp_all_data, function(x)mean(x$y))
1 2 3 4 5 6 7 8
23.03556 23.01359 23.00243 23.00902 23.00276 23.03674 23.01565 23.02144
9 10
23.02697 23.03962
Notice how much higher these means are than the mean computed on the data set with missingness.
mean(d2$y, na.rm = TRUE)
[1] 22.68198
And notice how much closer they are to the mean computed from the original complete data set.
mean(d$y)
[1] 23.16284
If the trace plots look problematic (lines that don’t overlap, or trends that fail to settle into a stable range), the problem may be with the predictor matrix. You can try dropping or adding certain variables to the various imputation models. Another option is to increase the number of iterations using the maxit argument. Iterations are different from imputations. Imputations refer to the number of complete data sets we generate. Iterations refer to the number of samples we ask the mice algorithm to draw for each imputation. You may need more than five iterations (the default) to reach convergence. For example, to increase iterations to 25, we would issue the following command:
# Not run
imp <- mice(data = d2,
m = 10, # 10 imputed data sets
maxit = 25, # 25 samples per imputation
predictorMatrix = pm,
method = meth,
print = FALSE,
seed = 5)
We should also check that we imputed plausible data. The densityplot() function generates smoothed density plots of our observed data and the imputed data. The blue line is the observed data and the red lines are the imputed data. These look pretty good. Our imputed data sets are in the range of the original data and exhibit similar enough distributions.
densityplot(imp)

Above we mentioned we chose the “2lonly.norm” method for imputing x. Notice what happens to the density plots for x if we use the “2lonly.pmm” method.
meth["x"] <- "2lonly.pmm"
imp2 <- mice(data = d2,
m = 10,
predictorMatrix = pm,
method = meth,
print = FALSE,
seed = 5)
densityplot(imp2)

The imputations for x are much more erratic, which is why we decided to use the “2lonly.norm” method.
Another helpful function is stripplot(). This shows one-dimensional scatter plots of an imputed variable versus its imputation number. Below we plot y using the syntax y ~ .imp. Imputation number 1 is actually the available data. Imputations 2 through 11 are the ten imputed data sets. We see that the imputed data (the red dots) are in the same range as the available data and seem like they could have been real values we might have observed.
stripplot(imp, y ~ .imp)

Now we’re ready to fit our model to all ten imputed data sets. The mice package provides a with() method that allows us to do this. The pool() function then combines the estimates from the ten analyses into a single set of estimates and standard errors. The broom.mixed package provides the tidier methods needed for this pooling to work with lme4 models.
fit.imp <- with(imp, lmer(y ~ z + time + age + x + (1|id)))
summary(pool(fit.imp))
term estimate std.error statistic df p.value
1 (Intercept) 1.2122384216 0.564606089 2.147051626 424.92686 3.235442e-02
2 z 1.4331074710 0.073869535 19.400521193 39.74088 7.125234e-22
3 time 0.2435943931 0.035920586 6.781470455 46.69113 1.808838e-08
4 age 0.5110267406 0.008437359 60.567141198 339.21859 5.830570e-184
5 x -0.0003924533 0.085629322 -0.004583165 361.77640 9.963457e-01
The pool() function implements Rubin’s rules for combining coefficients and standard errors, named for Donald Rubin who developed multiple imputation (White et al. 2011). Pooling the coefficients simply means taking the average of the ten coefficients. The fit.imp object contains the complete analyses of all ten imputed data sets.
length(fit.imp$analyses)
[1] 10
We can extract all ten coefficients by applying the fixef() function to all ten analyses. For example, let’s extract all ten coefficients for the variable z.
z_j <- sapply(fit.imp$analyses, function(x)fixef(x)["z"])
z_j
z z z z z z z z
1.513301 1.458340 1.438964 1.331407 1.403769 1.442742 1.428649 1.407148
z z
1.454071 1.452683
Now take the mean to get the coefficient reported above when we called pool().
z_hat <- mean(z_j)
z_hat
[1] 1.433107
Pooling the standard errors is a little more complicated. We need to take into account both the within-imputation variance and between-imputation variance. Let’s say our pooled coefficient is \(\hat{\theta}\). The pooled variance is calculated as
\[\text{var}(\hat{\theta}) = W + (1 + 1/m)B\]
where m is the number of imputations, \(W = (1/m)\sum_{j=1}^m\text{var}(\hat{\theta}_j)\) (i.e., the average of the variances) and \(B = (1/(m-1))\sum_{j=1}^m(\hat{\theta}_j - \hat{\theta})^2\) (i.e., the variance of the coefficients).
We can calculate these by hand as well. To calculate W, we extract the variance for z from each fitted model’s variance-covariance matrix and take the mean. The variances lie on the diagonal, hence we use the diag() function.
W <- sapply(fit.imp$analyses, function(x)diag(vcov(x))["z"]) |> mean()
To get B we simply calculate the variance of the ten z coefficients.
B <- var(z_j)
Then we use the formula above and take the square root to get the pooled standard error.
sqrt(W + (1 + 1/10)*B)
[1] 0.07386953
For a deeper dive into pooling for multiple imputation, we recommend section 2.3 of Van Buuren (2018).
Finally, we can use the emmeans package to estimate our outcome at time = 3 for both levels of z. Notice the estimates are very close to what we obtained when we analyzed the complete data set at the beginning of this article.
# imputed data
emmeans(fit.imp, specs = ~z, at = list(z = 0:1, time = 3))
z emmean SE df lower.CL upper.CL
0 22.5 0.101 151.7 22.3 22.7
1 24.0 0.109 83.8 23.8 24.2
Confidence level used: 0.95
# complete data
emmeans(m, specs = ~z, at = list(z = 0:1, time = 3))
z emmean SE df lower.CL upper.CL
0 22.6 0.0945 269 22.5 22.8
1 24.1 0.0944 268 23.9 24.3
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Hopefully this simple example gives you a foundation to learn more about multiple imputation for longitudinal and multilevel data. For a deeper dive, we recommend chapter 7 of the free online book, Flexible Imputation of Missing Data (Van Buuren 2018). Wijesuriya et al. (2025) also provide an open-access tutorial for R and Stata using both wide and long data formats as well as different methods of imputation.
R session details
The analysis was done using the R Statistical language (v4.5.2; R Core Team, 2025) on Windows 11 x64, using the packages lme4 (v1.1.37), Matrix (v1.7.4), broom.mixed (v0.2.9.6), emmeans (v2.0.0), miceadds (v3.18.36), naniar (v1.1.0), mice (v3.18.0), ggplot2 (v4.0.1), dplyr (v1.1.4) and tidyr (v1.3.1).
References
- Bates D, Maechler M, Bolker B, Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48. doi:10.18637/jss.v067.i01.
- R Core Team (2025). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Schouten RM, Lugtig P, & Vink G. (2018). Generating missing values for simulation purposes: a multivariate amputation procedure. Journal of Statistical Computation and Simulation, 88(15), 2909–2930. https://doi.org/10.1080/00949655.2018.1491577.
- Van Buuren S (2018). Flexible Imputation of Missing Data. CRC Press. https://stefvanbuuren.name/fimd/.
- Van Buuren S, Groothuis-Oudshoorn K (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. doi:10.18637/jss.v045.i03.
- Vink G. miceVignettes, https://www.gerkovink.com/miceVignettes/. Accessed 16 Nov. 2025.
- White IR, Royston P, and Wood AM (2011). “Multiple Imputation Using Chained Equations: Issues and Guidance for Practice.” Statistics in Medicine 30, no. 4, 377–99. https://doi.org/10.1002/sim.4067.
- Wijesuriya R, Moreno‐Betancur M, Carlin JB, White IR, Quartagno M, and Lee KJ. (2025). “Multiple Imputation for Longitudinal Data: A Tutorial.” Statistics in Medicine 44 (3–4): e10274. https://doi.org/10.1002/sim.10274.
Clay Ford
Statistical Research Consultant
University of Virginia Library
December 2, 2025
- Setting mech = “MCAR” uses a weights matrix with all zeroes. In other words, no other variables are used to determine missingness. Missingness is completely at random. Setting mech = “MNAR” allows you to set non-zero weights for variables that are to be deleted. This allows you to simulate missingness based on data that you did not collect.↩︎
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.