The term “multilevel data” refers to data organized in a hierarchical structure, where units of analysis are grouped into clusters. For example, in a cross-sectional study, multilevel data could consist of individual measurements of students from different schools, where students are nested within schools. In a longitudinal study, multilevel data could consist of measurements of individuals at multiple time points, where time points are nested within individuals. Multilevel data should be analyzed using multilevel models, also known as hierarchical linear models or mixed-effects models.
Multilevel modeling extends the Ordinary Least Squares (OLS) model in order to accommodate this hierarchical structure in the data. An OLS model with one predictor can be represented as:
\(y = \beta_0 + \beta_1 x_1 + e\),
where the dependent variable, \(y\), is a linear combination of the intercept \(\beta_0\), an independent variable \(x_1\) multiplied by slope \(\beta_1\), plus some error \(e\). This model assumes that \(e\) follows an independently and identically distributed (i.i.d.) normal distribution with a mean of 0 and constant variance, denoted as \(e \sim N(0, \sigma^2)\).
In multilevel data, observations within a cluster tend to be more strongly correlated with one another than with observations from other clusters. Because of this, using an OLS model that fails to account for the clustering variable can violate the i.i.d. assumption of the error term (\(e\)). One apparent fix is to simply add the clustering variable as a covariate in the OLS model (e.g., \(x_2\)). However, this approach does not properly account for the correlation of observations within clusters. Moreover, if other variables are collinear with the cluster variable, this method is particularly inappropriate: the resulting multicollinearity leads to unstable coefficient estimates and prevents a clean interpretation of each variable’s true effect. A multilevel model, by contrast, properly accounts for the correlation of observations within clusters and enables the inclusion and investigation of covariates specifically related to the clustering variable as variables of interest.
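To make the contrast concrete, here is a minimal sketch with made-up data (the object names toy, fit_ols, and fit_mlm are ours for illustration; the lmer() function comes from the lme4 package, introduced in detail below):
# Toy clustered data: 5 clusters of 10 observations each
library(lme4)
set.seed(1)
toy <- data.frame(cluster = rep(1:5, each = 10), x1 = rnorm(50))
toy$y <- 2 + 1.5*toy$x1 + rep(rnorm(5, sd = 2), each = 10) + rnorm(50)
# OLS with the cluster variable added as a dummy-coded covariate
fit_ols <- lm(y ~ x1 + factor(cluster), data = toy)
# Multilevel model with a random intercept for each cluster
fit_mlm <- lmer(y ~ x1 + (1 | cluster), data = toy)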
Multilevel data are typically described in levels. Level 1 represents the lowest unit of analysis, nested within Level 2. For instance, in our cross-sectional example of students nested within schools, the students would be Level 1 and the schools would be Level 2. If our data followed individuals across time, Level 1 would be the measurements at each time point and Level 2 would be the individuals. If we repeatedly measured students within schools, we would have three-level data, where Level 1 would be the time points, Level 2 the students, and Level 3 the schools.
The most basic multilevel model contains no predictors and is often referred to as a variance components model. The main difference between expressing an OLS model and a multilevel model is that we will now have equations for each level.
Level 1: \(y_{ij} = \beta_{0j} + e_{ij}; e_{ij} \sim N(0,\sigma^2_{e0})\),
Level 2: \(\beta_{0j} = \gamma_{00} + u_{0j}; u_{0j} \sim N(0,\tau^2_{u0})\),
Composite: \(y_{ij} = \gamma_{00} + u_{0j} + e_{ij}\).
In this model, \(y_{ij}\) is the response variable \(y\) for the \(i\)th person in the \(j\)th cluster, \(\beta_{0j}\) is the intercept which is now determined by \(\gamma_{00}\) and \(u_{0j}\). \(\gamma_{00}\) is the grand mean of \(y\). We can think of this equation as being split into two portions, a fixed portion and a random portion.
Fixed effects are constant across individuals and clusters; in other words, the parameter applies to the sample as a whole. Random means that we believe these parameters/data came from random sampling of a larger population; in other words, the random effects are not constant and vary by individual or cluster. Therefore, the fixed portion of this model is \(\gamma_{00}\), and the random portion is \(u_{0j}\) and \(e_{ij}\), since they are drawn from population distributions. \(u_{0j}\) is the cluster-dependent portion of the intercept and has a different value for each cluster, which gets added to the grand mean or overall intercept, \(\gamma_{00}\). \(u_{0j}\) is assumed to come from a normal distribution with a mean of 0 and an estimated variance, \(\tau^2_{u0}\). \(e_{ij}\) is the residual for the \(i\)th individual in the \(j\)th cluster and is normally distributed with a mean of 0 and an estimated variance, \(\sigma^2_{e0}\).
Simulating a Variance Components Model
First, let’s go through how to simulate data from a variance components model. All of the following code is written for R Statistical Software (v4.2.2; R Core Team, 2022).
First we will define the number of clusters, C, and the sample size per cluster, N:
C <- 20 # Number of clusters
N <- 50 # Sample size per cluster
Now we can set the value of our fixed parameter, \(\gamma_{00}\), and save it to an object called g00:
# Fixed
g00 <- 2 # grand mean of outcome y
Finally we will define the random values, \(u_{0j}\) and \(e_{ij}\), and save them to objects called u0j and eij, respectively. It is important to note that both variables are assumed to be drawn from normal distributions. To accomplish this, we use the rnorm() function, making sure to use set.seed() each time for replicability. Let’s start by generating u0j, which represents the cluster-specific deviation of each intercept from the fixed intercept. This means we want to draw a single value for each cluster from a normal distribution with a mean of 0 and variance of \(\tau^2_{u0}\), which we’ll call tau2 and set to 5. The u0j object will then contain one value for each cluster; we repeat each of these values N times so that each individual in the cluster has the same value.
# Random
tau2 <- 5
set.seed(707)
u0j <- rnorm(n = C, mean = 0, sd = sqrt(tau2))
u0j <- rep(x = u0j, each = N)
Next we can define eij, which is the individual-specific (Level 1) deviation from the true value of \(y\). Each person will have their own eij value, drawn from a normal distribution with a mean of 0 and variance \(\sigma^2_{e0}\), which we’ll call sigma2 and set to 3.
# Random
sigma2 <- 3 # variance of eij
set.seed(777)
eij <- rnorm(n = N*C, mean = 0, sd = sqrt(sigma2))
And finally we can follow the equation \(y_{ij} = \gamma_{00} + u_{0j} + e_{ij}\) to calculate \(y_{ij}\). We simply compute a linear combination of g00, u0j, and eij and save the result into an object called yij.
yij <- g00 + u0j + eij
Let’s test how well it worked by building a multilevel model. Before building the model, we’ll need to combine everything into a data frame. To fit the variance components model, we only need yij and the cluster identifier in the data frame, but for a reasonableness check we’ll put everything in there. First we’ll create a unique identifier for each individual case, then an identifier for the cluster they’re in. Finally, everything will be combined into a data frame. Since g00 is a single value that is the same for every individual, we repeat it for all individuals.
# Merging into a dataframe
ID <- 1:(N*C) # Creating a unique id for each individual
Cluster <- rep(x = seq(1,C), each = N) # Creating a cluster variable
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
eij = eij,
u0j = u0j,
g00 = rep(x = g00, times = N*C))
head(df)
y ID Cluster eij u0j g00
1 -0.01799895 1 1 0.8483346 -2.866334 2
2 -1.55662749 2 1 -0.6902939 -2.866334 2
3 0.01846090 3 1 0.8847945 -2.866334 2
4 -1.55709630 4 1 -0.6907627 -2.866334 2
5 1.97195395 5 1 2.8382875 -2.866334 2
6 0.20974452 6 1 1.0760781 -2.866334 2
tail(df)
y ID Cluster eij u0j g00
995 5.947669 995 20 1.56876405 2.378905 2
996 3.524053 996 20 -0.85485166 2.378905 2
997 3.389301 997 20 -0.98960329 2.378905 2
998 4.355839 998 20 -0.02306619 2.378905 2
999 4.185742 999 20 -0.19316325 2.378905 2
1000 3.980374 1000 20 -0.39853087 2.378905 2
We can see that we have unique individuals assigned to different clusters and we have the exact number of rows we anticipated (20 x 50 = 1,000). Within a cluster, each individual has the same value for u0j but different values for y and eij, and across clusters every individual has the same value for g00. Now let’s build a variance components model. To do this, we will need to load the lme4 package (v1.1-32; Bates et al., 2015) to use the lmer() function, which will create the model.
library(lme4)
The lmer() function has two main components: the formula and the data. The formula follows the same “outcome ~ predictors” format as the commonly used lm() function. The predictors section accommodates standard operators (e.g., +, *, :, etc.) along with an additional term to define the level(s) or cluster variable(s). For a variance components model, we have no predictors and specify the random intercept as conditional on the cluster variable using (1 | Cluster). This notation can be read as follows: “the model intercept, 1, is conditional on the cluster.”
# Building a model
m0 <- lmer(y ~ 1 + (1 | Cluster), data = df)
summary(m0)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ 1 + (1 | Cluster)
Data: df
REML criterion at convergence: 4011.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.7718 -0.6717 0.0068 0.6572 3.3023
Random effects:
Groups Name Variance Std.Dev.
Cluster (Intercept) 5.930 2.435
Residual 2.954 1.719
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.3557 0.5472 4.305
The output of this model contains several sections. We want to see how closely the model output resembles our simulated conditions.
First we’ll look at the “Scaled residuals:” section to determine if our error term follows a normal distribution with a mean of 0. We see that the center of the distribution, indicated by the “Median”, is 0.0068. Additionally, the absolute values of the first and third quartiles are roughly equivalent, and the absolute values of the minimum and maximum are roughly equivalent. This suggests that the model error comes from a normal distribution (symmetric with a center of 0), however, in practice, follow-up diagnostics should be conducted.
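In practice, those diagnostics might include a Q-Q plot and a histogram of the residuals; a minimal sketch using base R graphics (the resid() extractor works on models fit with lmer()):
# Visual checks on the Level 1 residuals of m0
qqnorm(resid(m0)); qqline(resid(m0)) # points should track the line if normal
hist(resid(m0)) # should look roughly symmetric around 0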
Next we look at the “Random effects:” portion. We simulated \(\tau^2_{u0}\) equal to 5. Comparing this with the variance of the “Cluster (Intercept)”, which is 5.930, we find that the estimated variance is in a reasonable range compared to the true value of 5. Likewise, we simulated our data with \(\sigma^2_{e0}\) equal to 3, and the estimated variance of the “Residual” in the output is reported as 2.954, indicating a reasonably close match.
Finally, in the “Fixed effects:” section we’ll look for the grand mean of \(y\), denoted as \(\gamma_{00}\). We simulated this value as 2; the recovered estimate of 2.3557 is within one standard error (0.5472) of the true value.
Overall, the output of the model is very close to our simulated conditions.
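If you plan to repeat a simulation like this many times, it helps to extract these quantities programmatically rather than reading them off the summary. A minimal sketch using lme4’s fixef() and VarCorr() extractors:
# Pull estimates out of m0 to compare against the true values
fixef(m0) # estimated grand mean (true g00 = 2)
as.data.frame(VarCorr(m0)) # estimated tau2 and sigma2 (true values 5 and 3)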
Adding Predictors
To the variance components model, we can introduce both Level 1 and Level 2 predictors. A Level 1 predictor is any data specific to the individual units at Level 1. For instance, in a two-level cross-sectional model with students at Level 1 and schools at Level 2, a Level 1 predictor could be the GRE scores of each student. A Level 2 predictor is any data specific to the Level 2 units, such as average classroom size for each school. Introducing both a Level 1 and a Level 2 predictor into the model modifies the above equations to:
Level 1: \(y_{ij} = \beta_{0j} + \beta_{1} X_{1ij} + e_{ij}; e_{ij} \sim N(0,\sigma^2_{e0})\),
Level 2: \(\beta_{0j} = \gamma_{00} + \gamma_{01} Z_j + u_{0j}; u_{0j} \sim N(0,\tau^2_{u0})\),
Composite: \(y_{ij} = \gamma_{00} + \gamma_{01} Z_j + u_{0j} + \beta_{1} X_{1ij} + e_{ij}\).
The Level 1 equation includes the Level 1 predictor, \(X_{1ij}\) with its corresponding coefficient, \(\beta_1\). The Level 2 equation now incorporates the Level 2 predictor, \(Z_j\), and its coefficient, \(\gamma_{01}\). The intercept \(\beta_{0j}\) is determined by the combination of the fixed intercept parameters \(\gamma_{00}\) and \(\gamma_{01}\), as well as the random cluster-specific deviation \(u_{0j}\).
\(\gamma_{00}\) is now the expected value of \(y_{ij}\) when all other predictors are 0. \(\gamma_{01}\) is the mean difference between clusters that differ by one unit on \(Z_j\). \(u_{0j}\) is still assumed to follow a normal distribution with a mean of 0 and estimated variance \(\tau^2_{u0}\) and is assumed to be uncorrelated with \(X, Z,\) and \(e_{ij}\); it can be interpreted as the Level 2 group variation around \(\gamma_{00}\). Finally, \(e_{ij}\) represents the normally distributed individual error term with a mean of 0 and estimated variance of \(\sigma^2_{e0}\). Here, the fixed part of the equation is \(\gamma_{00} + \beta_{1} X_{1ij} + \gamma_{01} Z_j\), and the random portion is still \(u_{0j} + e_{ij}\). Notice that this random portion applies only to the intercept, not the slope.
Let’s simulate data to follow these equations, a random intercept model with a Level 1 predictor and a Level 2 predictor. First we’ll define the sample size.
N <- 50 # Sample size per cluster
C <- 20 # Number of clusters
Then we’ll define the fixed portion of our model. We’ll set \(\gamma_{00}\), \(\gamma_{01}\), and \(\beta_1\) to single values as was done for the variance components model. However, \(Z_j\) and \(X_{1ij}\) will be drawn randomly from normal distributions and saved to objects called Zj and X1ij, respectively. Since we want to simulate these data as having meaningful differences between clusters, we draw these variables from normal distributions separately for each cluster.
# Fixed
g00 <- 2
g01 <- 3
set.seed(111)
Zj <- as.vector(replicate(n = C,
expr = rnorm(1, mean = 0, sd = 3))) # Level 2 predictor
Zj <- rep(x = Zj, each = N)
B1 <- 4
set.seed(555)
X1ij <- as.vector(replicate(n = C,
expr = rnorm(N, 0, 1))) # Level 1 predictor
Finally, we’ll simulate the random portion of our model. We’ll create tau2, sigma2, u0j, and eij the same way we did for the variance components model.
# Random
tau2 <- 5
sigma2 <- 3
set.seed(707)
u0j <- rnorm(n = C, mean = 0, sd = sqrt(tau2))
u0j <- rep(x = u0j, each = N)
set.seed(111)
eij <- rnorm(n = N*C, mean = 0, sd = sqrt(sigma2))
Now we can create yij again using a simple linear combination of all our created variables. Then we can merge everything into a data frame and conduct a reasonableness check on the simulated data.
yij <- g00 + g01*Zj + u0j + B1*X1ij + eij
## Merging into a dataframe
ID <- 1:(N*C)
Cluster <- rep(x = seq(1,C), each = N)
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
g00 = rep(x = g00, times = N*C),
g01 = rep(x = g01, times = N*C),
Z = Zj,
u0j = u0j,
B1 = rep(x = B1, times = N*C),
X = X1ij,
eij = eij)
head(df)
y ID Cluster g00 g01 Z u0j B1 X eij
1 0.3386258 1 1 2 3 0.7056621 -2.866334 4 -0.3298603 0.4074142
2 2.6923872 2 1 2 3 0.7056621 -2.866334 4 0.5036464 -0.5728513
3 2.2083828 3 1 2 3 0.7056621 -2.866334 4 0.3743696 -0.5397483
4 4.8173524 4 1 2 3 0.7056621 -2.866334 4 1.8886198 -3.9877797
5 -6.1649241 5 1 2 3 0.7056621 -2.866334 4 -1.7799027 -0.2959660
6 5.0361463 6 1 2 3 0.7056621 -2.866334 4 0.8856311 0.2429690
tail(df)
y ID Cluster g00 g01 Z u0j B1 X eij
995 5.209822 995 20 2 3 1.09256 2.378905 4 -0.80014180 0.7538043
996 5.796583 996 20 2 3 1.09256 2.378905 4 -0.07395561 -1.5641802
997 4.734462 997 20 2 3 1.09256 2.378905 4 -0.83338705 0.4114252
998 4.936531 998 20 2 3 1.09256 2.378905 4 -1.02491019 1.3795863
999 12.883061 999 20 2 3 1.09256 2.378905 4 1.18937051 0.4689932
1000 6.658694 1000 20 2 3 1.09256 2.378905 4 -0.18230970 -0.2686525
We can see that we have unique individuals assigned to different clusters, we have the exact number of rows we anticipated (20 x 50 = 1,000), each individual has the same value for g00, g01, and B1, each cluster has a different value for Z and u0j, and each individual has a different value for y, X, and eij. We can now use these data to fit a model with two predictors and a random intercept:
m1 <- lmer(y ~ 1 + Z + X + (1 | Cluster), data = df)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ 1 + Z + X + (1 | Cluster)
Data: df
REML criterion at convergence: 4005.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.4606 -0.6833 0.0414 0.6487 3.0522
Random effects:
Groups Name Variance Std.Dev.
Cluster (Intercept) 5.618 2.370
Residual 2.927 1.711
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.17320 0.57279 3.794
Z 2.81310 0.19821 14.193
X 4.02161 0.05583 72.035
Correlation of Fixed Effects:
(Intr) Z
Z 0.367
X 0.006 0.009
As before, we will begin by looking at the “Scaled residuals:” section. It reveals that the error distribution has a median of 0.0414, accompanied by roughly equivalent absolute values for the first and third quartiles as well as roughly equivalent absolute values for the minimum and maximum. This consistency suggests that the error term follows a normal distribution with a mean of 0, but in practice further testing should be conducted to confirm this.
In the “Random effects:” section, the variance of “Cluster (Intercept)” is 5.618, very close to tau2 with a true value of 5. The variance for “Residual” is 2.927, extremely close to the true sigma2 value of 3.
Moving to the “Fixed effects:” section, we look at the estimates for \(\gamma_{00}\), \(\gamma_{01}\), and \(\beta_1\). The “Estimate” for “(Intercept)” is 2.17320, close to the true value of 2 for g00. The “Estimate” for “Z” is 2.81310, close to the true g01 value of 3, and for “X” the estimate of 4.02161 is again very close to the true value of 4 for B1.
The final section, “Correlation of Fixed Effects:”, shows the correlation between the fixed effects. The correlation between X and Z is 0.009. As stated, this refers specifically to the correlation between the fixed-effect estimates, not to the correlation between the variables themselves. For more information on what this means, see this article on the correlation of fixed effects.
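If you want the full correlation matrix of the fixed-effect estimates, rather than the lower triangle printed in the summary, one option (a sketch, not the only route) is to convert the estimated covariance matrix of the fixed effects:
# Correlation matrix of the fixed-effect estimates from m1
cov2cor(as.matrix(vcov(m1)))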
Simulating a Random Slope
Now let’s go through how to simulate data coming from a model with a random slope. We will consider a model with one Level 1 predictor and both a random intercept and random slope. The equations are as follows:
Level 1: \(y_{ij} = \beta_{0j} + \beta_{1j} X_{1ij} + e_{ij}; e_{ij} \sim N(0,\sigma^2_{e0})\),
Level 2: \(\beta_{0j} = \gamma_{00} + u_{0j}; u_{0j} \sim N(0, \tau^2_{u0})\); \(\beta_{1j} = \gamma_{01} + u_{1j}; u_{1j} \sim N(0, \tau^2_{u1})\),
Composite: \(y_{ij} = \gamma_{00} + u_{0j} + (\gamma_{01} + u_{1j}) \times X_{1ij} + e_{ij}\).
The main difference between this model and the above random intercept model is the introduction of \(\beta_{1j}\), which varies by cluster and is determined by \(\gamma_{01}\) (the fixed effect of \(X_{1ij}\), i.e., the average slope across all clusters) and \(u_{1j}\), the random effect of \(X_{1ij}\), which comes from a normal distribution with a mean of 0 and an estimated variance of \(\tau^2_{u1}\).
We will use the same sample size and number of clusters as before.
N <- 50
C <- 20
Next we will set the values for the fixed portion of the model as before.
# Fixed
g00 <- 2
g01 <- 5
And finally we will simulate the random portion of the model.
# Random
tau2_u0 <- 5
set.seed(707)
u0j <- rnorm(n = C, mean = 0, sd = sqrt(tau2_u0))
u0j <- rep(x = u0j, each = N)
b0j <- g00 + u0j
tau2_u1 <- 3
set.seed(5707)
u1j <- rnorm(n = C, mean = 0, sd = sqrt(tau2_u1))
u1j <- rep(x = u1j, each = N)
b1j <- g01 + u1j
sigma2 <- 3
set.seed(111)
eij <- rnorm(n = N*C, mean = 0, sd = sqrt(sigma2))
Next we can create a Level 1 predictor called X1ij.
# Predictor
set.seed(555)
X1ij <- as.vector(replicate(n = C,
expr = rnorm(N, 0, 1))) # Level 1 predictor
Finally, we will combine all of this to create an outcome variable, yij.
yij <- b0j + b1j*X1ij + eij
As before, we will create a data frame containing all these simulated values.
# Merging into a dataframe
ID <- 1:(N*C) # Creating a unique id for each individual
Cluster <- rep(x = seq(1,C), each = N) # Creating a cluster variable
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
eij = eij,
u0j = u0j,
g00 = rep(x = g00, times = N*C),
u1j = u1j,
g01 = rep(x = g01, times = N*C),
b0j = b0j,
b1j = b1j,
X = X1ij)
head(df)
y ID Cluster eij u0j g00 u1j g01 b0j
1 -2.3706838 1 1 0.4074142 -2.866334 2 0.7956791 5 -0.8663336
2 1.4797882 2 1 -0.5728513 -2.866334 2 0.7956791 5 -0.8663336
3 0.7636439 3 1 -0.5397483 -2.866334 2 0.7956791 5 -0.8663336
4 6.0917210 4 1 -3.9877797 -2.866334 2 0.7956791 5 -0.8663336
5 -11.4780445 5 1 -0.2959660 -2.866334 2 0.7956791 5 -0.8663336
6 4.5094691 6 1 0.2429690 -2.866334 2 0.7956791 5 -0.8663336
b1j X
1 5.795679 -0.3298603
2 5.795679 0.5036464
3 5.795679 0.3743696
4 5.795679 1.8886198
5 5.795679 -1.7799027
6 5.795679 0.8856311
tail(df)
y ID Cluster eij u0j g00 u1j g01 b0j
995 1.0058358 995 20 0.7538043 2.378905 2 0.1576774 5 4.378905
996 2.4332853 996 20 -1.5641802 2.378905 2 0.1576774 5 4.378905
997 0.4919884 997 20 0.4114252 2.378905 2 0.1576774 5 4.378905
998 0.4723349 998 20 1.3795863 2.378905 2 0.1576774 5 4.378905
999 10.9822873 999 20 0.4689932 2.378905 2 0.1576774 5 4.378905
1000 3.1699576 1000 20 -0.2686525 2.378905 2 0.1576774 5 4.378905
b1j X
995 5.157677 -0.80014180
996 5.157677 -0.07395561
997 5.157677 -0.83338705
998 5.157677 -1.02491019
999 5.157677 1.18937051
1000 5.157677 -0.18230970
Now we can fit a model with both a random intercept and a random slope for X, specifying the random portion as (1 + X | Cluster):
m2 <- lmer(yij ~ X + (1 + X | Cluster), data = df)
summary(m2)
Linear mixed model fit by REML ['lmerMod']
Formula: yij ~ X + (1 + X | Cluster)
Data: df
REML criterion at convergence: 4040.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.3414 -0.6460 0.0103 0.6403 3.2936
Random effects:
Groups Name Variance Std.Dev. Corr
Cluster (Intercept) 5.631 2.373
X 1.138 1.067 0.17
Residual 2.868 1.694
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.3595 0.5334 4.424
X 5.1799 0.2451 21.135
Correlation of Fixed Effects:
(Intr)
X 0.161
The “Scaled residuals:” section suggests a symmetric distribution centered at 0. As before, confirming that this is indeed a normal distribution would require follow-up diagnostics.
In the “Random effects:” section, the variance of “Cluster (Intercept)” is 5.631, very close to the true tau2_u0 value of 5. The variance of “X” is 1.138, which is not as close to the true tau2_u1 value of 3. Lastly in this section, the variance for “Residual” is 2.868, quite close to the true sigma2 value of 3.
Turning to the “Fixed effects:” section, we look at the estimates for \(\gamma_{00}\) and \(\gamma_{01}\). The “Estimate” for “(Intercept)” is 2.3595, close to the true value of 2 for g00. The “Estimate” for “X” is 5.1799, very close to the true g01 value of 5.
Notice that the “Random effects:” section now includes a correlation between the random intercept and random slope, estimated at 0.17. Despite this positive estimate, we didn’t simulate our data with correlated random effects; the estimated correlation is simply a point estimate. To get a better sense of its uncertainty, we can compute confidence intervals for the random effects using the confint() function. Setting parm = "theta_" restricts the confidence interval calculations to the random effects, and setting oldNames = FALSE provides easier-to-read names in the output.
confint(object = m2, parm = "theta_", oldNames = FALSE)
Computing profile confidence intervals ...
2.5 % 97.5 %
sd_(Intercept)|Cluster 1.7388424 3.2756336
cor_X.(Intercept)|Cluster -0.2846315 0.5567918
sd_X|Cluster 0.7661714 1.4878505
sigma 1.6205691 1.7722665
Notice that the confidence interval on the correlation, [-0.28, 0.56], is wide and comfortably spans 0, consistent with the fact that the data were simulated without correlated random effects.
Above, we drew \(u_{0j}\) and \(u_{1j}\) independently from separate populations with no correlation, so the estimate of 0.17 is purely noise. But suppose we want to simulate data with a non-zero correlation between the random intercept and random slope. Instead of creating each variable separately using rnorm(), we can draw them jointly from a multivariate normal distribution with a specified covariance. This is easily done with the mvrnorm() function from the MASS package (v7.3-58.3; Venables & Ripley, 2002).
The mvrnorm() function takes sample size and mean arguments, similar to the rnorm() function; however, instead of an “sd” argument it has “Sigma”, the covariance matrix of the random variables. For this argument, we will create a matrix called var_mat containing the variances of \(u_{0j}\) and \(u_{1j}\) on the diagonal and the covariance between them in the off-diagonal elements. We will set their covariance to 1. Finally, we can use the cov2cor() function to see this matrix expressed as correlations, since a correlation is what we’ll get back in the model summary.
# Random
tau2_u0 <- 5
tau2_u1 <- 3
Cov_u1j.u0j <- 1
var_mat <- matrix(data = c(tau2_u0, Cov_u1j.u0j,
Cov_u1j.u0j, tau2_u1), nrow = 2, ncol = 2)
var_mat
[,1] [,2]
[1,] 5 1
[2,] 1 3
cov2cor(var_mat)
[,1] [,2]
[1,] 1.0000000 0.2581989
[2,] 0.2581989 1.0000000
Now we can load the MASS package, run the mvrnorm() function, and save the results into a matrix called u. u will have two columns, one for each of \(u_{0j}\) and \(u_{1j}\), and 20 rows, one for each cluster. To double-check that everything is working as expected, let’s compare the variance-covariance matrix of the simulated variables to the var_mat values and its equivalent correlation matrix.
library(MASS)
set.seed(609)
u <- mvrnorm(n = C, mu = c(0, 0), Sigma = var_mat)
var(u)
[,1] [,2]
[1,] 5.458752 1.287149
[2,] 1.287149 3.161914
cor(u)
[,1] [,2]
[1,] 1.0000000 0.3098185
[2,] 0.3098185 1.0000000
The values are very similar to those from var_mat and cov2cor(var_mat). To create u0j and u1j, we pull each column from the u matrix. Then we repeat the values of these vectors N times, as before, so that individuals in each cluster share the same value.
u0j <- u[,1]
u1j <- u[,2]
u0j <- rep(x = u0j, each = N)
u1j <- rep(x = u1j, each = N)
From these values, we will need to recalculate b0j, b1j, and yij, and rebuild df to reflect them.
b0j <- g00 + u0j
b1j <- g01 + u1j
yij <- b0j + b1j*X1ij + eij
# Merging into a dataframe
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
eij = eij,
u0j = u0j,
g00 = rep(x = g00, times = N*C),
u1j = u1j,
g01 = rep(x = g01, times = N*C),
b0j = b0j,
b1j = b1j,
X = X1ij)
And finally, we can create another model with these new data.
m3 <- lmer(yij ~ X + (1 + X | Cluster), data = df)
summary(m3)
Linear mixed model fit by REML ['lmerMod']
Formula: yij ~ X + (1 + X | Cluster)
Data: df
REML criterion at convergence: 4057.3
Scaled residuals:
Min 1Q Median 3Q Max
-3.3610 -0.6590 0.0007 0.6611 3.3852
Random effects:
Groups Name Variance Std.Dev. Corr
Cluster (Intercept) 5.362 2.316
X 3.104 1.762 0.26
Residual 2.868 1.694
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.7071 0.5206 5.20
X 5.4977 0.3980 13.81
Correlation of Fixed Effects:
(Intr)
X 0.259
We can see that, compared to m2, the estimates are fairly similar and that the correlation between the random effects is now 0.26, matching the correlation of about 0.26 implied by var_mat (see cov2cor(var_mat) above).
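As one last reasonableness check, we can place the estimated random-effects parameters next to the values we simulated; a minimal sketch:
# Compare estimated random-effect SDs and correlation to the simulated ones
as.data.frame(VarCorr(m3)) # the sdcor column holds the SDs and the correlation
cov2cor(var_mat) # true correlation implied by the simulated covariance matrix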
References
- Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
- R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.r-project.org/
- Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.
Laura Jamison
StatLab Associate
University of Virginia Library
September 13, 2023
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.