The term “multilevel data” refers to data organized in a hierarchical structure, where units of analysis are grouped into clusters. For example, in a cross-sectional study, multilevel data could consist of individual measurements of students from different schools, where students are nested within schools. In a longitudinal study, multilevel data could consist of measurements of individuals at multiple time points, where time points are nested within individuals. Multilevel data should be analyzed using multilevel models, also known as hierarchical linear models or mixed-effects models.
Multilevel modeling extends the Ordinary Least Squares (OLS) model in order to accommodate this hierarchical structure in the data. An OLS model with one predictor can be represented as:
\(y = \beta_0 + \beta_1 x_1 + e\),
where the dependent variable, \(y\), is a linear combination of the intercept \(\beta_0\), an independent variable \(x_1\) multiplied by slope \(\beta_1\), plus some error \(e\). This model assumes that \(e\) follows an independently and identically distributed (i.i.d.) normal distribution with a mean of 0 and constant variance, denoted as \(e \sim N(0, \sigma^2)\).
In multilevel data, observations within a cluster tend to be more strongly correlated with one another than with observations from other clusters. Because of this, using an OLS model that fails to account for the clustering variable can violate the i.i.d. assumption of the error term (\(e\)). One apparent fix is to simply add the clustering variable as a covariate in the OLS model (e.g., \(x_2\)). However, this approach does not properly account for the correlation of observations within clusters. Moreover, if other variables are collinear with the cluster variable, this method is particularly inappropriate: the resulting multicollinearity leads to unstable coefficient estimates and prevents a clean interpretation of each variable’s true effect. A multilevel model, by contrast, properly accounts for the correlation of observations within clusters and enables the inclusion and investigation of covariates specifically related to the clustering variable as variables of interest.
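To make the contrast concrete, here is a minimal sketch with made-up data (the object names toy, fit_ols, and fit_mlm are ours for illustration; the lmer() function comes from the lme4 package, introduced in detail below):
# Toy clustered data: 5 clusters of 10 observations each
library(lme4)
set.seed(1)
toy <- data.frame(cluster = rep(1:5, each = 10), x1 = rnorm(50))
toy$y <- 2 + 1.5*toy$x1 + rep(rnorm(5, sd = 2), each = 10) + rnorm(50)
# OLS with the cluster variable added as a dummy-coded covariate
fit_ols <- lm(y ~ x1 + factor(cluster), data = toy)
# Multilevel model with a random intercept for each cluster
fit_mlm <- lmer(y ~ x1 + (1 | cluster), data = toy)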
Multilevel data are typically described in levels. Level 1 represents the lowest unit of analysis, nested within Level 2. For instance, in our cross-sectional example of students nested within schools, the students would be Level 1 and the schools would be Level 2. If our data followed individuals across time, Level 1 would be the measurements at each time point and Level 2 would be the individuals. If we repeatedly measured students within schools, we would have three-level data, where Level 1 would be the time points, Level 2 the students, and Level 3 the schools.
The most basic multilevel model contains no predictors and is often referred to as a variance components model. The main difference between expressing an OLS model and a multilevel model is that we will now have equations for each level.
Level 1: \(y_{ij} = \beta_{0j} + e_{ij}; e_{ij} \sim N(0,\sigma^2_{e0})\),
Level 2: \(\beta_{0j} = \gamma_{00} + u_{0j}; u_{0j} \sim N(0,\tau^2_{u0})\),
Composite: \(y_{ij} = \gamma_{00} + u_{0j} + e_{ij}\).
In this model, \(y_{ij}\) is the response variable \(y\) for the \(i\)th person in the \(j\)th cluster, \(\beta_{0j}\) is the intercept which is now determined by \(\gamma_{00}\) and \(u_{0j}\). \(\gamma_{00}\) is the grand mean of \(y\). We can think of this equation as being split into two portions, a fixed portion and a random portion.
Fixed effects are constant across individuals and clusters; in other words, the parameter applies to the sample as a whole. Random means that we believe these parameters/data came from random sampling of a larger population; in other words, the random effects are not constant and vary by individual or cluster. Therefore, the fixed portion of this model is \(\gamma_{00}\), and the random portion is \(u_{0j}\) and \(e_{ij}\), since they are drawn from population distributions. \(u_{0j}\) is the cluster-dependent portion of the intercept and has a different value for each cluster, which gets added to the grand mean or overall intercept, \(\gamma_{00}\). \(u_{0j}\) is assumed to come from a normal distribution with a mean of 0 and an estimated variance, \(\tau^2_{u0}\). \(e_{ij}\) is the residual for the \(i\)th individual in the \(j\)th cluster and is normally distributed with a mean of 0 and an estimated variance, \(\sigma^2_{e0}\).
Simulating a Variance Components Model
First, let’s go through how to simulate data from a variance components model. All of the following code is written for R Statistical Software (v4.2.2; R Core Team, 2022).
First we will define the number of clusters, C, and the sample size per cluster, N:
C <- 20 # Number of clusters
N <- 50 # Sample size per cluster
Now we can set the value of our fixed parameter, \(\gamma_{00}\), and save it to an object called g00:
# Fixed
g00 <- 2 # grand mean of outcome y
Finally we will define the random values, \(u_{0j}\) and \(e_{ij}\), and save them to objects called u0j and eij, respectively. It is important to note that both variables are assumed to be drawn from normal distributions. To accomplish this, we use the rnorm() function, making sure to use set.seed() each time for replicability. Let’s start by generating u0j, which represents the cluster-specific deviation of each intercept from the fixed intercept. This means we want to draw a single value for each cluster from a normal distribution with a mean of 0 and variance of \(\tau^2_{u0}\), which we’ll call tau2 and set to 5. The u0j object will then contain one value for each cluster; we repeat each of these values N times so that each individual in the cluster has the same value.
# Random
tau2 <- 5
set.seed(707)
u0j <- rnorm(n = C, mean = 0, sd = sqrt(tau2))
u0j <- rep(x = u0j, each = N)
Next we can define eij, which is the individual-specific (Level 1) deviation from the true value of \(y\). Each person will have their own eij value, drawn from a normal distribution with a mean of 0 and variance \(\sigma^2_{e0}\), which we’ll call sigma2 and set to 3.
# Random
sigma2 <- 3 # variance of eij
set.seed(777)
eij <- rnorm(n = N*C, mean = 0, sd = sqrt(sigma2))
And finally we can follow the equation \(y_{ij} = \gamma_{00} + u_{0j} + e_{ij}\) to calculate \(y_{ij}\). We simply compute a linear combination of g00, u0j, and eij and save the result into an object called yij.
yij <- g00 + u0j + eij
Let’s test how well it worked by building a multilevel model. Before building the model, we’ll need to combine everything into a data frame. To fit the variance components model, we only need yij and the cluster identifier in the data frame, but for a reasonableness check we’ll put everything in there. First we’ll create a unique identifier for each individual case, then an identifier for the cluster they’re in. Finally, everything will be combined into a data frame. Since g00 is a single value that is the same for every individual, we repeat it for all individuals.
# Merging into a dataframe
ID <- 1:(N*C) # Creating a unique id for each individual
Cluster <- rep(x = seq(1,C), each = N) # Creating a cluster variable
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
eij = eij,
u0j = u0j,
g00 = rep(x = g00, times = N*C))
head(df)
y ID Cluster eij u0j g00
1 -0.01799895 1 1 0.8483346 -2.866334 2
2 -1.55662749 2 1 -0.6902939 -2.866334 2
3 0.01846090 3 1 0.8847945 -2.866334 2
4 -1.55709630 4 1 -0.6907627 -2.866334 2
5 1.97195395 5 1 2.8382875 -2.866334 2
6 0.20974452 6 1 1.0760781 -2.866334 2
tail(df)
y ID Cluster eij u0j g00
995 5.947669 995 20 1.56876405 2.378905 2
996 3.524053 996 20 -0.85485166 2.378905 2
997 3.389301 997 20 -0.98960329 2.378905 2
998 4.355839 998 20 -0.02306619 2.378905 2
999 4.185742 999 20 -0.19316325 2.378905 2
1000 3.980374 1000 20 -0.39853087 2.378905 2
We can see that we have unique individuals assigned to different clusters and we have the exact number of rows we anticipated (20 x 50 = 1,000). Within a cluster, each individual has the same value for u0j but different values for y and eij, and across clusters every individual has the same value for g00. Now let’s build a variance components model. To do this, we will need to load the lme4 package (v1.1-32; Bates et al., 2015) to use the lmer() function, which will create the model.
library(lme4)
The lmer() function has two main components: the formula and the data. The formula follows the same “outcome ~ predictors” format as the commonly used lm() function. The predictors section accommodates standard operators (e.g., +, *, :, etc.) along with an additional term to define the level(s) or cluster variable(s). For a variance components model, we have no predictors and specify the random intercept as conditional on the cluster variable using (1 | Cluster). This notation can be read as follows: “the model intercept, 1, is conditional on the cluster.”
# Building a model
m0 <- lmer(y ~ 1 + (1 | Cluster), data = df)
summary(m0)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ 1 + (1 | Cluster)
Data: df
REML criterion at convergence: 4011.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.7718 -0.6717 0.0068 0.6572 3.3023
Random effects:
Groups Name Variance Std.Dev.
Cluster (Intercept) 5.930 2.435
Residual 2.954 1.719
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.3557 0.5472 4.305
The output of this model contains several sections. We want to see how closely the model output resembles our simulated conditions.
First we’ll look at the “Scaled residuals:” section to determine if our error term follows a normal distribution with a mean of 0. We see that the center of the distribution, indicated by the “Median”, is 0.0068. Additionally, the absolute values of the first and third quartiles are roughly equivalent, and the absolute values of the minimum and maximum are roughly equivalent. This suggests that the model error comes from a normal distribution (symmetric with a center of 0), however, in practice, follow-up diagnostics should be conducted.
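In practice, those diagnostics might include a Q-Q plot and a histogram of the residuals; a minimal sketch using base R graphics (the resid() extractor works on models fit with lmer()):
# Visual checks on the Level 1 residuals of m0
qqnorm(resid(m0)); qqline(resid(m0)) # points should track the line if normal
hist(resid(m0)) # should look roughly symmetric around 0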
Next we look at the “Random effects:” portion. We simulated \(\tau^2_{u0}\) equal to 5. Comparing this with the variance of the “Cluster (Intercept)”, which is 5.930, we find that the estimated variance is in a reasonable range compared to the true value of 5. Likewise, we simulated our data with \(\sigma^2_{e0}\) equal to 3, and the estimated variance of the “Residual” in the output is reported as 2.954, indicating a reasonably close match.
Finally, in the “Fixed effects:” section we’ll look for the grand mean of \(y\), denoted as \(\gamma_{00}\). We simulated this value as 2; the recovered estimate of 2.3557 is within one standard error (0.5472) of the true value.
Overall, the output of the model is very close to our simulated conditions.
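If you plan to repeat a simulation like this many times, it helps to extract these quantities programmatically rather than reading them off the summary. A minimal sketch using lme4’s fixef() and VarCorr() extractors:
# Pull estimates out of m0 to compare against the true values
fixef(m0) # estimated grand mean (true g00 = 2)
as.data.frame(VarCorr(m0)) # estimated tau2 and sigma2 (true values 5 and 3)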
Adding Predictors
To the variance components model, we can introduce both Level 1 and Level 2 predictors. A Level 1 predictor is any data specific to the individual units at Level 1. For instance, in a two-level cross-sectional model with students at Level 1 and schools at Level 2, a Level 1 predictor could be the GRE scores of each student. A Level 2 predictor is any data specific to the Level 2 units, such as average classroom size for each school. Introducing both a Level 1 and a Level 2 predictor into the model modifies the above equations to:
Level 1: \(y_{ij} = \beta_{0j} + \beta_{1} X_{1ij} + e_{ij}; e_{ij} \sim N(0,\sigma^2_{e0})\),
Level 2: \(\beta_{0j} = \gamma_{00} + \gamma_{01} Z_j + u_{0j}; u_{0j} \sim N(0,\tau^2_{u0})\),
Composite: \(y_{ij} = \gamma_{00} + \gamma_{01} Z_j + u_{0j} + \beta_{1} X_{1ij} + e_{ij}\).
The Level 1 equation includes the Level 1 predictor, \(X_{1ij}\) with its corresponding coefficient, \(\beta_1\). The Level 2 equation now incorporates the Level 2 predictor, \(Z_j\), and its coefficient, \(\gamma_{01}\). The intercept \(\beta_{0j}\) is determined by the combination of the fixed intercept parameters \(\gamma_{00}\) and \(\gamma_{01}\), as well as the random cluster-specific deviation \(u_{0j}\).
\(\gamma_{00}\) is now the expected value of \(y_{ij}\) when all other predictors are 0. \(\gamma_{01}\) is the mean difference between clusters that differ by one unit on \(Z_j\). \(u_{0j}\) is still assumed to follow a normal distribution with a mean of 0 and estimated variance \(\tau^2_{u0}\) and is assumed to be uncorrelated with \(X, Z,\) and \(e_{ij}\); it can be interpreted as the Level 2 group variation around \(\gamma_{00}\). Finally, \(e_{ij}\) represents the normally distributed individual error term with a mean of 0 and estimated variance of \(\sigma^2_{e0}\). Here, the fixed part of the equation is \(\gamma_{00} + \beta_{1} X_{1ij} + \gamma_{01} Z_j\), and the random portion is still \(u_{0j} + e_{ij}\). Notice that this random portion applies only to the intercept, not the slope.
Let’s simulate data to follow these equations, a random intercept model with a Level 1 predictor and a Level 2 predictor. First we’ll define the sample size.
N <- 50 # Sample size per cluster
C <- 20 # Number of clusters
Then we’ll define the fixed portion of our model. We’ll set \(\gamma_{00}\), \(\gamma_{01}\), and \(\beta_1\) to single values as was done for the variance components model. However, \(Z_j\) and \(X_{1ij}\) will be drawn randomly from normal distributions and saved to objects called Zj and X1ij, respectively. Since we want to simulate these data as having meaningful differences between clusters, we draw these variables from normal distributions separately for each cluster.
# Fixed
g00 <- 2
g01 <- 3
set.seed(111)
Zj <- as.vector(replicate(n = C,
expr = rnorm(1, mean = 0, sd = 3))) # Level 2 predictor
Zj <- rep(x = Zj, each = N)
B1 <- 4
set.seed(555)
X1ij <- as.vector(replicate(n = C,
expr = rnorm(N, 0, 1))) # Level 1 predictor
Finally, we’ll simulate the random portion of our model. We’ll create tau2, sigma2, u0j, and eij the same way we did for the variance components model.
# Random
tau2 <- 5
sigma2 <- 3
set.seed(707)
u0j <- rnorm(n = C, mean = 0, sd = sqrt(tau2))
u0j <- rep(x = u0j, each = N)
set.seed(111)
eij <- rnorm(n = N*C, mean = 0, sd = sqrt(sigma2))
Now we can create yij again using a simple linear combination of all our created variables. Then we can merge everything into a data frame and conduct a reasonableness check on the simulated data.
yij <- g00 + g01*Zj + u0j + B1*X1ij + eij
## Merging into a dataframe
ID <- 1:(N*C)
Cluster <- rep(x = seq(1,C), each = N)
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
g00 = rep(x = g00, times = N*C),
g01 = rep(x = g01, times = N*C),
Z = Zj,
u0j = u0j,
B1 = rep(x = B1, times = N*C),
X = X1ij,
eij = eij)
head(df)
y ID Cluster g00 g01 Z u0j B1 X eij
1 0.3386258 1 1 2 3 0.7056621 -2.866334 4 -0.3298603 0.4074142
2 2.6923872 2 1 2 3 0.7056621 -2.866334 4 0.5036464 -0.5728513
3 2.2083828 3 1 2 3 0.7056621 -2.866334 4 0.3743696 -0.5397483
4 4.8173524 4 1 2 3 0.7056621 -2.866334 4 1.8886198 -3.9877797
5 -6.1649241 5 1 2 3 0.7056621 -2.866334 4 -1.7799027 -0.2959660
6 5.0361463 6 1 2 3 0.7056621 -2.866334 4 0.8856311 0.2429690
tail(df)
y ID Cluster g00 g01 Z u0j B1 X eij
995 5.209822 995 20 2 3 1.09256 2.378905 4 -0.80014180 0.7538043
996 5.796583 996 20 2 3 1.09256 2.378905 4 -0.07395561 -1.5641802
997 4.734462 997 20 2 3 1.09256 2.378905 4 -0.83338705 0.4114252
998 4.936531 998 20 2 3 1.09256 2.378905 4 -1.02491019 1.3795863
999 12.883061 999 20 2 3 1.09256 2.378905 4 1.18937051 0.4689932
1000 6.658694 1000 20 2 3 1.09256 2.378905 4 -0.18230970 -0.2686525
We can see that we have unique individuals assigned to different clusters, we have the exact number of rows we anticipated (20 x 50 = 1,000), each individual has the same value for g00, g01, and B1, each cluster has a different value for Z and u0j, and each individual has a different value for y, X, and eij. We can now use these data to fit a model with two predictors and a random intercept:
m1 <- lmer(y ~ 1 + Z + X + (1 | Cluster), data = df)
summary(m1)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ 1 + Z + X + (1 | Cluster)
Data: df
REML criterion at convergence: 4005.2
Scaled residuals:
Min 1Q Median 3Q Max
-3.4606 -0.6833 0.0414 0.6487 3.0522
Random effects:
Groups Name Variance Std.Dev.
Cluster (Intercept) 5.618 2.370
Residual 2.927 1.711
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.17320 0.57279 3.794
Z 2.81310 0.19821 14.193
X 4.02161 0.05583 72.035
Correlation of Fixed Effects:
(Intr) Z
Z 0.367
X 0.006 0.009
As before, we will begin by looking at the “Scaled residuals:” section. It reveals that the error distribution has a median of 0.0414, accompanied by roughly equivalent absolute values for the first and third quartiles as well as roughly equivalent absolute values for the minimum and maximum. This consistency suggests that the error term follows a normal distribution with a mean of 0, but in practice further testing should be conducted to confirm this.
In the “Random effects:” section, the variance of “Cluster (Intercept)” is 5.618, very close to tau2 with a true value of 5. The variance for “Residual” is 2.927, extremely close to the true sigma2 value of 3.
Moving to the “Fixed effects:” section, we look at the estimates for \(\gamma_{00}\), \(\gamma_{01}\), and \(\beta_1\). The “Estimate” for “(Intercept)” is 2.17320, close to the true value of 2 for g00. The “Estimate” for “Z” is 2.81310, close to the true g01 value of 3, and for “X” the estimate of 4.02161 is again very close to the true value of 4 for B1.
The final section, “Correlation of Fixed Effects:”, shows the correlation between the fixed effects. The correlation between X and Z is 0.009. As stated, this refers specifically to the correlation between the fixed-effect estimates, not to the correlation between the variables themselves. For more information on what this means, see this article on the correlation of fixed effects.
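If you want the full correlation matrix of the fixed-effect estimates, rather than the lower triangle printed in the summary, one option (a sketch, not the only route) is to convert the estimated covariance matrix of the fixed effects:
# Correlation matrix of the fixed-effect estimates from m1
cov2cor(as.matrix(vcov(m1)))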
Simulating a Random Slope
Now let’s go through how to simulate data coming from a model with a random slope. We will consider a model with one Level 1 predictor and both a random intercept and random slope. The equations are as follows:
Level 1: \(y_{ij} = \beta_{0j} + \beta_{1j} X_{1ij} + e_{ij}; e_{ij} \sim N(0,\sigma^2_{e0})\),
Level 2: \(\beta_{0j} = \gamma_{00} + u_{0j}; u_{0j} \sim N(0, \tau^2_{u0})\); \(\beta_{1j} = \gamma_{01} + u_{1j}; u_{1j} \sim N(0, \tau^2_{u1})\),
Composite: \(y_{ij} = \gamma_{00} + u_{0j} + (\gamma_{01} + u_{1j}) \times X_{1ij} + e_{ij}\).
The main difference between this model and the above random intercept model is the introduction of \(\beta_{1j}\), which varies by cluster and is determined by \(\gamma_{01}\) (the fixed effect of \(X_{1ij}\), i.e., the average slope across all clusters) and \(u_{1j}\), the random effect of \(X_{1ij}\), which comes from a normal distribution with a mean of 0 and an estimated variance of \(\tau^2_{u1}\).
We will use the same sample size and number of clusters as before.
N <- 50
C <- 20
Next we will set the values for the fixed portion of the model as before.
# Fixed
g00 <- 2
g01 <- 5
And finally we will simulate the random portion of the model.
# Random
tau2_u0 <- 5
set.seed(707)
u0j <- rnorm(n = C, mean = 0, sd = sqrt(tau2_u0))
u0j <- rep(x = u0j, each = N)
b0j <- g00 + u0j
tau2_u1 <- 3
set.seed(5707)
u1j <- rnorm(n = C, mean = 0, sd = sqrt(tau2_u1))
u1j <- rep(x = u1j, each = N)
b1j <- g01 + u1j
sigma2 <- 3
set.seed(111)
eij <- rnorm(n = N*C, mean = 0, sd = sqrt(sigma2))
Next we can create a Level 1 predictor called X1ij.
# Predictor
set.seed(555)
X1ij <- as.vector(replicate(n = C,
expr = rnorm(N, 0, 1))) # Level 1 predictor
Finally, we will combine all of this to create an outcome variable, yij.
yij <- b0j + b1j*X1ij + eij
As before, we will create a data frame containing all these simulated values.
# Merging into a dataframe
ID <- 1:(N*C) # Creating a unique id for each individual
Cluster <- rep(x = seq(1,C), each = N) # Creating a cluster variable
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
eij = eij,
u0j = u0j,
g00 = rep(x = g00, times = N*C),
u1j = u1j,
g01 = rep(x = g01, times = N*C),
b0j = b0j,
b1j = b1j,
X = X1ij)
head(df)
y ID Cluster eij u0j g00 u1j g01 b0j
1 -2.3706838 1 1 0.4074142 -2.866334 2 0.7956791 5 -0.8663336
2 1.4797882 2 1 -0.5728513 -2.866334 2 0.7956791 5 -0.8663336
3 0.7636439 3 1 -0.5397483 -2.866334 2 0.7956791 5 -0.8663336
4 6.0917210 4 1 -3.9877797 -2.866334 2 0.7956791 5 -0.8663336
5 -11.4780445 5 1 -0.2959660 -2.866334 2 0.7956791 5 -0.8663336
6 4.5094691 6 1 0.2429690 -2.866334 2 0.7956791 5 -0.8663336
b1j X
1 5.795679 -0.3298603
2 5.795679 0.5036464
3 5.795679 0.3743696
4 5.795679 1.8886198
5 5.795679 -1.7799027
6 5.795679 0.8856311
tail(df)
y ID Cluster eij u0j g00 u1j g01 b0j
995 1.0058358 995 20 0.7538043 2.378905 2 0.1576774 5 4.378905
996 2.4332853 996 20 -1.5641802 2.378905 2 0.1576774 5 4.378905
997 0.4919884 997 20 0.4114252 2.378905 2 0.1576774 5 4.378905
998 0.4723349 998 20 1.3795863 2.378905 2 0.1576774 5 4.378905
999 10.9822873 999 20 0.4689932 2.378905 2 0.1576774 5 4.378905
1000 3.1699576 1000 20 -0.2686525 2.378905 2 0.1576774 5 4.378905
b1j X
995 5.157677 -0.80014180
996 5.157677 -0.07395561
997 5.157677 -0.83338705
998 5.157677 -1.02491019
999 5.157677 1.18937051
1000 5.157677 -0.18230970
Now we can fit a model with both a random intercept and a random slope for X, specifying the random portion as (1 + X | Cluster):
m2 <- lmer(yij ~ X + (1 + X | Cluster), data = df)
summary(m2)
Linear mixed model fit by REML ['lmerMod']
Formula: yij ~ X + (1 + X | Cluster)
Data: df
REML criterion at convergence: 4040.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.3414 -0.6460 0.0103 0.6403 3.2936
Random effects:
Groups Name Variance Std.Dev. Corr
Cluster (Intercept) 5.631 2.373
X 1.138 1.067 0.17
Residual 2.868 1.694
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.3595 0.5334 4.424
X 5.1799 0.2451 21.135
Correlation of Fixed Effects:
(Intr)
X 0.161
The “Scaled residuals:” section suggests a symmetric distribution centered at 0. As before, confirming that this is indeed a normal distribution would require follow-up diagnostics.
In the “Random effects:” section, the variance of “Cluster (Intercept)” is 5.631, very close to the true tau2_u0 value of 5. The variance of “X” is 1.138, which is not as close to the true tau2_u1 value of 3. Lastly in this section, the variance for “Residual” is 2.868, quite close to the true sigma2 value of 3.
Turning to the “Fixed effects:” section, we look at the estimates for \(\gamma_{00}\) and \(\gamma_{01}\). The “Estimate” for “(Intercept)” is 2.3595, close to the true value of 2 for g00. The “Estimate” for “X” is 5.1799, very close to the true g01 value of 5.
Notice that the “Random effects:” section now includes a correlation between the random intercept and random slope, estimated at 0.17. Despite this positive estimate, we didn’t simulate our data with correlated random effects; the estimated correlation is simply a point estimate. To get a better sense of its uncertainty, we can compute confidence intervals for the random effects using the confint() function. Setting parm = "theta_" restricts the confidence interval calculations to the random effects, and setting oldNames = FALSE provides easier-to-read names in the output.
confint(object = m2, parm = "theta_", oldNames = FALSE)
Computing profile confidence intervals ...
2.5 % 97.5 %
sd_(Intercept)|Cluster 1.7388424 3.2756336
cor_X.(Intercept)|Cluster -0.2846315 0.5567918
sd_X|Cluster 0.7661714 1.4878505
sigma 1.6205691 1.7722665
Notice that the confidence interval on the correlation, [-0.28, 0.56], is wide and comfortably spans 0, consistent with the fact that the data were simulated without correlated random effects.
Above, we drew \(u_{0j}\) and \(u_{1j}\) independently from separate populations with no correlation, so the estimate of 0.17 is purely noise. But suppose we want to simulate data with a non-zero correlation between the random intercept and random slope. Instead of creating each variable separately using rnorm(), we can draw them jointly from a multivariate normal distribution with a specified covariance. This is easily done with the mvrnorm() function from the MASS package (v7.3-58.3; Venables & Ripley, 2002).
The mvrnorm() function takes sample size and mean arguments, similar to the rnorm() function; however, instead of an “sd” argument it has “Sigma”, the covariance matrix of the random variables. For this argument, we will create a matrix called var_mat containing the variances of \(u_{0j}\) and \(u_{1j}\) on the diagonal and the covariance between them in the off-diagonal elements. We will set their covariance to 1. Finally, we can use the cov2cor() function to see this matrix expressed as correlations, since a correlation is what we’ll get back in the model summary.
# Random
tau2_u0 <- 5
tau2_u1 <- 3
Cov_u1j.u0j <- 1
var_mat <- matrix(data = c(tau2_u0, Cov_u1j.u0j,
Cov_u1j.u0j, tau2_u1), nrow = 2, ncol = 2)
var_mat
[,1] [,2]
[1,] 5 1
[2,] 1 3
cov2cor(var_mat)
[,1] [,2]
[1,] 1.0000000 0.2581989
[2,] 0.2581989 1.0000000
Now we can load the MASS package, run the mvrnorm() function, and save the results into a matrix called u. u will have two columns, one for each of \(u_{0j}\) and \(u_{1j}\), and 20 rows, one for each cluster. To double-check that everything is working as expected, let’s compare the variance-covariance matrix of the simulated variables to the var_mat values and its equivalent correlation matrix.
library(MASS)
set.seed(609)
u <- mvrnorm(n = C, mu = c(0, 0), Sigma = var_mat)
var(u)
[,1] [,2]
[1,] 5.458752 1.287149
[2,] 1.287149 3.161914
cor(u)
[,1] [,2]
[1,] 1.0000000 0.3098185
[2,] 0.3098185 1.0000000
The values are very similar to those from var_mat and cov2cor(var_mat). To create u0j and u1j, we pull each column from the u matrix. Then we repeat the values of these vectors N times, as before, so that individuals in each cluster share the same value.
u0j <- u[,1]
u1j <- u[,2]
u0j <- rep(x = u0j, each = N)
u1j <- rep(x = u1j, each = N)
From these values, we will need to recalculate b0j, b1j, and yij, and rebuild df to reflect them.
b0j <- g00 + u0j
b1j <- g01 + u1j
yij <- b0j + b1j*X1ij + eij
# Merging into a dataframe
df <- data.frame(y = yij,
ID = ID,
Cluster = Cluster,
eij = eij,
u0j = u0j,
g00 = rep(x = g00, times = N*C),
u1j = u1j,
g01 = rep(x = g01, times = N*C),
b0j = b0j,
b1j = b1j,
X = X1ij)
And finally, we can create another model with these new data.
m3 <- lmer(yij ~ X + (1 + X | Cluster), data = df)
summary(m3)
Linear mixed model fit by REML ['lmerMod']
Formula: yij ~ X + (1 + X | Cluster)
Data: df
REML criterion at convergence: 4057.3
Scaled residuals:
Min 1Q Median 3Q Max
-3.3610 -0.6590 0.0007 0.6611 3.3852
Random effects:
Groups Name Variance Std.Dev. Corr
Cluster (Intercept) 5.362 2.316
X 3.104 1.762 0.26
Residual 2.868 1.694
Number of obs: 1000, groups: Cluster, 20
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.7071 0.5206 5.20
X 5.4977 0.3980 13.81
Correlation of Fixed Effects:
(Intr)
X 0.259
We can see that, compared to m2, the estimates are fairly similar and that the correlation between the random effects is now 0.26, matching the correlation of about 0.26 implied by var_mat (see cov2cor(var_mat) above).
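As one last reasonableness check, we can place the estimated random-effects parameters next to the values we simulated; a minimal sketch:
# Compare estimated random-effect SDs and correlation to the simulated ones
as.data.frame(VarCorr(m3)) # the sdcor column holds the SDs and the correlation
cov2cor(var_mat) # true correlation implied by the simulated covariance matrix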
References
- Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
- R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.r-project.org/
- Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.
Laura Jamison
StatLab Associate
University of Virginia Library
September 13, 2023
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.