Mixed Effect versus Fixed Effect Models

When faced with analyzing clustered or repeated measures data, some researchers and analysts turn to mixed effect modeling, while others in the same situation turn to fixed effect modeling. Which one you choose is usually dictated by your field of study and statistical education. Those coming from fields like Psychology, Ecology, and Education often choose mixed effect modeling, while those coming from fields like Economics and Political Science typically choose fixed effect modeling. If you learned only one of these approaches, the other may seem mysterious or even deficient. In this article we aim to explain the similarities and differences between these two methods and hopefully shed some light on whichever approach is unfamiliar to you.

Before we get started, let’s load some R packages we’ll use in this article. If you don’t have one or more of these packages, you can install them using the install.packages() function. We’ll explain the purpose of each package as we use their functions.


library(lme4)
library(fixest)
library(ggplot2)
library(dplyr)
library(nlme)  # a recommended package that ships with R

Mixed Effect Models

Mixed effect models, also known as multilevel models, contain a mix of what are called “fixed” and “random” effects. Let’s simulate data using a mixed effect model to illustrate what we mean by “fixed” and “random” effects. Below we create two variables, g and x, and then create all combinations of these variables using the expand.grid() function. The x variable is a predictor variable that is simply the numbers 1 - 10. The group variable indicates clusters. We make it a factor to explicitly declare this variable is categorical, not numeric. There are seven clusters total, each with the variable x. In this example, each cluster has ten observations for simplicity, but mixed effect models do not require that cluster sizes be equal.


g <- 1:7
x <- 1:10
d <- expand.grid(x = x, group = factor(g))
head(d, n = 12)

    x group
1   1     1
2   2     1
3   3     1
4   4     1
5   5     1
6   6     1
7   7     1
8   8     1
9   9     1
10 10     1
11  1     2
12  2     2

Next we’ll simulate an outcome variable, y, using an intercept and a slope multiplied by x. The intercept is “beta0” and the slope is “beta1”. We arbitrarily picked 1.5 and 3.2 as the values. Notice y is the same in all seven clusters, which we show by printing y in a 10 x 7 matrix.


beta0 <- 1.5
beta1 <- 3.2
y <- beta0 + beta1*d$x 
matrix(y, ncol = 7, dimnames = list(obs = 1:10, cluster = 1:7))

    cluster
obs     1    2    3    4    5    6    7
  1   4.7  4.7  4.7  4.7  4.7  4.7  4.7
  2   7.9  7.9  7.9  7.9  7.9  7.9  7.9
  3  11.1 11.1 11.1 11.1 11.1 11.1 11.1
  4  14.3 14.3 14.3 14.3 14.3 14.3 14.3
  5  17.5 17.5 17.5 17.5 17.5 17.5 17.5
  6  20.7 20.7 20.7 20.7 20.7 20.7 20.7
  7  23.9 23.9 23.9 23.9 23.9 23.9 23.9
  8  27.1 27.1 27.1 27.1 27.1 27.1 27.1
  9  30.3 30.3 30.3 30.3 30.3 30.3 30.3
  10 33.5 33.5 33.5 33.5 33.5 33.5 33.5

Having our outcome be the same in each group is not very interesting, so let’s add some noise, or random variation. We’ll create two sources of variation: (1) variation between groups, u, and (2) variation within groups, e. Both are generated by randomly sampling from normal distributions with a mean of 0 and a constant standard deviation (that we arbitrarily set to 4.5 and 1.2, respectively). Notice we sample seven values for u and 70 values for e. That’s because u is specific to each group while e is specific to each observation. We then add the group variation, u, to beta0 and the observation variation, e, to each observation. The set.seed() function allows readers to simulate the same random data. We use the head() function to view the first 12 records of data, but the interested reader may want to view the entire data frame to verify that y now has variation between and within clusters.


set.seed(1)
u <- rnorm(n = 7, mean = 0, sd = 4.5)
e <- rnorm(n = nrow(d), mean = 0, sd = 1.2)
d$y <- (beta0 + u[d$group]) + beta1*d$x + e
head(d, n = 12)

    x group         y
1   1     1  2.766947
2   2     1  5.771895
3   3     1  7.914492
4   4     1 13.295095
5   5     1 15.148770
6   6     1 17.135469
7   7     1 18.423318
8   8     1 25.630875
9   9     1 27.427038
10 10     1 30.661530
11  1     2  6.658998
12  2     2  9.711860

Let’s use ggplot2 to plot the data. We use geom_smooth(method = "lm", se = FALSE) to draw fitted straight trend lines for each group.


ggplot(d) +
  aes(x = x, y = y, color = group) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Scatterplot of y versus x with points and trend lines colored according to group.

In the plot we can see each group has a different intercept but a similar slope. The group exerts a random effect on the intercept while the effect of x, the slope, appears to be fixed. In other words, we have a mix of fixed and random effects. We might think of the intercept as a baseline measure at the beginning of a study that differs between subjects and the slope as the effect of time that is the same between subjects.

To analyze this data using a mixed effect model, we can use the lmer() function from the lme4 package. Since we simulated the data we know the correct model to fit. The model syntax y ~ x + (1|group) says, “model y as a function of x and let the intercept (1) be conditional on group”. This is sometimes called a random intercept model.


m <- lmer(y ~ x + (1|group), data = d)
summary(m, corr = FALSE)

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | group)
   Data: d

REML criterion at convergence: 245.7

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.4142 -0.5818  0.1217  0.5300  2.4361 

Random effects:
 Groups   Name        Variance Std.Dev.
 group    (Intercept) 16.730   4.090   
 Residual              1.202   1.096   
Number of obs: 70, groups:  group, 7

Fixed effects:
            Estimate Std. Error t value
(Intercept)  2.47175    1.57165   1.573
x            3.08703    0.04562  67.675

Under Random Effects we see the two estimates of variation: (1) between group, about 4.1, and (2) within group, about 1.1. Under Fixed Effects we see the coefficient estimates. The slope is estimated to be about 3.1, which is close to the true value of 3.2 we used to simulate the data. The intercept is estimated to be about 2.47. Of course the intercept is conditional on group, so the intercept reported here is basically an average of all seven group intercepts. To see all the intercepts, use the coef() function.


coef(m)

$group
  (Intercept)        x
1  -0.5395055 3.087032
2   3.2485211 3.087032
3  -1.8631753 3.087032
4   9.4429700 3.087032
5   4.0227277 3.087032
6  -1.5287058 3.087032
7   4.5194219 3.087032

attr(,"class")
[1] "coef.mer"

We have essentially fit a separate model to each group. The slope coefficient is fixed, but the intercept changes based on the group. In this data, each group has a random effect on the intercept. It’s important to clarify that the model did not estimate a separate intercept for each group but rather predicted each group’s random effect on the intercept.

What we just demonstrated is the simplest mixed effect model. We could make the model more complex by allowing the groups to also have a random effect on the slope. This would mean having two random effects. We could also model correlation between these two random effects. In addition we could have a more complicated grouping structure. For example, we might observe repeated measures on rats which belong to different litters. The rats and the litters would each form two groups or clusters, where one of the groups (rats) is nested within the other group (litters).
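In lme4 model syntax, these extensions are small changes to the random-effects term. Below is a sketch of that syntax; the first two models would fit to our data frame d, while the nested example uses hypothetical litter and rat variables to match the rats-within-litters scenario. Note that fitting a random slope to data simulated without one may trigger a singular-fit message from lmer().

```r
# Random intercept and random slope for x, with an estimated
# correlation between the two random effects
m_slope <- lmer(y ~ x + (1 + x | group), data = d)

# Same, but forcing the random intercept and slope to be uncorrelated
m_slope2 <- lmer(y ~ x + (1 + x || group), data = d)

# Nested grouping: rats nested within litters
# (litter, rat, and rat_data are hypothetical)
# m_nested <- lmer(y ~ x + (1 | litter/rat), data = rat_data)
```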

Formally, we can express mixed effect models as

\[\mathbf{y}_j = \mathbf{X}_j\boldsymbol{\beta} + \mathbf{Z}_j\mathbf{u}_j + \boldsymbol{\varepsilon}_j\]

where

\[\mathbf{u}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{G}), \qquad \boldsymbol{\varepsilon}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{R})\]

That’s a lot to process, so let’s break it down piece by piece.

  • \(\mathbf{y}_j\) is the outcome vector for cluster j
  • \(\mathbf{X}_j\) is the predictor matrix for cluster j
  • \(\boldsymbol{\beta}\) is the vector of fixed-effect regression coefficients
  • \(\mathbf{Z}_j\) is the matrix of predictors with random effects in cluster j
  • \(\mathbf{u}_j\) is the vector of random effects for cluster j
  • \(\boldsymbol{\varepsilon}_j\) is the vector of residual errors for cluster j
  • \(\mathbf{u}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{G})\) says the random effects are drawn from a normal distribution with mean 0 and covariance matrix \(\mathbf{G}\)
  • \(\boldsymbol{\varepsilon}_j \sim \mathcal{N}(\mathbf{0}, \mathbf{R})\) says the residual errors are also drawn from a normal distribution with mean 0 and covariance matrix \(\mathbf{R}\)

Let’s redo our data simulation above using this model. Notice we do it one cluster at a time since the model is subscripted by j. We use the replicate() function to simulate 7 distinct clusters of data and then combine the data frames using the bind_rows() function from the dplyr package. Notice we also use the model.matrix() function to easily create a predictor matrix with a column of 1s for the intercept. To generate the outcome, y, we need to perform matrix multiplication using the %*% operator.


set.seed(2)
d_mem <- replicate(n = 7, expr = {
  x <- 1:10
  X <- model.matrix(~ x)
  B <- c(1.5, 3.2)
  Z <- matrix(1, nrow = 10)
  u <- rnorm(n = 1, mean = 0, sd = 4.5)
  e <- rnorm(n = 10, mean = 0, sd = 1.2)
  y <- X %*% B + Z %*% u + e
  data.frame(y, x)}, 
  simplify = FALSE)

d_mem <- dplyr::bind_rows(d_mem, .id = "group") 
head(d_mem, n = 12)

        group          y  x
1...1       1  0.8857036  1
2...2       1  5.7692989  2
3...3       1  5.7074337  3
4...4       1 10.1675824  4
5...5       1 13.6227889  5
6...6       1 17.5134302  6
7...7       1 19.5762469  7
8...8       1 25.4452533  8
9...9       1 26.0973401  9
10...10     1 29.9650654 10
1...11      2  8.6466531  1
2...12      2 11.0702847  2

This method of simulating data allows us to easily scale up to more predictors and more complex models, which we’ll do later in this article.

Now let’s look at fixed effect models.

Fixed Effect Models

Fixed effect models are somewhat similar to mixed effect models in that both produce a different intercept for each cluster or group. However, whereas mixed effect models use random effects to generate group-specific intercepts, fixed effect models actually estimate a separate intercept for each group. In addition, they do not estimate the variation between the intercepts, nor do they return a single “average” intercept the way mixed effect models do. The clusters themselves are referred to as “fixed effects”, which can be confusing if you’re accustomed to mixed effect modeling. Let’s simulate data using a fixed effect model to see how it compares to a mixed effect model.

We’ll begin the same way as before by creating two variables, g and x, and then generating all combinations of these variables using the expand.grid() function. The x variable is once again a predictor variable that is simply the numbers 1 - 10. The group variable indicates clusters. There are seven clusters total, each with the variable x. In this example, each cluster has ten observations for simplicity, but fixed effect models do not require that cluster sizes be equal.


g <- 1:7
x <- 1:10
d <- expand.grid(x = x, group = factor(g))

Next we’ll simulate the outcome variable, y, using an intercept and a slope multiplied by x. The difference this time is that we use seven distinct intercepts. We randomly simulate the intercepts by sampling from a normal distribution. However, it’s important to note that fixed effect models do not assume intercepts are drawn from any particular distribution. To use the intercepts to simulate y for each cluster, we need to use each intercept 10 times, so we use the rep() function to repeat each one 10 times. Finally, we add some random variation to each observation by sampling from a normal distribution with a mean of 0 and a constant standard deviation of 1.2.


set.seed(2)
beta0 <- rep(rnorm(n = 7, mean = 0, sd = 4.5), each = 10)
beta1 <- 3.2
e <- rnorm(n = nrow(d), mean = 0, sd = 1.2)
d$y <- beta0 + beta1*d$x + e
head(d, n = 12)

    x group         y
1   1     1 -1.123753
2   2     1  4.745253
3   3     1  5.397340
4   4     1  9.265065
5   5     1 13.141988
6   6     1 14.692650
7   7     1 17.116282
8   8     1 23.702559
9   9     1 21.990602
10 10     1 29.018210
11  1     2  4.074789
12  2     2  8.447216

Let’s again use ggplot2 to plot the data. We use geom_smooth(method = "lm", se = FALSE) to draw fitted straight trend lines for each group.


ggplot(d) +
  aes(x = x, y = y, color = group) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Scatterplot of y versus x with points and trend lines colored according to group.

This looks much like the data we simulated using a mixed effect model. Each cluster has a different intercept but similar slope.

To analyze this data using a fixed effect model, we can use the feols() function from the fixest package. The model syntax y ~ x | group says, “model y as a function of x with group as the fixed effect.”


m2 <- feols(y ~ x | group, data = d)
summary(m2)

OLS estimation, Dep. Var.: y
Observations: 70
Fixed-effects: group: 7
Standard-errors: IID 
  Estimate Std. Error t value  Pr(>|t|)    
x  3.13331   0.059198 52.9293 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.33885     Adj. R2: 0.979512
                Within R2: 0.978348

The only coefficient reported is the slope, which is close to the true value of 3.2 we used to generate the data. If we want to see the estimated intercepts, we can view them using the fixef() function.


fixef(m2)

$group
         1          2          3          4          5          6          7 
-3.4386078  1.6721802  7.4205012 -4.8016470 -0.3386059  1.0985188  3.4031017 

attr(,"class")
[1] "fixest.fixef" "list"        
attr(,"exponential")
[1] FALSE

We can also use the built-in lm() function to run a fixed effect model. We just need to include the group variable as a predictor. The intercept in the summary output is the intercept for group 1. The other group coefficients are differences from group 1. For example, the intercept for group 2 is about -3.44 + 5.11 = 1.67.


m2_lm <- lm(y ~ x + group, data = d)
summary(m2_lm)


Call:
lm(formula = y ~ x + group, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1822 -0.8215 -0.0875  0.8609  2.7905 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.4386     0.5553  -6.192 5.26e-08 ***
x             3.1333     0.0592  52.929  < 2e-16 ***
group2        5.1108     0.6362   8.033 3.47e-11 ***
group3       10.8591     0.6362  17.069  < 2e-16 ***
group4       -1.3630     0.6362  -2.142   0.0361 *  
group5        3.1000     0.6362   4.873 7.97e-06 ***
group6        4.5371     0.6362   7.132 1.27e-09 ***
group7        6.8417     0.6362  10.754 8.35e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.423 on 62 degrees of freedom
Multiple R-squared:  0.9816,    Adjusted R-squared:  0.9795 
F-statistic: 472.3 on 7 and 62 DF,  p-value: < 2.2e-16
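As a quick check, we can recover group 2’s intercept directly from the lm() coefficients (a small sketch using the m2_lm object fit above):

```r
co <- coef(m2_lm)
# group 1's intercept plus the group2 difference gives group 2's intercept
unname(co["(Intercept)"] + co["group2"])  # about 1.67, matching fixef(m2)
```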

If we wanted to see the estimated group intercepts in the output, we could fit the model without a single dedicated “intercept”, like so:


m2_lm0 <- lm(y ~ -1 + x + group, data = d)
summary(m2_lm0)


Call:
lm(formula = y ~ -1 + x + group, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1822 -0.8215 -0.0875  0.8609  2.7905 

Coefficients:
       Estimate Std. Error t value Pr(>|t|)    
x        3.1333     0.0592  52.929  < 2e-16 ***
group1  -3.4386     0.5553  -6.192 5.26e-08 ***
group2   1.6722     0.5553   3.011  0.00376 ** 
group3   7.4205     0.5553  13.362  < 2e-16 ***
group4  -4.8017     0.5553  -8.647 3.01e-12 ***
group5  -0.3386     0.5553  -0.610  0.54426    
group6   1.0985     0.5553   1.978  0.05236 .  
group7   3.4031     0.5553   6.128 6.75e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.423 on 62 degrees of freedom
Multiple R-squared:  0.9957,    Adjusted R-squared:  0.9952 
F-statistic:  1806 on 8 and 62 DF,  p-value: < 2.2e-16

But it’s rare that researchers are interested in knowing the group/cluster intercepts. The idea of fixed effect modeling is to simply adjust for the clusters. Unlike mixed effect modeling, which assumes cluster random effects are normally distributed, fixed effect modeling makes no assumption about the variation between clusters. There is also no limit on the number of fixed effects (i.e., cluster variables). It’s not uncommon to see fixed effect models with two or three cluster variables.
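In fixest syntax, additional cluster variables are simply added to the right of the |. Our simulated data has only one cluster variable, so the example below is a sketch with hypothetical firm and year variables:

```r
# Two-way fixed effects: a separate intercept adjustment for every firm
# and every year (firm, year, and panel_data are hypothetical)
# m_twoway <- feols(y ~ x | firm + year, data = panel_data)
```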

Formally, we can express fixed effect models as

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{C}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}\]

where

\[\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\]

Once again let’s break this down piece by piece.

  • \(\mathbf{y}\) is the outcome vector
  • \(\mathbf{X}\) is the predictor matrix
  • \(\boldsymbol{\beta}\) is the vector of regression coefficients
  • \(\mathbf{C}\) is the matrix of dummy codes indicating cluster membership
  • \(\boldsymbol{\alpha}\) is the vector of cluster intercepts
  • \(\boldsymbol{\varepsilon}\) is the vector of residual errors
  • \(\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})\) says the residual errors are drawn from a normal distribution with mean 0 and a constant standard deviation \(\sigma\).

This model is not subscripted by j since the cluster grouping is performed with the \(\mathbf{C}\) matrix.

Let’s redo our data simulation above using this model. We set j = 7 to indicate seven clusters. For B we only specify the slope coefficient since the intercepts are handled by the \(\mathbf{C}\boldsymbol{\alpha}\) portion of the model. The model.matrix() function quickly creates a matrix of dummy codes when given a factor variable. Notice we add -1 to the formula for X to leave out the column of ones for a single overall intercept; the cluster intercepts are instead carried by the \(\mathbf{C}\) matrix. (Because model.matrix(~ group) retains a column of ones, the first element of alpha acts as group 1’s intercept and the remaining elements are offsets from it.) We generate the intercept coefficients by sampling from a normal distribution, but there is no assumption or requirement that intercepts be normally distributed.


set.seed(2)
j <- 7
x <- rep(1:10, j)
X <- model.matrix(~ -1 + x)
B <- 3.2
group <- factor(rep(1:j, each = 10))
C <- model.matrix(~ group)
alpha <- rnorm(j, mean = 0, sd = 4.5)  # the intercepts
e <- rnorm(10 * j, mean = 0, sd = 1.2)
y <- X %*% B + C %*% alpha + e
d_fem <- data.frame(y, x, group)
head(d_fem, n = 12)

             y  x group
1  -1.12375309  1     1
2   4.74525326  2     1
3   5.39734013  3     1
4   9.26506544  4     1
5  13.14198787  5     1
6  14.69265011  6     1
7  17.11628177  7     1
8  23.70255929  8     1
9  21.99060164  9     1
10 29.01821004 10     1
11  0.03867393  1     2
12  4.41110030  2     2

As before, this notation allows us to easily scale up the data simulation process to accommodate more complex models.

Omitted Confounders

In the simple example above, it may seem like the difference between fixed effect and mixed effect models is slight. After all, they both arrived at the same basic conclusion that the slope coefficient was about 3.1. However, there are some important differences to be aware of. Probably the most important one is how each handles omitted confounders.

An omitted confounder is an unobserved variable that is correlated with one of the observed variables. This is sometimes called endogeneity. In a mixed effect model, an omitted confounder can lead to biased (wrong) estimates of coefficients. However, this is not a problem for a fixed effect model. Let’s demonstrate.

Below we write a function to simulate a cluster of data using a mixed effect model. We set the default cluster size to ten observations. The crucial part of the function is the first two lines. The variable z is a “level 2” variable. That means it’s constant within a cluster. The variable x1 is a “level 1” variable. That means it varies within a cluster. The second line shows the variable x1 being generated as a function of z. This induces correlation between z and x1. The z variable will serve as our unobserved confounder. The B matrix contains the true values of the coefficients. The order is intercept, x1, x2 and z. Make note of the x1 coefficient: 0.8. This is the coefficient we investigate in this example. After we define the function, we run it to demonstrate how it works.


sim_data <- function(m = 10){
  z <- rnorm(1, mean = 10, sd = 2) # add level 2 variable
  x1 <- rnorm(m) + z               # z is confounder
  x2 <- floor(runif(m, min = 1, max = 6))
  X <- model.matrix(~ x1 + x2)
  X <- cbind(X, z) 
  B <- matrix(data = c(0.5, 0.8, -0.4, 1.4), ncol = 1)
  Z <- matrix(data = 1, nrow = m, ncol = 1)
  u <- rnorm(1, mean = 0, sd = 1.2)
  e <- rnorm(m, mean = 0, sd = 0.8)
  y <- X %*% B + Z %*% u + e
  data.frame(y, x1, x2, z)
}
sim_data()

          y        x1 x2        z
1  23.09761 12.596105  2 11.54358
2  21.73521 10.133541  1 11.54358
3  25.03000 12.539564  4 11.54358
4  21.71523  9.847815  3 11.54358
5  21.16494 11.010207  5 11.54358
6  21.48698 10.171310  3 11.54358
7  21.35896  9.335660  5 11.54358
8  24.33190 13.365702  4 11.54358
9  23.09159 10.890186  1 11.54358
10 21.43828 11.258898  5 11.54358

Now we’ll run the function 30 times to generate 30 clusters of data and then combine the 30 data sets into one data frame.


set.seed(234)
d2 <- replicate(n = 30, sim_data(), simplify = FALSE)
d2 <- dplyr::bind_rows(d2, .id = "group")

The pairs() function reveals a fair amount of correlation between x1 and z.


pairs(d2[,c("y", "x1", "x2", "z")])
Pairwise scatterplots of y, x1, x2, and z.

Now let’s pretend we didn’t observe or collect z and model y as a function of just x1 and x2 with random effects on the intercept. Again, x1 is our primary variable of interest. We might think of x2 as a variable we wish to adjust for in our model. We can view the estimated coefficient for x1 using the fixef() function.


m3 <- lmer(y ~ x1 + x2 + (1|group), data = d2)
fixef(m3)["x1"]

       x1 
0.8216403 

The estimated coefficient is a bit higher than the true value of 0.8 we used to simulate the data. That’s not too concerning. We wouldn’t expect our model to exactly estimate the true value. But what happens if we repeat this process many times? Below we replicate the data simulation and mixed effect model 250 times and then plot a histogram of the estimated coefficients for x1. We also add a red line to indicate the true coefficient value for x1.


x1_coefs <- replicate(n = 250, expr = {
  d2 <- replicate(n = 30, sim_data(), simplify = FALSE)
  d2 <- dplyr::bind_rows(d2, .id = "group")
  m3 <- lmer(y ~ x1 + x2 + (1|group), data = d2)
  fixef(m3)["x1"]
  })
hist(x1_coefs, 
     main = "Distribution of estimated x1 coefficients\nMixed Effect model, omitted confounder", 
     xlab = "x1")
abline(v = 0.8, col = "red", lwd = 2)
Histogram of estimated x1 coefficients from mixed effect models with omitted confounder. The distribution is skewed about the true value of x1, illustrating bias.

The histogram reveals that we are systematically overestimating the x1 coefficient. In other words, the estimate is biased. If the estimate were unbiased, we would expect the histogram to be centered on the true value of 0.8.

Let’s run the same simulation using a fixed effect model. Notice we use feols() to model y as a function of x1 and x2 with group fixed effects.


fem_x1_coefs <- replicate(n = 250, expr = {
  d2 <- replicate(n = 30, sim_data(), simplify = FALSE)
  d2 <- dplyr::bind_rows(d2, .id = "group")
  m4 <- feols(y ~ x1 + x2 | group, data = d2)
  coef(m4)["x1"]
})
hist(fem_x1_coefs, 
     main = "Distribution of estimated x1 coefficients\nFixed Effect model, omitted confounder", 
     xlab = "x1")
abline(v = 0.8, col = "red", lwd = 2)
Histogram of estimated x1 coefficients from fixed effect models with omitted confounder. The distribution is symmetric about the true value of x1.

The histogram shows that the fixed effect model makes unbiased estimates of the x1 coefficient despite the omitted confounder. That’s impressive! This might suggest always using fixed effect models for clustered data.

But let’s say we did observe z and that we want to include it in our model. This is not a problem for a mixed effect model. We simply add it to our model formula and our x1 estimate is no longer biased.


x1_coefs2 <- replicate(n = 250, expr = {
  d2 <- replicate(n = 30, sim_data(), simplify = FALSE)
  d2 <- dplyr::bind_rows(d2, .id = "group")
  m3 <- lmer(y ~ x1 + x2 + z + (1|group), data = d2)
  fixef(m3)["x1"]
  })
hist(x1_coefs2, main = "Distribution of estimated x1 coefficients\nMixed Effect model, confounder included", xlab = "x1")
abline(v = 0.8, col = "red", lwd = 2)
Histogram of estimated x1 coefficients from mixed effect models with confounder included. The distribution is symmetric about the true value of x1.

In addition we can estimate a coefficient for z that allows us to quantify its association with our outcome.


m3 <- lmer(y ~ x1 + x2 + z + (1|group), data = d2)
fixef(m3)["z"]

       z 
1.468052 

But adding z to the fixed effect model poses a problem.


m4 <- feols(y ~ x1 + x2 + z | group, data = d2)

The variable 'z' has been removed because of collinearity (see $collin.var).

The z variable is dropped because fixed effect models do not allow for level 2 predictors. Recall that the z variable is constant within each cluster. By including z we induced perfect collinearity with the intercept in each cluster. In general, fixed effect models don’t allow us to quantify the association of level 2 variables with our outcome.

The Within-Between Specification

McNeish and Kelley (2018) outline a method for mixed effect models to address the issue of omitted confounders which they call the Within-Between Specification. The basic idea is to preprocess the data as follows:

  • Group mean center level 1 predictors (i.e., predictors that vary within clusters)
  • Calculate the mean of level 1 predictors within each cluster

Then model your outcome as a function of these values instead of using the raw level 1 data. The coefficients on the group-mean-centered predictors will capture within-cluster effects while the coefficients on the group means will capture between-cluster effects. This basically relegates any effects due to omitted variables at level 2 to another coefficient (i.e., the coefficient for the group means).

Let’s demonstrate this approach using the data frame d2 that we created above. Recall this data frame has two variables, x1 and z, that are correlated. When we leave z out of our model, we have an omitted confounder. To implement the Within-Between Specification, we’ll use the dplyr package to (1) group mean center the x1 and x2 variables, and (2) calculate the mean of x1 and x2 within each cluster.


d2_wb <- d2 |> 
  group_by(group) |> 
  mutate(x1m = mean(x1),  # mean within each cluster
         x2m = mean(x2),
         x1c = x1 - x1m,  # group mean center within each cluster
         x2c = x2 - x2m,
         ) |> 
  ungroup() 

Then we fit our model using the newly derived variables. Our variable of interest is the x1c variable, the centered version of x1. In our previous model, the variable x1 was biased due to the omitted confounder z. But by modeling the group mean centered version of this variable along with its cluster mean, we can get an unbiased estimate of this coefficient.


m_wb <- lmer(y ~ x1c + x1m + x2c + x2m + (1|group), data = d2_wb)
fixef(m_wb)["x1c"]

      x1c 
0.7824866 

This estimate looks a little low, but let’s see how the estimate performs with repeated sampling and modeling.


x1_coefs_wb <- replicate(n = 250, expr = {
  d2 <- replicate(n = 30, sim_data(), simplify = FALSE)
  d2 <- dplyr::bind_rows(d2, .id = "group") |> 
    group_by(group) |> 
    mutate(x1m = mean(x1),  # mean within each cluster
           x2m = mean(x2),
           x1c = x1 - x1m,  # group mean center within each cluster
           x2c = x2 - x2m
           ) |> 
    ungroup() 
  m_wb <- lmer(y ~ x1c + x1m + x2c + x2m + (1|group), data = d2)
  fixef(m_wb)["x1c"]
})
hist(x1_coefs_wb, main = "Distribution of estimated x1c coefficients\nWithin-Between Mixed Effect model", 
     xlab = "x1c")
abline(v = 0.8, col = "red", lwd = 2)
Histogram of estimated x1c coefficients from within-between mixed effect models. The distribution is symmetric about the true value of x1.

Despite the omitted confounder we’re now making unbiased estimates for the x1 coefficient using a mixed effect model. Of course this procedure assumes we have specified a correct model and requires estimating more parameters. There is no free lunch.

Clustered Standard Errors for Fixed Effect Models

Let’s look again at the output of the first fixed effect model we fit at the beginning of the article using the fixest package.


m2

OLS estimation, Dep. Var.: y
Observations: 70
Fixed-effects: group: 7
Standard-errors: IID 
  Estimate Std. Error t value  Pr(>|t|)    
x  3.13331   0.059198 52.9293 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.33885     Adj. R2: 0.979512
                Within R2: 0.978348

Notice the line that says, Standard-errors: IID. This tells us the standard errors were calculated assuming the noise (or error) in our model was independent and identically distributed (IID). This was indeed the case for the data we simulated. Within and between all clusters, the error was independently sampled from a single normal distribution with a fixed standard deviation. Put another way, the errors associated with all observations were identical to one another in that they were drawn from the same probability distribution.

The standard error calculated for x was 0.059. This summarizes the uncertainty in our estimated coefficient. We estimate x to be about 3.133 give or take 0.059. If we add and subtract two standard errors we can get an approximate 95% confidence interval.


3.133 + c(-1,1)*2*0.059

[1] 3.015 3.251

In real life data, the IID assumption may not be plausible. For example, if we have observations collected over time within our clusters, we might imagine those observations having correlated errors. In that case the IID assumption would be violated and the estimated standard errors would be too small. Therefore you will often see fixed effect regression performed with clustered standard errors.

“Clustered standard errors” is a catch-all phrase for a variety of methods to adjust standard errors. The IID assumption can be violated in multiple ways.

  • errors can be heteroskedastic. That means the error variance is related to variables in the model or differs between clusters.
  • errors can be correlated within clusters. For data collected over time, that can mean observations closer together may be more similar than observations further apart.
  • errors can be both heteroskedastic and correlated.
  • errors can be spatially correlated. That means observations closer together in space may be more similar than observations further from one another.
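In fixest, clustered standard errors are requested by passing a formula to the vcov argument. As a sketch, here is the first fixed effect model refit with standard errors clustered by group. (With only seven clusters, cluster-robust standard errors are themselves estimated with considerable uncertainty, so treat this as illustration rather than a recommendation.)

```r
# Standard errors clustered by group: robust to heteroskedasticity
# and to arbitrary correlation of errors within each cluster
m2_cl <- feols(y ~ x | group, data = d, vcov = ~group)
summary(m2_cl)
```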

Let’s demonstrate heteroskedastic errors. Below we generate seven clusters of data containing 50 observations each. When we simulate error for each group, we do so with increasing levels of variance. Notice the vector we create called gsd (group standard deviation) which is simply the numbers 3 - 9, each repeated 50 times. These will serve as increasing values of standard deviation for the normally distributed errors.


set.seed(4)
j <- 7
n <- 50
x <- rep(rnorm(n = n), j)
X <- model.matrix(~ -1 + x)
B <- 3.2
group <- factor(rep(1:j, each = n))
C <- model.matrix(~ group)
alpha <- rnorm(j, mean = 0, sd = 4.5)  # the intercepts
gsd <- rep(3:9, each = n) # error increasing by group
e <- rnorm(n * j, mean = 0, sd = gsd)
y <- X %*% B + C %*% alpha + e
d_fem_hetero <- data.frame(y, x, group)
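As a quick sanity check, we can compute the empirical standard deviation of simulated errors within each group and confirm they roughly track the true values 3 through 9. (This sketch is a fresh, self-contained draw for illustration, so the values won’t exactly match the errors generated above.)

```r
# Sanity check: the empirical SD of heteroskedastic errors by group
# should roughly track the true standard deviations 3 through 9.
set.seed(4)
j <- 7
n <- 50
group <- factor(rep(1:j, each = n))
gsd <- rep(3:9, each = n)   # true group standard deviations
e <- rnorm(n * j, mean = 0, sd = gsd)
round(tapply(e, group, sd), 2)
```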

Now let’s plot the data and facet by group. Notice the variation increases by group. We have obvious heteroskedastic errors.


ggplot(d_fem_hetero) +
  aes(x = x, y = y) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~group)
Scatterplots of y versus x faceted by group, with linear trend lines through points. The scatter about the line increases gradually for each group.

If we fit a fixed effect model using feols() with the default settings, we’ll get IID standard errors. This results in a standard error for the x coefficient that is too small.


m_iid <- feols(y ~ x | group, data = d_fem_hetero)
m_iid

OLS estimation, Dep. Var.: y
Observations: 350
Fixed-effects: group: 7
Standard-errors: IID 
  Estimate Std. Error t value   Pr(>|t|)    
x  3.08007    0.36544 8.42838 9.8907e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 6.143     Adj. R2: 0.371161
              Within R2: 0.171988

To adjust the standard error estimate for heteroskedasticity we set the vcov argument to “hetero”.


m_hetero <- feols(y ~ x | group, data = d_fem_hetero, vcov = "hetero")
m_hetero

OLS estimation, Dep. Var.: y
Observations: 350
Fixed-effects: group: 7
Standard-errors: Heteroskedasticity-robust 
  Estimate Std. Error t value   Pr(>|t|)    
x  3.08007   0.400391 7.69265 1.5532e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 6.143     Adj. R2: 0.371161
              Within R2: 0.171988

Now the standard error is a bit higher (0.40 vs 0.37) reflecting the increased uncertainty in our estimated coefficient due to the heteroskedasticity.
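As an aside, the fixest package’s etable() function prints multiple models side by side, which makes this kind of standard error comparison easier to see as models accumulate. A minimal usage sketch, assuming the m_iid and m_hetero objects fit above:

```r
# display both fits in one table; the coefficient estimate for x is
# identical in each column, only the standard errors differ
etable(m_iid, m_hetero)
```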

We can also simulate clustered data with correlated errors. Below we modify our data generation code to simulate errors from an autoregressive (AR) process using the base R function arima.sim(). The argument order = c(1,0,0) specifies an AR process of order 1. The argument ar = 0.7 specifies the coefficient in the process. (We usually symbolize this coefficient with \(\phi\).) This says each error is a function of the previous error multiplied by 0.7 plus some random noise (\(a_t\)) drawn from a normal distribution with mean 0 and standard deviation 2.5 (sd=2.5):

\[\epsilon_t = 0.7\epsilon_{t-1} + a_t \]
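To make the recursion concrete, here is a minimal hand-rolled version of this process. This is just a sketch; arima.sim() handles initialization and burn-in more carefully, so the two will not produce identical series.

```r
# simulate n draws from an AR(1) process: e[t] = phi * e[t-1] + a[t]
sim_ar1 <- function(n, phi = 0.7, sd = 2.5) {
  e <- numeric(n)
  # start from the stationary distribution of the process
  e[1] <- rnorm(1, mean = 0, sd = sd / sqrt(1 - phi^2))
  for (t in 2:n) e[t] <- phi * e[t - 1] + rnorm(1, mean = 0, sd = sd)
  e
}
set.seed(1)
round(head(sim_ar1(50)), 2)
```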

Notice we generate the correlated errors separately for each group using the replicate() function.


set.seed(5)
j <- 7
n <- 50
x <- rep(seq(0, 6, length.out = n), j)
X <- model.matrix(~ -1 + x)
B <- 3.2
group <- factor(rep(1:j, each = n))
C <- model.matrix(~ group)
alpha <- rnorm(j, mean = 0, sd = 4.5)  # the intercepts
e <- as.vector(replicate(n = j, 
               arima.sim(list(order = c(1,0,0), ar = 0.7), 
                         n = n, 
                         sd = 2.5)))
y <- X %*% B + C %*% alpha + e
d_fem_corr <- data.frame(y, x, group)
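We can verify the simulated errors behave as intended by checking the lag-1 sample autocorrelation of an arima.sim() draw with ar = 0.7, which should land near the true \(\phi\) of 0.7. (This sketch uses a fresh, longer draw for illustration, not the exact errors above.)

```r
# lag-1 autocorrelation of a long AR(1) series; should be close to 0.7
set.seed(5)
e1 <- as.vector(arima.sim(list(order = c(1,0,0), ar = 0.7),
                          n = 1000, sd = 2.5))
cor(e1[-1], e1[-1000])
```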

When we plot the raw data we can see the correlation exhibited by the way the data snakes its way around the trend line.


ggplot(d_fem_corr) +
  aes(x = x, y = y) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~group)
Scatterplots of y versus x faceted by group, with linear trend lines through points. The scatter about the line exhibits serial correlation that snakes around the trend line.

If we neglect this feature of the data and fit a fixed effect model assuming IID errors, the standard error for the x coefficient will be too small.


m_corr <- feols(y ~ x | group, data = d_fem_corr)
m_corr

OLS estimation, Dep. Var.: y
Observations: 350
Fixed-effects: group: 7
Standard-errors: IID 
  Estimate Std. Error t value  Pr(>|t|)    
x  2.49217   0.106917 23.3094 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.49388     Adj. R2: 0.742396
                Within R2: 0.613702

To address the correlated error we need to specify vcov = "cluster".


m_corr <- feols(y ~ x | group, data = d_fem_corr, vcov = "cluster")
m_corr

OLS estimation, Dep. Var.: y
Observations: 350
Fixed-effects: group: 7
Standard-errors: Clustered (group) 
  Estimate Std. Error t value   Pr(>|t|)    
x  2.49217   0.359792  6.9267 0.00044835 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 3.49388     Adj. R2: 0.742396
                Within R2: 0.613702

Now the standard error for x is much larger (0.36 vs 0.11) reflecting the uncertainty due to the correlated errors.

It’s worth pointing out that not all problems with heteroskedasticity and serial correlation necessarily require clustered standard errors. It’s possible to directly model non-constant variance and the correlation structure of residuals. Let’s briefly demonstrate both of these approaches.

To model heteroskedasticity we can use the gls() function from the nlme package. Working again with the “d_fem_hetero” data frame, we model y as a function of x and group, just as we would with a fixed effects model. But notice the weights argument. We use this to specify a model for the residual standard error. The syntax varIdent(form = ~ 1|group) says, “estimate a separate residual standard deviation for each group.” Note that this is precisely how we generated the heteroskedasticity in our errors.


m_hetero2 <- gls(y ~ group + x, data = d_fem_hetero,
                 weights = varIdent(form = ~ 1|group))
summary(m_hetero2)

Generalized least squares fit by REML
  Model: y ~ group + x 
  Data: d_fem_hetero 
       AIC      BIC    logLik
  2192.662 2250.184 -1081.331

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | group 
 Parameter estimates:
       1        2        3        4        5        6        7 
1.000000 1.505008 2.074188 1.863696 2.845122 2.480032 3.549500 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept) -3.028291 0.3830926 -7.904855  0.0000
group2      -2.663760 0.6825108 -3.902883  0.0001
group3      -1.152987 0.8697514 -1.325651  0.1858
group4       1.941289 0.7988806  2.430011  0.0156
group5       8.679725 1.1390937  7.619851  0.0000
group6      -2.903366 1.0100313 -2.874531  0.0043
group7      -3.699670 1.3928921 -2.656106  0.0083
x            2.866368 0.2723858 10.523192  0.0000

...(output omitted)

Standardized residuals:
         Min           Q1          Med           Q3          Max 
-2.772302028 -0.682716699  0.005596002  0.688452501  2.553382964 

Residual standard error: 2.670852 
Degrees of freedom: 350 total; 342 residual

First, notice the standard error of x (0.27) is lower than the heteroskedasticity-robust standard error we calculated earlier (0.40). In addition, notice that the residual standard error is estimated to be about 2.7 (bottom of output). In the “Variance function” section we see estimated standard deviation multipliers for the seven groups. The first estimate is 1, which implies the residual standard error in group 1 is 2.7 * 1 = 2.7. For group 2, the estimated residual standard error is about 2.7 * 1.5 = 4.05. For group 3, it is about 2.7 * 2.1 = 5.67. And so on. This model for the variance should make sense since we simply used the numbers 3 - 9 to gradually increase the standard deviation within each group. Relative to group 1, the true standard deviations increase by the following factors:


4:9/3

[1] 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000

It is these factors that the variance model is estimating. We can use the intervals() function to calculate 95% confidence intervals for these estimates and see that the true factor values fall comfortably within the margin of error.


ints1 <- intervals(m_hetero2)
ints1$varStruct

     lower     est.    upper
2 1.136822 1.505008 1.992438
3 1.562462 2.074188 2.753511
4 1.407864 1.863696 2.467116
5 2.148052 2.845122 3.768401
6 1.872009 2.480032 3.285540
7 2.680030 3.549500 4.701048
attr(,"label")
[1] "Variance function:"
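We can also convert the multipliers back into per-group residual standard deviations by multiplying them by the overall residual standard error; the results should roughly track the true values 3 through 9. A sketch, assuming the m_hetero2 object fit above (the coef() extraction follows the nlme documentation for variance functions):

```r
# estimated residual SD per group = overall sigma * group multiplier
m_hetero2$sigma * coef(m_hetero2$modelStruct$varStruct,
                       unconstrained = FALSE, allCoef = TRUE)
```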

We can also use the gls() function to model serial correlation. Working again with the “d_fem_corr” data frame, we model y as a function of x and group, just as we would with a fixed effects model. But notice the correlation argument. We use this to specify a model for the correlated residuals. The syntax corAR1(form = ~ 1|group) says, “model the residuals within each group with an autoregressive correlation structure of order 1.” Note that this is precisely how we generated the serial correlation in our errors.


m_corr2 <- gls(y ~ group + x, data = d_fem_corr, 
               correlation = corAR1(form = ~ 1 | group))
summary(m_corr2, corr = F)

Generalized least squares fit by REML
  Model: y ~ group + x 
  Data: d_fem_corr 
       AIC      BIC    logLik
  1641.795 1680.143 -810.8974

Correlation Structure: AR(1)
 Formula: ~1 | group 
 Parameter estimate(s):
      Phi 
0.7594353 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept) -1.750367 1.6076034 -1.088805  0.2770
group2       6.102365 1.9748768  3.089998  0.0022
group3      -5.298592 1.9748768 -2.682999  0.0077
group4      -2.279637 1.9748768 -1.154318  0.2492
group5       6.183463 1.9748768  3.131063  0.0019
group6      -2.853353 1.9748768 -1.444826  0.1494
group7      -1.901256 1.9748768 -0.962722  0.3364
x            2.703868 0.2654806 10.184805  0.0000

...(output omitted)

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-3.13528615 -0.57798234  0.01363745  0.53826650  2.44047434 

Residual standard error: 3.87491 
Degrees of freedom: 350 total; 342 residual

First, notice the standard error on x is lower than the clustered standard error (0.27 vs 0.36). But also notice under the “Correlation Structure” section we see an estimate for the parameter phi of about 0.759. This is the estimated \(\phi\) of the AR(1) process. When we simulated the “d_fem_corr” data above with serial correlation, we set \(\phi = 0.7\). The estimate of 0.759 is pretty close to the true value. When we use the intervals() function to calculate a 95% confidence interval for \(\phi\), the true value of 0.7 falls comfortably within the margin of error.


ints2 <- intervals(m_corr2)
ints2$corStruct

        lower      est.     upper
Phi 0.6642848 0.7594353 0.8303664
attr(,"label")
[1] "Correlation structure:"

It’s important to think carefully about your analysis and reflect on whether you should calculate different standard errors or fit a different model. Directly modeling heteroskedasticity may seem like a good idea, but if we had 100 groups, that would mean estimating an additional 99 parameters. In that case it may make more sense to calculate clustered standard errors. Likewise, calculating clustered standard errors for serial correlation may seem easier than directly modeling the correlation, but a good model for the residual correlation may result in a lower standard error for a coefficient of interest.
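One practical way to decide whether a correlation structure earns its keep is a likelihood ratio test comparing gls() fits with and without it. A sketch, assuming the “d_fem_corr” data frame from above (both models have the same fixed effects, so the REML-based comparison is valid):

```r
# compare GLS fits with and without the AR(1) correlation structure
m_plain <- gls(y ~ group + x, data = d_fem_corr)
m_ar1   <- gls(y ~ group + x, data = d_fem_corr,
               correlation = corAR1(form = ~ 1 | group))
anova(m_plain, m_ar1)  # a small p-value favors keeping the AR(1) term
```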

Final comments

There is much more to be said about mixed effect and fixed effect models. This article does not pretend to be a comprehensive overview of either method. Rather, the goal was to help researchers and analysts familiar with one method get acquainted with the other. In general, mixed effect models are probably more appropriate for designed experiments where the clusters are randomly sampled from a larger population and it’s possible to manipulate subjects’ exposure to variables of interest. Fixed effect models are probably more appropriate for observational data where the clusters are non-random and there is no ability to manipulate subjects’ exposure to variables of interest. However, that’s not a strict rule.

See McNeish and Kelley (2018) for a detailed discussion of the differences between mixed effect and fixed effect models. For a deeper dive into fixed effect models from a causal inference perspective, see chapter 16 of Huntington-Klein (2025). The fixest developers also provide a friendly introduction to the fixest package. For a thorough treatment of mixed effect models, see West et al. (2022).

R session details

The analysis was done using the R Statistical language (v4.5.2; R Core Team, 2025) on Windows 11 x64, using the packages lme4 (v2.0.1), Matrix (v1.7.4), fixest (v0.13.2), nlme (v3.1.168), ggplot2 (v4.0.2) and dplyr (v1.2.0).

References


Clay Ford
Statistical Research Consultant
University of Virginia Library
March 10, 2026


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.