Simulating Endogeneity

First off, what is endogeneity, and why would we want to simulate it?

Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. The reason we would want to simulate it is to understand what exactly that definition means!

Let’s first simulate ideal data for simple linear regression using R.


set.seed(1)
x <- seq(1,5,length.out = 100) # independent variable
e <- rnorm(100,sd = 1.5) # error (uncorrelated with x)
y <- 2 + 2*x + e # dependent variable

We generate x and e independent of one another. Then we generate y using x and e. The set.seed() function allows us to generate the same data in case you want to follow along. Next we can use linear regression to attempt to unravel the relationship between y and x. Since we generated the data, we know y has a linear relationship with x with a slope of 2 and an intercept of 2 plus some random noise from a normal distribution with a standard deviation of 1.5. But our statistical modeling function, lm(), doesn't know this. How does it do?


lm1 <- lm(y ~ x)
coef(lm1)[2] # extract the slope coefficient from the model object, lm1


       x 
1.983255 

Not bad! The lm() function used a method called "ordinary least squares" regression to figure out that y increases by about 1.98 for every one unit increase in x. Recall the actual value in the data generation process was 2. This works so well because all of the assumptions for ordinary least squares regression have been met. A Google search will turn up those assumptions, but the big one we're interested in is the assumption that the independent variable(s), x in our generated data, is independent of the error in the model. In other words, no matter what value our predictor variable x takes, we expect the error to be normally distributed with mean 0 and some finite standard deviation that we usually signify with \(\sigma\).
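
Before moving on, a quick sanity check on the "ideal" data we just generated: the error should be centered near 0 and essentially uncorrelated with x. (The values won't be exactly 0 in a finite sample.)


mean(e)    # should be close to 0
cor(x, e)  # should be close to 0
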

Now let's generate some data where this assumption is not met. This is a little trickier. We create an R function called genEndogData() to simulate the data, which we can then use in other simulations. I'll provide the code and then describe what it does.


genEndogData <- function(n=100){
  sigma <- matrix(c(40,12,12,20), ncol=2)                 # variance-covariance matrix for the errors
  de <- mvtnorm::rmvnorm(n=n, mean=c(0,0), sigma=sigma)   # two columns of correlated errors
  Z <- seq(1,5,length.out = n)                            # exogenous variable
  X <- 3*Z + de[,1]                                       # X depends on Z plus error
  Y <- 2*X + de[,2]                                       # Y depends on X plus error correlated with X's error
  data.frame(Y,X,Z)
}

genEndogData() is the name of the function, and it has one argument, n, that allows you to specify how many rows of data you want. The default is 100. The first line defines a matrix we call sigma. This will be our variance-covariance matrix for the next line, which generates data from a multivariate normal distribution. Why do that? Because we want to generate correlated data. In this case we're generating two columns of data in a matrix called de that have positive covariance. We will use these two columns in the following lines to simulate error. (I could have added additional arguments to allow different sigma matrices to be specified, but I opted to keep the function simple for this post.)
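
To make the covariance concrete: the off-diagonal value of 12 in sigma implies a correlation of 12/sqrt(40*20), or about 0.42, between the two error columns. A quick check confirms this (the snippet below is just an illustration of what sigma implies and assumes the mvtnorm package is installed):


sigma <- matrix(c(40,12,12,20), ncol=2)
cov2cor(sigma)       # implied correlation matrix; off-diagonal is about 0.42
de <- mvtnorm::rmvnorm(n=1e5, mean=c(0,0), sigma=sigma)
cor(de[,1], de[,2])  # empirical correlation, close to the implied value
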

In the next three lines, we define variables Z, X, and Y. Z is simply a sequence of 100 numbers from 1 to 5. Then we derive X from Z in much the same way we originally generated our "ideal" data. Finally we derive Y just like we did X, except now we use X instead of Z! Notice that in each case we use a column of the de matrix as the error, and recall that those two columns are correlated with one another. To end the function, we put our variables in a data frame. To use this function with the default n, we just run genEndogData().
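
It may help to see the endogeneity directly. Because X is built from de[,1], and de[,1] is correlated with de[,2] (the error that enters Y), X itself ends up correlated with the error in the Y equation. Here is a small sketch that repeats the function's steps outside the function (the seed and sample size are arbitrary choices for illustration):


set.seed(2)
n <- 1000
sigma <- matrix(c(40,12,12,20), ncol=2)
de <- mvtnorm::rmvnorm(n=n, mean=c(0,0), sigma=sigma)
Z <- seq(1,5,length.out = n)
X <- 3*Z + de[,1]
cor(X, de[,2])   # positive: X is correlated with the error that enters Y
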

What happens now when we regress Y on X using simulated data from genEndogData()? Do we come close to the true coefficient 2?


set.seed(1)
lm2 <- lm(Y ~ X, data=genEndogData())
coef(lm2)[2]


       X 
2.228276 

Not as close as before. Maybe that was just random chance. Instead of trying once, we can try it 1000 times using the replicate() function and see how often we come close to 2:


xout <- replicate(n = 1000, expr = coef(lm(Y ~ X, genEndogData()))[2])
summary(xout)


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.012   2.194   2.229   2.231   2.269   2.393 

75% of the estimates are greater than 2.19. The estimates are consistently biased upward. Why is that happening? Endogeneity. The error in our data generation process is correlated with the independent variable. To say it another way, X is endogenous, which means a major assumption of ordinary least squares regression is violated and it no longer works as advertised.
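
In fact, since the minimum of the 1000 estimates is above 2, every single estimate overshot the true slope. We can confirm that directly:


mean(xout > 2)   # proportion of estimates above the true slope of 2
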

How can we unravel the true relationship? One approach would be to take our cues from the data-generating process. Notice X was generated from Z with error. We could regress X on Z to create a model, and then get fitted values for X. These fitted values of X would be our best guess of X without the error. We could then regress Y on the fitted values of X to get a better sense of the relationship between Y and X. So we do two stages of regression. Let's try it.


set.seed(111)
dat <- genEndogData()
stage1 <- lm(X ~ Z, data=dat) # first stage
dat$fitX <- fitted(stage1) # get fitted values for X
stage2 <- lm(Y ~ fitX, data=dat) # second stage
coef(stage2)[2]


    fitX 
2.034136 

Compare to the traditional OLS approach:


coef(lm(Y ~ X, data=dat))[2]


      X 
2.24973 

We see our two-stage approach got us much closer to the true slope of 2. In fact, there's a method called two-stage least squares that does precisely this. Its implementation differs from what we demonstrated above, but the idea is the same. The sem package provides a function called tsls() that performs two-stage least squares. A quick demo:


install.packages("sem") # if you don't already have the sem package
library(sem)
coef(tsls(Y ~ X, ~ Z, data=dat))[2]


       X 
2.034136

This works great with simulated data when we know we have endogeneity. We knew we had this additional variable Z on which we could regress X to get an updated version of X without the error. However, in real life it's very hard to find a Z (or, as it's formally called, an "instrument"). And that's assuming you've detected the endogeneity in the first place! How would you know?

Fortunately, there are statistical tests for endogeneity. One such test is the Durbin–Wu–Hausman test (a quick sketch of the idea follows below). There are also several tests regarding the strength of your instrument. And now we're stepping into a large and complex domain of econometrics that doesn't neatly fit into a single article! We'll stop here, but hopefully you have a better understanding of what endogeneity means and what causes it.
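
One common regression-based version of the Durbin–Wu–Hausman test adds the first-stage residuals to the original regression and checks whether their coefficient differs from zero; a coefficient significantly different from zero is evidence of endogeneity. Here is a minimal sketch of that idea using the dat object from above (offered as an illustration, not a full treatment):


dat$res1 <- resid(lm(X ~ Z, data=dat))   # residuals from the first-stage regression
dwh <- lm(Y ~ X + res1, data=dat)        # original regression augmented with the residuals
summary(dwh)$coefficients["res1", ]      # a small p-value suggests X is endogenous
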


References

  • Fox, J., Nie, Z., & Byrnes, J. (2022). sem: Structural equation models (Version 3.1-15). https://CRAN.R-project.org/package=sem
  • Freedman, D. (2009). Statistical models: theory and practice, revised edition. Cambridge University Press.
  • R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Clay Ford
Statistical Research Consultant
University of Virginia Library
September 10, 2015
Updated May 23, 2023


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.