First off, what is endogeneity, and why would we want to simulate it?

Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. The reason we would want to simulate it is to understand what exactly that definition means!

Let’s first simulate ideal data for simple linear regression using R.

```
set.seed(1)
x <- seq(1,5,length.out = 100) # independent variable
e <- rnorm(100,sd = 1.5) # error (uncorrelated with x)
y <- 2 + 2*x + e # dependent variable
```

We generate `x`

and `e`

independent of one another. Then we generate `y`

using `x`

and `e`

. The `set.seed()`

function allows us to generate the same data in case you want to follow along. Next we can use linear regression to attempt to unravel the relationship between `y`

and `x`

. Since we generated the data, we know `y`

has a linear relationship with `x`

with a slope of 2 and an intercept of 2 plus some random noise from a normal distribution with a standard deviation of 1.5. But our statistical modeling function, `lm()`

, doesn't know this. How does it do?

```
lm1 <- lm(y ~ x)
coef(lm1)[2] # extract the slope coefficients from the model object, lm1
```

```
x
1.983255
```

Not bad! The `lm()`

function used a method called "ordinary least squares" regression to figure out that `y`

increases by about 1.98 for every one unit increase in x. Recall the actual value in the data generation process was 2. This works so well because all of the assumptions for ordinary least squares regression have been met. A Google search will turn up those assumptions, but the big one we're interested in is the assumption that the independent variable(s), `x`

in our generated data, is independent of the error in the model. In other words, no matter what value our predictor variable `x`

takes, we expect the error to be normally distributed with mean 0 and some finite standard deviation that we usually signify with \(\sigma\).

Now let's generate some data where this assumption is *not* met. This is a little trickier. We create an R function called `genEndogData()`

to simulate the data, which we can then use in other simulations. I'll provide the code and then describe what it does.

```
genEndogData <- function(n=100){
sigma <- matrix(c(40,12,12,20), ncol=2)
de <- mvtnorm::rmvnorm(n=n, mean=c(0,0), sigma=sigma)
Z <- seq(1,5,length.out = n)
X <- 3*Z + de[,1]
Y <- 2*X + de[,2]
data.frame(Y,X,Z)
}
```

`genEndogData()`

is the name of the function, and it has one argument, `n`

, that allows you to specify how many rows of data you want. The default is 100. The first line defines a matrix we call `sigma`

. This will be our variance-covariance matrix for the next line, which generates data from a multivariate normal distribution. Why do that? Because we want to generate correlated data. In this case we're generating two columns of data in a matrix called `de`

that have positive covariance. We will use these two columns in the following lines to simulate error. (I could have added additional arguments to allow different `sigma`

matrices be specified, but I opted to keep the function simple for this post.)

In the next three lines, we define variables `Z`

, `X`

and `Y`

. `Z`

is simply a sequence of 100 numbers from 1 to 5. Then we derive `X`

in the same way we originally derived our "ideal" data using `Z`

. Finally we derive `Y`

just like we did `X`

, except now we use `X`

instead of `Z`

! Notice in each case we use a column from the `de`

matrix as our error, which are correlated. To end the function, we put our variables in a data frame. To use this function with the default `n`

, we just run `genEndogData()`

.

What happens now when we regress `Y`

on `X`

using simulated data from `genEndogData()`

? Do we come close to the true coefficient 2?

```
set.seed(1)
lm2 <- lm(Y ~ X, data=genEndogData())
coef(lm2)[2]
```

```
X
2.228276
```

Not as close before. Maybe that was random chance. Instead of trying once, we can try it 1000 times using the `replicate()`

function and see how often we come close to 2:

```
xout <- replicate(n = 1000,expr = coef(lm(Y ~ X, genEndogData()))[2])
summary(xout)
```

```
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.012 2.194 2.229 2.231 2.269 2.393
```

75% of the data is greater than 2.19. For some reason the estimate is biased. Why is that happening? Endogeneity. *The error in our data generation process is correlated with the independent variable*. To say it another way, `X`

is endogenous, which means a major assumption of ordinary least squares regression is violated and it no longer works as advertised.

How can we unravel the true relationship? One approach would be to take our cues from the data-generating process. Notice `X`

was generated from `Z`

with error. We could regress `X`

on `Z`

to create a model, and then get fitted values for `X`

. These fitted values of `X`

would be our best guess of `X`

without the error. We could then regress `Y`

on the fitted values of `X`

to get a better sense of the relationship between `Y`

and `X`

. So we do two stages of regression. Let's try it.

```
set.seed(111)
dat <- genEndogData()
stage1 <- lm(X ~ Z, data=dat) # first stage
dat$fitX <- fitted(stage1) # get fitted values for X
stage2 <- lm(Y ~ fitX, data=dat) # second stage
coef(stage2)[2]
```

```
fitX
2.034136
```

Compare to the traditional OLS approach:

```
coef(lm(Y ~ X, data=dat))[2]
```

```
X
2.24973
```

We see our two-stage approach got us much closer to the true slope of 2. In fact there's a method called two-stage least squares that does precisely this. The implementation is different than what was demonstrated above, but the idea is the same. The sem package provides a function called `tsls()`

that performs two-stage least squares. A quick demo:

```
install.packages("sem") # if you don't already have the sem package
library(sem)
coef(tsls(Y ~ X, ~ Z, data=dat))[2]
```

```
X
2.034136
```

This works great with simulated data when we know we have endogeneity. We knew we had this additional variable `Z`

on which we could regress `X`

to get an updated version of `X`

without the error. However in real life it's very hard to find a Z (or as it's formally called, an "instrument"). And that's assuming you've detected the endogeneity! How would you know?

Fortunately, there are statistical tests for endogeneity. One such test is the Durbin–Wu–Hausman test. There are also several tests regarding the strength of your instrument. And now we're stepping into a large and complex domain of econometrics that doesn't neatly fit into a single article! We'll stop here, but hopefully you have a better understanding of what endogeneity means and what causes it.

## References

- Fox, J., Nie, Z., & Byrnes, J. (2022).
*sem: Structural equation models*(Version 3.1-15). https://CRAN.R-project.org/package=sem - Freedman, D. (2009).
*Statistical models: theory and practice, revised edition*. Cambridge University Press. - R Core Team. (2023).
*R: A language and environment for statistical computing*. R Foundation for Statistical Computing. https://www.R-project.org/

*Clay Ford
Statistical Research Consultant
University of Virginia Library
September 10, 2015
Updated May 23, 2023*

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**

View the entire collection of UVA Library StatLab articles, or learn how to cite.