One of the basic assumptions of linear modeling is constant, or *homogeneous*, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.

Below we create a sorted vector of numbers ranging from 1 to 10 called `x`

, and then create a vector of numbers called `y`

that is a function of `x`

. When we plot `x`

vs `y`

, we get a straight line with an intercept of 1.2 and a slope of 2.1.

```
x <- seq(1,10, length.out = 100)
y <- 1.2 + 2.1 * x
plot(x, y)
```

Now let’s add some “noise” to our data so that `y`

is not completely determined by `x`

. We can do that by randomly drawing values from a theoretical normal distribution with mean 0 and some set variance, and then adding them to the formula that generates `y`

. The `rnorm()`

function in R allows us to easily do this. Below we draw 100 random values from a normal distribution with mean 0 and standard deviation 2 and save as a vector called `noise`

. (Recall that standard deviation is simply the square root of variance.) Then we generate `y`

with the noise added. The function `set.seed(222)`

allows you to get the same “random” data in case you want to follow along. Finally we re-plot the data.

```
set.seed(222)
noise <- rnorm(n = 100, mean = 0, sd = 2)
y <- 1.2 + 2.1 * x + noise
plot(x, y)
```

Now we have data that are a combination of a linear, deterministic component (\(y = 1.2 + 2.1x\)) and random noise drawn from an \(N(0, 2)\) distribution. These are the basic assumptions we make about our data when we fit a traditional linear model. Below we use the `lm()`

function to “recover” the “true” values we used to generate the data, of which there are three:

- The intercept: 1.2
- The slope: 2.1
- The standard deviation: 2

```
m <- lm(y ~ x)
summary(m)
```

```
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-5.5831 -1.2165 0.3288 1.3022 4.3714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.27426 0.44720 2.849 0.00534 **
x 2.09449 0.07338 28.541 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.926 on 98 degrees of freedom
Multiple R-squared: 0.8926, Adjusted R-squared: 0.8915
F-statistic: 814.6 on 1 and 98 DF, p-value: < 2.2e-16
```

The intercept and slope estimates in the "Coefficients" section, 1.27 and 2.09, are quite close to 1.2 and 2.1, respectively. The residual standard error of 1.926 is also quite close to the *constant value* of 2. We produced a “good” model because we knew how `y`

was generated and gave the `lm()`

function the “correct” model to fit. While we can’t do this in real life, it’s a useful exercise to help us understand the assumptions of linear modeling.

Now what if the variance was *not* constant? What if we multiplied the standard deviation of 2 by the square root of `x`

? As we see in the plot below, the vertical scatter of the points increases as `x`

increases.

```
set.seed(222)
noise <- rnorm(n = 100, mean = 0, sd = 2 * sqrt(x))
y <- 1.2 + 2.1 * x + noise
plot(x, y)
```

We multiplied 2 by `sqrt(x)`

because we’re specifying standard deviation. If we could specify variance, we would have multiplied 4 by just `x`

.

If we fit the same model using `lm()`

, we get a residual standard error of 4.488.

```
m2 <- lm(y ~ x)
summary(m2)
```

```
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-15.0460 -2.4013 0.5638 2.8734 10.2996
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.313 1.042 1.26 0.211
x 2.096 0.171 12.26 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.488 on 98 degrees of freedom
Multiple R-squared: 0.6051, Adjusted R-squared: 0.6011
F-statistic: 150.2 on 1 and 98 DF, p-value: < 2.2e-16
```

We know that isn’t right, because we simulated the data. There was no constant standard deviation when we created the “noise.” Each random value was drawn from a different normal distribution, each with mean 0 but a standard deviation that varied according to `x`

. This means our assumption of *constant variance* is violated. How would we detect this in real life?

The most common way is plotting residuals versus fitted values. This is easy to do in R. Just call `plot()`

on the model object. This generates four different plots to assess the traditional modeling assumptions. See this blog post for more information. The plots we’re interested in are the 1st and 3rd plots, which we can specify with the `which`

argument.

```
plot(m2, which = 1)
```

```
plot(m2, which = 3)
```

In the first plot we see the variability around 0 *increase* as we move further to the right with bigger fitted values. In the third plot we also see increasing variability as we move to the right, though this time the residuals have been standardized and transformed to the square root scale. The positive trajectory of the smooth red line indicates an *increase* in variance.

So now that we’ve confirmed that our assumption of non-constant variance is invalid, what can we do? One approach is to log transform the data. This sometimes works if your response variable is positive and highly skewed. But that’s not really the case here. `y`

is only slightly skewed. (Call `hist()`

on `y`

to verify.) Plus, we know our data are not on a log scale.

Another approach is to *model* the non-constant variance. That’s the topic of this blog post.

To do this, we will use functions from the nlme package that is included with the base installation of R. The workhorse function is `gls()`

, which stands for “generalized least squares.” We use it just as we would the `lm()`

function, except we also use the `weights`

argument along with a handful of *variance functions*. Let’s demonstrate with an example.

Below we fit the “correct” model to our data that exhibited non-constant variance. We load the nlme package so we can use the `gls()`

function.^{1} We specify the model syntax as before, `y ~ x`

. Then we use the `weights`

argument to specify the variance function, in this case `varFixed()`

, also part of the nlme package. This says our variance function has no parameters and a single covariate, `x`

, which is precisely how we generated the data. The syntax, `~x`

, is a *one-sided formula* that can be read as “model variance as a function of `x`

.”

```
library(nlme)
vm1 <- gls(y ~ x, weights = varFixed(~x))
summary(vm1)
```

```
Generalized least squares fit by REML
Model: y ~ x
Data: NULL
AIC BIC logLik
576.2928 584.0477 -285.1464
Variance function:
Structure: fixed weights
Formula: ~x
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.369583 0.6936599 1.974431 0.0511
x 2.085425 0.1504863 13.857900 0.0000
Correlation:
(Intr)
x -0.838
Standardized residuals:
Min Q1 Med Q3 Max
-2.8942967 -0.6293867 0.1551594 0.6758773 2.2722755
Residual standard error: 1.925342
Degrees of freedom: 100 total; 98 residual
```

The result produces a residual standard error of 1.925 that is very close to 2, the “true” value we used to generate the data. The intercept and slope values, 1.37 and 2.09, are also very close to 1.2 and 2.1.

Once again we can call `plot()`

on the model object. In this case only one plot is generated: standardized residuals versus fitted values:

```
plot(vm1)
```

This plot looks good. As long as we model our variance as a function of `x`

, the fitted model neither overfits nor underfits in any systematic sort of way (unlike when we used `lm()`

to fit the model and assumed constant variance.)

The `varFixed()`

function creates *weights* for each observation, symbolized as \(w_i\). The idea is that the higher the weight a given observation has, the smaller the variance of the normal distribution from which its noise component was drawn. In our example, as `x`

increases the variance increases. Therefore higher weights should be associated with smaller `x`

values. This can be accomplished by taking the reciprocal of `x`

, that is \(w_i = 1/x\). So when \(x = 2\), \(w = 1/2\). When \(x = 10\), \(w = 1/10\). Larger `x`

get smaller weights.

Finally, to ensure larger weights are associated with smaller variances, we divide the constant variance by the weight. Stated mathematically,

\[\epsilon_i \sim N(0, \sigma^2/w_i)\]

Or taking the square root to express standard deviation,

\[\epsilon_i \sim N(0, \sigma/ \sqrt{w_i})\]

So the larger the denominator (i.e., the larger the weight and, hence, the smaller the `x`

), the smaller the variance and more precise the observed `y`

.

Incidentally, we can use `lm()`

to weight observations as well. It too has a `weights`

argument like the `gls()`

function. The only difference is we have to be more explicit about how we express the weights. In this case, we have to specify the reciprocal of `x`

. Notice the result is almost identical to what we get using `gls()`

and `varFixed()`

.

```
m3 <- lm(y ~ x, weights = 1/x)
summary(m3)
```

```
Call:
lm(formula = y ~ x, weights = 1/x)
Weighted Residuals:
Min 1Q Median 3Q Max
-5.5725 -1.2118 0.2987 1.3013 4.3749
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3696 0.6937 1.974 0.0511 .
x 2.0854 0.1505 13.858 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.925 on 98 degrees of freedom
Multiple R-squared: 0.6621, Adjusted R-squared: 0.6587
F-statistic: 192 on 1 and 98 DF, p-value: < 2.2e-16
```

The power of the nlme package is that it allows a variety of variance functions. The `varFixed()`

function we just illustrated is the simplest and something that can be done with `lm()`

. The other variance functions include:

`varIdent()`

`varPower()`

`varExp()`

`varConstPower()`

`varComb()`

Let’s explore each of these using simulated data.

`varIdent()`

The `varIdent()`

function allows us to model different variances per stratum. To see how it works, we’ll first simulate data with this property. Below we use `set.seed(11)`

in case someone wants to simulate the same random data. Then we set `n`

equal to 400, the number of observations we will simulate. `x`

is generated the same as before. In this example, we include an additional predictor called `g`

which can be thought of as gender. We randomly sample 400 values of `“m”`

and `“f”`

. Next we simulate a vector of two standard deviations and save it as `msd`

. If `g == "m"`

, then the standard deviation is \(2 \times 2.5\). Otherwise it’s \(2\) for `g == "f"`

. We use this vector in the next line to generate `y`

. Notice where `msd`

is plugged into the `rnorm`

function. Also notice how we generate `y`

. We include an *interaction* between `x`

and `y`

. When `g == "f"`

, the intercept and slope is 1.2 and 2.1. When `g == "m"`

, the intercept and slope are (1.2 + 2.8) and (2.1 + 2.8), respectively. Finally we place our simulated data into a data frame and plot with ggplot2.

```
set.seed(11)
n <- 400
x <- seq(1,10, length.out = n)
g <- sample(c("m","f"), size = n, replace = TRUE)
msd <- ifelse(g=="m", 2*2.5, 2)
y <- 1.2 + 2.1*x - 1.5*(g == "m") + 2.8*x*(g == "m") + rnorm(n, sd = msd)
d <- data.frame(y, x, g)
library(ggplot2)
ggplot(d, aes(x, y, color = g)) +
geom_point()
```

Notice the different variances for each level of `g`

. The variance for `“m”`

is much greater than the variance for `“f”`

. It has a lot more scatter than `“f”`

. We simulated the data that way. We set the variance for `“m”`

to be 2.5 times that of `“f”`

.

Now let’s use the `gls()`

function with `varIdent()`

in attempt to recover these true values. We follow the same approach as before, defining our variance function in the `weights`

argument. Below we specify in the `form`

argument that the formula for modeling variance is conditional on `g`

. The expression `~ 1|g`

is a one-sided formula that says the variance differs between the levels of `g`

. The `1`

just means we have no additional covariates in our model for variance. It is possible to include a formula such as `~ x|g`

, but that would be incorrect in this case since we did not use `x`

in generating the variance. Looking at the plot also shows that while the variability in `y`

is different between the groups, the variability *does not increase* as `x`

increases.

```
vm2 <- gls(y ~ x*g, data = d, weights = varIdent(form = ~ 1|g))
summary(vm2)
```

```
Generalized least squares fit by REML
Model: y ~ x * g
Data: d
AIC BIC logLik
2081.752 2105.64 -1034.876
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | g
Parameter estimates:
f m
1.00000 2.58127
Coefficients:
Value Std.Error t-value p-value
(Intercept) 0.9686349 0.3237724 2.99172 0.0029
x 2.1222707 0.0525024 40.42239 0.0000
gm -1.9765090 0.9352152 -2.11343 0.0352
x:gm 2.9957974 0.1553551 19.28355 0.0000
Correlation:
(Intr) x gm
x -0.901
gm -0.346 0.312
x:gm 0.304 -0.338 -0.907
Standardized residuals:
Min Q1 Med Q3 Max
-2.74115039 -0.67013954 0.01619031 0.69793776 2.72985748
Residual standard error: 2.005397
Degrees of freedom: 400 total; 396 residual
```

First, notice at the bottom of the output that the estimated residual standard error is 2.005397, very close to the “true” value of 2. Also notice in the “Variance function” section we get an estimated value of 2.58127 for group `“m”`

, which is very close to the “true” value of 2.5 we used to generate the different variance for group `“m”`

. In general, when you use the `varIdent()`

function to estimate different variances between levels of strata, one of the levels will be set to baseline, and the others will be estimated as multiples of the residual standard error. In this case since `“f”`

comes before `“m”`

alphabetically, `“f”`

was set to baseline, or 1. The estimated residual standard error for group `“f”`

is \(2.005397 \times 1\). The estimated residual standard error for group `“m”`

is \(2.005397 \times 2.58127\).

It’s important to note these are just estimates. To get a feel for the uncertainty of these estimates, we can use the `intervals()`

function from the nlme package to get 95% confidence intervals. To reduce the output, we save the result and view selected elements of the object.

```
int <- intervals(vm2)
```

The `varStruct`

element contains the 95% confidence interval for the parameter estimated in the variance function. The parameter in this case is the residual standard error multiplier for the male group.

```
int$varStruct
```

```
lower est. upper
m 2.245615 2.58127 2.967095
attr(,"label")
[1] "Variance function:"
```

The `sigma`

element contains the 95% confidence interval for the residual standard error.

```
int$sigma
```

```
lower est. upper
1.818639 2.005397 2.211335
attr(,"label")
[1] "Residual standard error:"
```

Both intervals are quite small and contain the “true” value we used to generate the data.

`varPower()`

The `varPower()`

function allows us to model variance as a power of the absolute value of a covariate. Once again, to help explain this we will simulate data with this property. Below, the main thing to notice is the `sd`

argument of the `rnorm()`

function. That’s where we take the standard deviation of 2 and then multiply it by the absolute value of `x`

raised to the power of 1.5. This is similar to how we simulated data to demonstrate the `varFixed()`

function above. In that case we simply assumed the exponent was 0.5. (Recall taking the square root of a number is equivalent to raising it to the power of 0.5.) Here we arbitrarily picked a power of 1.5. When we use `gls()`

with `varPower()`

we will attempt to recover the “true” value of 1.5 in addition to 2.

```
set.seed(4)
n <- 1000
x <- seq(1,10,length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*abs(x)^1.5)
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()
```

We see the data exhibiting the classic form of variance increasing as the predictor increases. To correctly model this data using `gls()`

, we supply it with the formula `y ~ x`

and use the `weights`

argument with `varPower()`

. Notice we specify the one-sided formula just as we did with the `varFixed()`

function. In this model, however, we’ll get an estimate for the power.

```
vm3 <- gls(y ~ x, data = d, weights = varPower(form = ~x))
summary(vm3)
```

```
Generalized least squares fit by REML
Model: y ~ x
Data: d
AIC BIC logLik
8840.188 8859.811 -4416.094
Variance function:
Structure: Power of variance covariate
Formula: ~x
Parameter estimates:
power
1.520915
Coefficients:
Value Std.Error t-value p-value
(Intercept) 2.321502 0.4750149 4.887218 0
x 1.582272 0.2241367 7.059404 0
Correlation:
(Intr)
x -0.845
Standardized residuals:
Min Q1 Med Q3 Max
-2.892319514 -0.655237190 0.001653178 0.691241690 3.346273816
Residual standard error: 1.872944
Degrees of freedom: 1000 total; 998 residual
```

```
int <- intervals(vm3)
int$varStruct
```

```
lower est. upper
power 1.445064 1.520915 1.596765
attr(,"label")
[1] "Variance function:"
```

```
int$sigma
```

```
lower est. upper
1.650987 1.872944 2.124740
attr(,"label")
[1] "Residual standard error:"
```

The power is estimated to be 1.52 which is very close to the “true” value of 1.5. The estimated residual standard error of 1.872944 is also quite close to 2. Both intervals capture the values we used to simulate the data. The coefficients in the model, on the other hand, are somewhat poorly estimated. This is not surprising given how much variability is in `y`

, especially for \(x > 2\).

`varExp()`

The `varExp()`

function allows us to model variance as an exponential function of a covariate. Yet again we’ll explain this variance function using simulated data. The only change is in the `sd`

argument of the `rnorm()`

function. We have a fixed value of 2 that we multiply by `x`

which is itself multiplied by 0.5 and exponentiated. Notice how rapidly the variance increases as we increase `x`

. This is due to the exponential growth of the variance.

```
set.seed(55)
n <- 400
x <- seq(1, 10, length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*exp(0.5*x))
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()
```

To work backwards and recover these values, we use the `varExp()`

function in the `weights`

argument of `gls()`

. The one-sided formula does not change in this case. It says model the variance as a function of `x`

. The `varExp()`

function says that `x`

has been multiplied by some value and exponentiated, so `gls()`

will try to estimate that value.

```
vm4 <- gls(y ~ x, data = d, weights = varExp(form = ~x))
summary(vm4)
```

```
Generalized least squares fit by REML
Model: y ~ x
Data: d
AIC BIC logLik
3873.153 3889.098 -1932.576
Variance function:
Structure: Exponential of variance covariate
Formula: ~x
Parameter estimates:
expon
0.4845623
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.198525 1.1172841 1.072713 0.284
x 2.073297 0.4933502 4.202486 0.000
Correlation:
(Intr)
x -0.892
Standardized residuals:
Min Q1 Med Q3 Max
-3.16976627 -0.69525696 0.00614222 0.63718022 2.90423217
Residual standard error: 2.119192
Degrees of freedom: 400 total; 398 residual
```

```
int <- intervals(vm4)
int$varStruct
```

```
lower est. upper
expon 0.4562871 0.4845623 0.5128374
attr(,"label")
[1] "Variance function:"
```

```
int$sigma
```

```
lower est. upper
1.786737 2.119192 2.513505
attr(,"label")
[1] "Residual standard error:"
```

The “expon” estimate of 0.4845623 in the “Variance function” section is very close to our specified value of 0.5. Likewise, the estimated residual standard error of 2.119192 is close to the “true” value of 2. The model coefficient estimates are also close to the values we used to generate the data, but notice the uncertainty in the intercept. The hypothesis test in the summary can’t rule out a negative intercept. This again is not surprising since `y`

has so much non-constant variance and since we have no observations of `x`

equal to 0. Since the intercept is the value `y`

takes when `x`

equals 0, our estimated intercept is an extrapolation to an event we did not observe.

`varConstPower()`

The `varConstPower()`

function allows us to model variance as a positive constant plus a power of the absolute value of a covariate. That’s a mouthful, but this is basically the same as the `varPower()`

function except now we add a positive constant to it. The following code simulates data for which the `varConstPower()`

function would be suitable to use. Notice it is identical to the code we used to simulate data for the `varPower()`

section, except we add 0.7 to `abs(x)^1.5`

. Why 0.7? No special reason. It’s just a positive constant we picked.

```
set.seed(4)
n <- 1000
x <- seq(1,10,length.out = n)
y <- 1.2 + 2.1*x + rnorm(n, sd = 2*(0.7 + abs(x)^1.5))
d <- data.frame(y, x)
ggplot(d, aes(x, y)) + geom_point()
```

The correct way to model this data is to use `gls()`

with the `varConstPower()`

function. The one-sided formula is the same as before. The summary returns three estimates for the variance model: the constant, the power and the residual standard error. Recall the “true” values are 0.7, 1.5 and 2, respectively.

```
vm5 <- gls(y ~ x, data = d, weights = varConstPower(form = ~x))
summary(vm5)
```

```
Generalized least squares fit by REML
Model: y ~ x
Data: d
AIC BIC logLik
9033.253 9057.782 -4511.627
Variance function:
Structure: Constant plus power of variance covariate
Formula: ~x
Parameter estimates:
const power
0.3388157 1.4463162
Coefficients:
Value Std.Error t-value p-value
(Intercept) 2.550278 0.6140015 4.153538 0
x 1.527665 0.2517976 6.067037 0
Correlation:
(Intr)
x -0.831
Standardized residuals:
Min Q1 Med Q3 Max
-2.843996437 -0.660034231 0.000752882 0.681577120 3.299594323
Residual standard error: 2.199816
Degrees of freedom: 1000 total; 998 residual
```

```
int <- intervals(vm5)
int$varStruct
```

```
lower est. upper
const 0.6010367 0.8170943 1.110819
power 1.4849924 1.5343144 1.583636
attr(,"label")
[1] "Variance function:"
```

```
int$sigma
```

```
lower est. upper
1.698709 1.877371 2.074822
attr(,"label")
[1] "Residual standard error:"
```

The intervals are fairly tight and contain the “true” values we used to generate the data.

`varComb()`

Finally, the `varComb()`

function allows us to model combinations of two or more variance models by multiplying together the corresponding variance functions. Obviously this can accommodate some very complex variance models. We’ll simulate some basic data that would be appropriate to use with the `varComb()`

function.

What we do below is combine two different variance processes:

- One that allows standard deviation to differ between gender (
`“f”`

= \(2\),`“m”`

= \(2 \times 2.5\)) - One that allows standard deviation to increase as
`x`

increases, where`x`

is multiplied by 1.5 and exponentiated

To help with visualizing the data we have limited `x`

to range from 1 to 3.

```
set.seed(77)
n <- 400
x <- seq(1, 3, length.out = n)
g <- sample(c("m","f"), size = n, replace = TRUE)
msd <- ifelse(g=="m", 2*2.5, 2) * exp(1.5*x)
y <- 1.2 + 2.1*x - 1.5*(g == "m") + 2.8*x*(g == "m") + rnorm(n, sd = msd)
d <- data.frame(y, x, g)
ggplot(d, aes(x, y, color = g)) +
geom_point()
```

The plot shows increasing variance as `x`

increases, but also differences in variance between genders. The variance of `y`

for group `“m”`

is much greater than the variance of `y`

in group `“f”`

, especially when `x`

is greater than 1.5.

To correctly model the data generating process we specified above and attempt to recover the true values, we use the `varComb()`

function as a wrapper around two more variance functions: `varIdent()`

and `varExp()`

. Why these two? Because we have different variances between genders, and we have variance increasing exponentially as a function of `x`

.

```
vm6 <- gls(y ~ x*g, data = d,
weights = varComb(varIdent(form = ~ 1|g),
varExp(form = ~x)))
summary(vm6)
```

```
Generalized least squares fit by REML
Model: y ~ x * g
Data: d
AIC BIC logLik
4431.45 4459.32 -2208.725
Combination of variance functions:
Structure: Different standard deviations per stratum
Formula: ~1 | g
Parameter estimates:
f m
1.000000 2.437046
Structure: Exponential of variance covariate
Formula: ~x
Parameter estimates:
expon
1.540608
Coefficients:
Value Std.Error t-value p-value
(Intercept) -5.996337 6.539552 -0.9169339 0.3597
x 8.829971 4.849272 1.8208860 0.0694
gm -17.572238 16.548540 -1.0618603 0.2889
x:gm 16.932938 12.192793 1.3887661 0.1657
Correlation:
(Intr) x gm
x -0.972
gm -0.395 0.384
x:gm 0.387 -0.398 -0.974
Standardized residuals:
Min Q1 Med Q3 Max
-2.74441479 -0.67478610 -0.00262221 0.66079254 3.14975288
Residual standard error: 1.794185
Degrees of freedom: 400 total; 396 residual
```

The summary section contains two sections for modeling variance. The first estimates the multiplier for the `“m”`

group to be 2.437, which is very close to the true value of 2.5. The exponential parameter is estimated to be 1.54, extremely close to the true value of 1.5 we used when generating the data. Finally the residual standard error is estimated to be about 1.79, close to the true value of 2. Calling `intervals(vm6)`

shows very tight confidence intervals. The `A`

and `B`

prefixes in the `varStruct`

element are just labels for the two different variance models.

```
int <- intervals(vm6)
int$varStruct
```

```
lower est. upper
A.m 2.119874 2.437046 2.801673
B.expon 1.419366 1.540608 1.661850
attr(,"label")
[1] "Variance function:"
```

```
int$sigma
```

```
lower est. upper
1.380008 1.794185 2.332669
attr(,"label")
[1] "Residual standard error:"
```

Unfortunately, due to the large exponential variability, the estimates of the model coefficients are woefully bad.

## What’s the point?

So why did we simulate all this fake data and then attempt to recover “true” values? What good is that? Fair questions. The answer is it helps us understand what assumptions we’re making when we specify and fit a model. When we specify and fit the following model…

```
m <- lm(y ~ x1 + x2)
```

…we’re saying we think `y`

is approximately equal to a weighted sum of `x1`

and `x2`

(plus an intercept), with errors being random draws from a normal distribution with mean of 0 and some *fixed* standard deviation. Likewise, if we specify and fit the following model…

```
m <- gls(y ~ x1 + x2, data = d, weights = varPower(form = ~x1))
```

…we’re saying we think `y`

is approximately equal to a weighted sum of `x1`

and `x2`

(plus an intercept), with errors being random draws from a normal distribution with mean of 0 and a standard deviation that *grows as a multiple of x1 raised by some power*.

If we can *simulate* data suitable for those models, then we have a better understanding of the *assumptions* we’re making when we use those models. Hopefully you now have a better understanding of what you can do to model variance using the nlme package.

## References

- Pinheiro, J., & Bates, D. (2000).
*Mixed-effects models in S and S-PLUS*. Springer. https://doi.org/10.1007/b98882 - Pinheiro, J., Bates, D., & R Core Team. (2020).
*nlme: Linear and nonlinear mixed effects models*(Version 3.1-162). https://cran.r-project.org/package=nlme - R Core Team. (2020).
*R: A language and environment for statistical computing*. R Foundation for Statistical Computing. https://www.r-project.org/ - Wickham, H. (2016).
*ggplot2: Elegant graphics for data analysis*. Springer-Verlag.

*Clay Ford
Statistical Research Consultant
University of Virginia Library
April 07, 2020*

- The nlme package is perhaps better known for its
`lme()`

function, which is used to fit mixed-effect models (i.e., models with both fixed and random effects). This blog post demonstrates variance functions using`gls()`

, which does not fit random effects. However, everything we present in this blog post can be used with`lme()`

. ↩

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**

View the entire collection of UVA Library StatLab articles, or learn how to cite.