Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time, perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for auto accidents?

What if we learned one intersection had 200,000 automobiles pass through it during the observation period while the other only had 10,000? That changes our interpretation of the counts quite a bit. They no longer seem “equal.” If we divide the counts by the number of automobiles, we get \(25/200,000 = 0.000125\) and \(25/10,000 = 0.0025\), respectively. The latter intersection has a higher *rate* of accidents. Taking the reciprocal of each rate tells us that the first intersection has about one accident for every \(1/0.000125 = 8000\) cars that come through the intersection, while the second has about one accident for every \(1/0.0025 = 400\) cars. Clearly the second intersection appears more dangerous. Therefore it seems we should *model the rate of auto accidents* instead of the count.
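
This arithmetic is easy to verify in R. The sketch below, using the hypothetical counts and traffic volumes from above, computes each intersection's accident rate and its reciprocal (cars per accident):

```r
# accident counts and traffic volumes for the two hypothetical intersections
accidents <- c(25, 25)
cars <- c(200000, 10000)

# accidents per car
rate <- accidents / cars
rate

# cars per accident (the reciprocal of the rate)
1 / rate
```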

Let’s learn how to do basic rate modeling using R. This is sometimes referred to as Poisson regression for rates. To begin we will load some data that come from the text *Categorical Data Analysis* (2nd ed.) by Alan Agresti. The data, concerning patient survival after heart valve replacement, come from a paper by Laird and Olivier (1981). A sample of 109 patients were followed for a period of time after having their heart valve replaced. By the end of the observation period, 21 of the patients died. Of interest is whether age or type of heart valve had any association with survival. Age was dichotomized to “under 55” and “55 and over.” There were two types of heart valves: aortic and mitral. We can enter the data as follows:

```
heartvalve <- c("aortic", "aortic", "mitral", "mitral")
age <- c("<55", "55+", "<55", "55+")
deaths <- c(4, 7, 1, 9)
d <- data.frame(heartvalve, age, deaths)
d
```

```
  heartvalve age deaths
1     aortic <55      4
2     aortic 55+      7
3     mitral <55      1
4     mitral 55+      9
```

Like our intersection example above, these counts are not necessarily comparable. Not every subject was under observation for the same amount of time. Some may have been observed for 90 months while others were only observed for 3 months. Therefore we should model the rate of death instead of the count. To do this we need *exposure* data, or time at risk. Let’s add this to our data frame. Below we add a column called `risktime` that represents months of observation.

```
d$risktime <- c(1259, 1417, 2082, 1647)
d
```

```
  heartvalve age deaths risktime
1     aortic <55      4     1259
2     aortic 55+      7     1417
3     mitral <55      1     2082
4     mitral 55+      9     1647
```

We see, for example, that 4 deaths occurred in 1259 months of observation. These months include time contributed by subjects who were observed and never died. (Recall the sample size is 109.) To calculate the rate of death, we divide `deaths` by `risktime`:

```
d$rate <- round(d$deaths/d$risktime, 4)
d
```

```
  heartvalve age deaths risktime   rate
1     aortic <55      4     1259 0.0032
2     aortic 55+      7     1417 0.0049
3     mitral <55      1     2082 0.0005
4     mitral 55+      9     1647 0.0055
```

There appears to be a higher rate of death for the older age group.
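
Before fitting a model, we can sanity-check this impression with some exploratory arithmetic, pooling the deaths and risk time over valve type. (This is just a rough check on the raw data, not a substitute for a model.)

```r
# the four cells of the data entered above
deaths <- c(4, 7, 1, 9)
risktime <- c(1259, 1417, 2082, 1647)
age <- c("<55", "55+", "<55", "55+")

# pooled death rate per month of observation for each age group
pooled <- tapply(deaths, age, sum) / tapply(risktime, age, sum)
round(pooled, 4)

# the older group's rate relative to the younger group's
pooled["55+"] / pooled["<55"]   # roughly 3.5
```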

Let’s now model the rate of death as a function of age and heart valve. What predictive effects, if any, do these factors have on the rate of death? To do this we use the `glm()` function with `family = poisson` as follows:

```
m <- glm(deaths ~ age + heartvalve + offset(log(risktime)),
         data = d,
         family = poisson)
```

The first part of the formula, `deaths ~ age + heartvalve`, is pretty straightforward. It says to model `deaths` as a function of `age` and `heartvalve` in an additive fashion. That means the effect of `age` does not depend on `heartvalve` and vice versa. But what about the last part: `+ offset(log(risktime))`?

Clearly that’s how we include `risktime` so we can model the rate. But why do we take the log and wrap it in something called `offset()`? To understand this, let’s first define our model as a Poisson count model (as opposed to a rate model).

\[\text{deaths} \sim \text{Poisson}(\text{exp}(\beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}))\]

This says the observed *count* of "deaths" is a random draw from a Poisson distribution with a mean of \(\text{exp}(\beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve})\). Notice the mean is *conditional* on “age” and “heartvalve”: the expected death count depends on a weighted sum of "age" and "heartvalve" along with a fixed constant. This is the model we have *proposed*. (It may not be a good model.) The reason we exponentiate the linear predictor with `exp()` is to ensure the mean is *positive*. A Poisson distribution generates counts and requires a positive mean. (Recall that \(e\) raised to any number, positive or negative, is positive.)
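
A one-line check of that last claim:

```r
# exp() maps any real number, however extreme, to a positive value
all(exp(c(-100, -1, 0, 1, 100)) > 0)
```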

We can rewrite our model in the more traditional linear model format like so:

\[\text{deaths} = \text{exp}(\beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve})\] \[\text{log}(\text{deaths}) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}\]

Notice that taking the log of both sides produces the familiar linear model on the right-hand side. We refer to the log as the *link* function, because it’s the link to the linear model.
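
We can see the link relationship on the fitted model itself: taking the log of the fitted means recovers the linear predictor, which is what `predict()` returns by default for a `glm` object. This sketch simply refits the model from above and compares the two:

```r
# re-enter the data and refit the model from above
heartvalve <- c("aortic", "aortic", "mitral", "mitral")
age <- c("<55", "55+", "<55", "55+")
deaths <- c(4, 7, 1, 9)
risktime <- c(1259, 1417, 2082, 1647)
d <- data.frame(heartvalve, age, deaths, risktime)
m <- glm(deaths ~ age + heartvalve + offset(log(risktime)),
         data = d, family = poisson)

# log of the expected counts equals the linear predictor (the "link" scale)
all.equal(log(fitted(m)), predict(m, type = "link"))
```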

Of course we’re not modeling counts, but rates, so let’s restate our model incorporating "risktime".

\[\text{log}\left( \frac{\text{deaths}}{\text{risktime}} \right) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}\]

Recalling the rules of logs, we can rewrite the left-hand side.

\[\text{log}(\text{deaths}) - \text{log}(\text{risktime}) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve}\]

And then we can add \(\text{log}(\text{risktime})\) to both sides.

\[\text{log}(\text{deaths}) = \beta_0 + \beta_1\text{age} + \beta_2\text{heartvalve} + \text{log}(\text{risktime})\]

And that is what we modeled above. Notice there is no \(\beta\) coefficient for \(\text{log}(\text{risktime})\). We simply add it to the right-hand side. This is called an *offset*. We define an offset in `glm()` using the `offset()` function. Hence this is why we added `+ offset(log(risktime))` to our model above.
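
Incidentally, `glm()` also has a separate `offset` argument; supplying `log(risktime)` there is equivalent to writing `offset(log(risktime))` in the formula. A quick sketch fitting the model both ways confirms the coefficients agree:

```r
heartvalve <- c("aortic", "aortic", "mitral", "mitral")
age <- c("<55", "55+", "<55", "55+")
deaths <- c(4, 7, 1, 9)
risktime <- c(1259, 1417, 2082, 1647)
d <- data.frame(heartvalve, age, deaths, risktime)

# offset in the formula (as in the article)
m1 <- glm(deaths ~ age + heartvalve + offset(log(risktime)),
          data = d, family = poisson)
# offset supplied via the offset argument
m2 <- glm(deaths ~ age + heartvalve, offset = log(risktime),
          data = d, family = poisson)

all.equal(coef(m1), coef(m2))
```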

Let’s take a look at the summary of our model.

```
summary(m)
```

```
Call:
glm(formula = deaths ~ age + heartvalve + offset(log(risktime)),
    family = poisson, data = d)

Deviance Residuals:
     1       2       3       4
 1.025  -0.602  -1.197   0.613

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -6.3121     0.5066 -12.460   <2e-16 ***
age55+             1.2209     0.5138   2.376   0.0175 *
heartvalvemitral  -0.3299     0.4382  -0.753   0.4515
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 10.8405  on 3  degrees of freedom
Residual deviance:  3.2225  on 1  degrees of freedom
AIC: 22.349

Number of Fisher Scoring iterations: 5
```

In the `Estimate` column we see the estimated values of our model coefficients (i.e., the \(\beta\) values). The coefficient for `age55+`, 1.221, is the predictive effect of being 55 or older. The output is on the log scale. Exponentiating it returns the multiplicative effect of age.

```
exp(coef(m)[2])
```

```
age55+
3.390401
```

The estimated rate of death for subjects 55 or older is about 3.4 times that of subjects under 55. The confidence interval for this coefficient can be calculated as follows:

```
exp(confint(m))[2,]
```

```
Waiting for profiling to be done...
2.5 % 97.5 %
1.323921 10.389913
```

Pretty wide, but it doesn’t overlap 1. (Remember we exponentiated. \(e^0 = 1\)) It appears the death rate for subjects 55 and older is at least 30% higher than subjects under 55. (Multiplying by 1.3 implies an increase of 30%. 1.3 is the lower bound of the 95% confidence interval.)

What are the expected rates according to our model? If we use the `fitted()` function, we get expected counts.

```
cbind(d[c("heartvalve","age")], count_fit = fitted(m))
```

```
  heartvalve age count_fit
1     aortic <55  2.284108
2     aortic 55+  8.715892
3     mitral <55  2.715892
4     mitral 55+  7.284108
```

To get expected rates, we need to use `predict()` with `type = "response"` and a supplied offset value, and then divide the predicted values by the offset. It doesn’t matter which value we pick for the offset. It can be 1000 or 5. The expected counts will change for different offset values, but the rate is always the same.

```
# 1000
p <- predict(m, type = "response",
             newdata = data.frame(heartvalve, age, risktime = 1000))
cbind(d[c("heartvalve","age")], rate_fit = round(p/1000, 4))
```

```
  heartvalve age rate_fit
1     aortic <55   0.0018
2     aortic 55+   0.0062
3     mitral <55   0.0013
4     mitral 55+   0.0044
```

```
# or use 5
p <- predict(m, type = "response",
             newdata = data.frame(heartvalve, age, risktime = 5))
cbind(d[c("heartvalve","age")], rate_fit = round(p/5, 4))
```

```
  heartvalve age rate_fit
1     aortic <55   0.0018
2     aortic 55+   0.0062
3     mitral <55   0.0013
4     mitral 55+   0.0044
```

An expected rate of 0.0062 for subjects 55 and older with a replaced aortic heart valve corresponds to about 1 death every 161 months (1/0.0062 = 161.3).

We can visualize our model using effect plots. The ggeffects package helps us do this. First we use the `ggpredict()` function to calculate expected counts at the 4 distinct levels along with confidence intervals. As we did with the `predict()` function, we need to supply an offset value. We supply this to the `condition` argument as a vector. The `condition` argument indicates which predictors should be held constant at specific values when making predictions. Again we choose 1000 for `risktime`. Even though the output is nicely formatted, the object `eff_out` is a data frame we can use for plotting.

```
# install.packages("ggeffects")
library(ggeffects)
eff_out <- ggpredict(m, terms = c("age", "heartvalve"), condition = c(risktime = 1000))
eff_out
```

```
# Predicted values of deaths
# x = age
# heartvalve = aortic
x | Predicted | 95% CI
-------------------------------
<55 | 1.81 | [0.67, 4.90]
55+ | 6.15 | [3.29, 11.51]
# heartvalve = mitral
x | Predicted | 95% CI
------------------------------
<55 | 1.30 | [0.50, 3.41]
55+ | 4.42 | [2.25, 8.71]
```

```
# is it a data frame?
is.data.frame(eff_out)
```

```
[1] TRUE
```

```
# the column names of the data frame:
names(eff_out)
```

```
[1] "x" "predicted" "std.error" "conf.low" "conf.high" "group"
```

The `x` column name refers to `age` since we listed it first in the `terms` argument for `ggpredict()`. The `group` column refers to `heartvalve` since we listed it second. The `predicted` column is the expected count. To plot expected rates, we need to divide the `predicted`, `conf.low`, and `conf.high` columns by 1,000.

```
col <- c("predicted", "conf.low", "conf.high")
eff_out[,col] <- eff_out[,col]/1000
```

Then we can create our effect plot. For this we use the ggplot2 package. Notice we use the `position_dodge()` function to define how to “dodge” the plotted points and error bars. Smaller values of `width` bring the points and bars closer together. If we don’t do this, the points and error bars will overplot each other.

```
library(ggplot2)
pd <- position_dodge(width = 0.5)
ggplot(eff_out, aes(x = x, y = predicted, color = group)) +
  geom_point(position = pd) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = 0.1,
                position = pd) +
  labs(x = "Age", y = "Expected rate")
```

We can see the difference in age groups is much more pronounced than the difference between heart valves. In fact, the model summary above returned an inconclusive effect of `heartvalve`. We might want to drop `heartvalve` from the model. The `update()` function makes quick work of this. The formula notation `. ~ . - heartvalve` says, “Keep everything on the left and right sides except heartvalve and refit.” The resulting likelihood ratio test (LRT) tells us the larger model with `heartvalve` doesn’t appear to be much better than the model without it.

```
m2 <- update(m, . ~ . -heartvalve, data = d)
anova(m2, m, test = "LRT")
```

```
Analysis of Deviance Table

Model 1: deaths ~ age + offset(log(risktime))
Model 2: deaths ~ age + heartvalve + offset(log(risktime))
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1         2     3.7897
2         1     3.2225  1  0.56721   0.4514
```
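
The test statistic in this table is just the difference in residual deviances between the two models, referred to a chi-square distribution with 1 degree of freedom (the difference in residual df). We can reproduce the p-value by hand:

```r
# difference in residual deviance from the table above
lrt <- 3.7897 - 3.2225   # 0.5672

# upper-tail chi-square probability on 1 df
pchisq(lrt, df = 1, lower.tail = FALSE)
```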

We can then create a new effect plot with our simplified model. Notice we no longer need to worry about dodging points.

```
eff_out2 <- ggpredict(m2, terms = "age", offset = log(1000))
eff_out2[,col] <- eff_out2[,col]/1000
ggplot(eff_out2, aes(x = x, y = predicted)) +
  geom_point() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
                width = 0.1) +
  labs(x = "Age", y = "Expected rate")
```

Hopefully you have a better understanding of how rate models work and when to use them. In addition to Agresti (2002), Gelman and Hill (2007) provide a good overview of rate models.

## References

- Agresti, A. (2002). *Categorical data analysis* (2nd ed.). Wiley. Chapter 9.
- Gelman, A., & Hill, J. (2007). *Data analysis using regression and multilevel/hierarchical models*. Cambridge University Press. Chapter 6.
- Laird, N. M., & Olivier, D. (1981). Covariance analysis of censored survival data using log-linear analysis techniques. *Journal of the American Statistical Association, 76*(374), 231--240.
- Lüdecke, D. (2018). ggeffects: Tidy data frames of marginal effects from regression models. *Journal of Open Source Software, 3*(26), 772. https://doi.org/10.21105/joss.00772
- R Core Team. (2020). *R: A language and environment for statistical computing*. R Foundation for Statistical Computing.

For questions or clarifications regarding this article, contact the UVA Library StatLab: statlab@virginia.edu. View the entire collection of UVA Library StatLab articles.

Clay Ford

Statistical Research Consultant

University of Virginia Library

May 30, 2020
