Understanding Ordered Factors in a Linear Model

Consider the following data from the text Design and Analysis of Experiments, 7th ed. (Montgomery, 2009, Table 3.1). It has two variables: power and rate. power is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. rate is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate. Below we enter the small data set by hand into R and create a quick plot. The plot suggests higher power settings lead to higher etch rates.


# Table 3.1
power <- rep(c(160, 180, 200, 220), 5)
rate <- c(575, 565, 600, 725,
          542, 593, 651, 700, 
          530, 590, 610, 715,
          539, 579, 637, 685,
          570, 610, 629, 710)
etch <- data.frame(rate, power)
stripchart(rate ~ power, data = etch, pch = 1,
           vertical = TRUE, xlab = "power")

[Figure: stripchart of etch rate versus power setting]

A formal statistical analysis of this data requires a linear model. To perform the analysis in R, we need to define the power variable as a factor. This tells R that power is a categorical variable and to treat it as such in a modeling procedure. But in addition, we may want to further specify that the levels of power are ordered. Power level 160 is less than 180, and 180 is less than 200, and so on. This is additional information we may want to account for. We can define power as an ordered factor in R using the ordered() function. We do that below and save the ordered factor version as powerF. Notice that calling head() to view the first 6 values of powerF shows us the ordering of the levels: 160 < 180 < 200 < 220.


etch$powerF <- ordered(etch$power)
head(etch$powerF)


  [1] 160 180 200 220 160 180
  Levels: 160 < 180 < 200 < 220

It’s important to note that R is storing powerF internally as integers, which we can see by calling unclass() or as.integer() on the variable. (We’ll make use of this fact in the next plot we create.)


unclass(etch$powerF)


   [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4


attr(etch$powerF, "levels")


   [1] "160" "180" "200" "220"

Now let’s formally analyze the data using the lm() function in R. This is equivalent to performing a one-way ANOVA. Below we model rate as a function of the ordered factor powerF and pipe the result into the summary() function using base R’s native forward pipe operator, |>, introduced in version 4.1.0.


lm(rate ~ powerF, data = etch) |> summary()


  Call:
  lm(formula = rate ~ powerF, data = etch)
  
  Residuals:
     Min     1Q Median     3Q    Max 
   -25.4  -13.0    2.8   13.2   25.6 
  
  Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
  (Intercept)  617.750      4.085 151.234  < 2e-16 ***
  powerF.L     113.011      8.169  13.833 2.56e-10 ***
  powerF.Q      22.700      8.169   2.779   0.0134 *  
  powerF.C       9.347      8.169   1.144   0.2694    
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 18.27 on 16 degrees of freedom
  Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
  F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09

At the bottom of the output is the familiar ANOVA test result. The F-statistic is large and the p-value is tiny, which provides evidence that the mean rates between the four power levels are different. (We have statistically sanctified what we saw in the plot.) In addition, the adjusted R-squared suggests power explains about 91% of the variability we’re seeing in the etch rates. But what about the rest? What are these mysterious “powerF.L”, “powerF.Q”, and “powerF.C” labels? What do their estimates mean? What are the hypothesis tests associated with them?

The quick answers to these questions are that the L, Q, and C stand for linear, quadratic, and cubic; that the coefficients don’t have a ready interpretation; and that each hypothesis is a test for trend.

The “powerF.L” entry is a test for linear trend. The null is no linear trend (i.e., a flat line). The large t-value provides evidence against this. Again, this is not a surprise. Looking at the plot above, we can visualize a straight line over the points.

The “powerF.Q” entry is a test for quadratic trend. The null is no quadratic trend (i.e., a straight line). The t-value of 2.8 provides some evidence against this. This is less obvious in the plot, though we can perhaps imagine a slight curve superimposed over the data.

The “powerF.C” entry is a test for cubic trend. The null is no cubic trend (i.e., any trend is at most quadratic). The t-value of 1.1 is small, and the p-value is quite large. There doesn’t appear to be any cubic trend in the data.
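
Another way to see these three tests, sketched below with base R’s aov() function, is to partition the 3-degree-of-freedom power effect into single-degree-of-freedom pieces using the split argument of summary(). (The “linear”, “quadratic”, and “cubic” labels are names we supply.) The resulting F statistics are the squares of the t values above; for example, 13.833^2 is about 191.4.


# Partition the 3-df powerF effect into single-df trend tests;
# the F statistics equal the squared t values from the lm() summary
aov(rate ~ powerF, data = etch) |> 
  summary(split = list(powerF = list(linear = 1, 
                                     quadratic = 2, 
                                     cubic = 3)))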

We can visualize these trends using ggplot(). Below we call geom_smooth() three times to plot linear, quadratic and cubic lines using lm(). Notice we have to unclass() the powerF variable to make this work. Defining the colors as “1”, “2”, and “3” is an easy way to force ggplot() to create a legend and place the labels in that order. We notice there isn’t much difference between the fitted quadratic and cubic lines, visually demonstrating why the cubic trend test was not “significant.”


library(ggplot2)
ggplot(etch) +
  aes(x = powerF, y = rate) +
  geom_point(shape = 1) +
  geom_smooth(aes(x = unclass(powerF), color = "1"), 
              formula = y ~ x, 
              method = lm, se = FALSE) +
  geom_smooth(aes(x = unclass(powerF), color = "2"), 
              formula = y ~ poly(x, 2), 
              method = lm, se = FALSE) +
  geom_smooth(aes(x = unclass(powerF), color = "3"), 
              formula = y ~ poly(x, 3), 
              method = lm, se = FALSE) +
  scale_color_discrete("Trend", labels = c("linear", "quadratic", "cubic")) +
  theme_classic()

[Figure: rate versus power with fitted linear, quadratic, and cubic trend lines]

If you’re thinking these tests for trend look a lot like a polynomial model, you’re right. That’s basically what it is. In fact, we get identical test results (i.e., the same test statistics and p-values) if we simply fit a 3-degree polynomial model to the original numeric power variable.


lm(rate ~ poly(power, 3), data = etch) |> summary()


  Call:
  lm(formula = rate ~ poly(power, 3), data = etch)
  
  Residuals:
     Min     1Q Median     3Q    Max 
   -25.4  -13.0    2.8   13.2   25.6 
  
  Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
  (Intercept)      617.750      4.085 151.234  < 2e-16 ***
  poly(power, 3)1  252.700     18.267  13.833 2.56e-10 ***
  poly(power, 3)2   50.759     18.267   2.779   0.0134 *  
  poly(power, 3)3   20.900     18.267   1.144   0.2694    
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 18.27 on 16 degrees of freedom
  Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
  F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09

We also get the same test results if we unclass() the ordered factor version of power and fit a 3-degree polynomial model.


lm(rate ~ poly(unclass(powerF), 3), data = etch) |> summary()


  Call:
  lm(formula = rate ~ poly(unclass(powerF), 3), data = etch)
  
  Residuals:
     Min     1Q Median     3Q    Max 
   -25.4  -13.0    2.8   13.2   25.6 
  
  Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
  (Intercept)                617.750      4.085 151.234  < 2e-16 ***
  poly(unclass(powerF), 3)1  252.700     18.267  13.833 2.56e-10 ***
  poly(unclass(powerF), 3)2   50.759     18.267   2.779   0.0134 *  
  poly(unclass(powerF), 3)3   20.900     18.267   1.144   0.2694    
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 18.27 on 16 degrees of freedom
  Multiple R-squared:  0.9261, Adjusted R-squared:  0.9122 
  F-statistic:  66.8 on 3 and 16 DF,  p-value: 2.883e-09

Did you notice the coefficients and standard errors of these two models differ from the first model with the ordered factor? That’s because the ordered factor model uses a contrast. A contrast is a matrix that transforms the 0/1 dummy variables of a k-level factor into k - 1 columns, producing a full-rank model matrix whose coefficients can be estimated. The default contrast for ordered factors in R is the polynomial contrast. We can see the contrast R uses by calling the contr.poly() function. Simply tell it how many levels your categorical variable has. In our case we have 4.


contr.poly(4)


               .L   .Q         .C
  [1,] -0.6708204  0.5 -0.2236068
  [2,] -0.2236068 -0.5  0.6708204
  [3,]  0.2236068 -0.5 -0.6708204
  [4,]  0.6708204  0.5  0.2236068
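
These columns are not arbitrary: they are orthonormal, meaning each column has length 1 and the columns are mutually orthogonal. (With balanced data like ours, this is why the three trend estimates can be tested separately without affecting one another.) A quick check:


# t(C) %*% C returns the identity matrix (up to rounding),
# confirming the contrast columns are orthonormal
zapsmall(crossprod(contr.poly(4)))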

That’s the contrast used to convert dummy variables of an ordered factor into a model matrix. Let’s demonstrate. Below we generate a 4-column matrix of dummy variables for the power variable. There is one row per observation and one column per level of power. A 1 in a row identifies membership in one of the four power categories.


Xpower <- sapply(c(160, 180, 200, 220), 
                 function(x) ifelse(etch$power == x, 1, 0))
colnames(Xpower) <- c("160", "180", "200", "220")
head(Xpower)


       160 180 200 220
  [1,]   1   0   0   0
  [2,]   0   1   0   0
  [3,]   0   0   1   0
  [4,]   0   0   0   1
  [5,]   1   0   0   0
  [6,]   0   1   0   0

Now we multiply the dummy variable matrix by the polynomial contrast to obtain the columns that enter the model matrix.


Xpower %*% contr.poly(4)


                .L   .Q         .C
   [1,] -0.6708204  0.5 -0.2236068
   [2,] -0.2236068 -0.5  0.6708204
   [3,]  0.2236068 -0.5 -0.6708204
   [4,]  0.6708204  0.5  0.2236068
   [5,] -0.6708204  0.5 -0.2236068
   [6,] -0.2236068 -0.5  0.6708204
   [7,]  0.2236068 -0.5 -0.6708204
   [8,]  0.6708204  0.5  0.2236068
   [9,] -0.6708204  0.5 -0.2236068
  [10,] -0.2236068 -0.5  0.6708204
  [11,]  0.2236068 -0.5 -0.6708204
  [12,]  0.6708204  0.5  0.2236068
  [13,] -0.6708204  0.5 -0.2236068
  [14,] -0.2236068 -0.5  0.6708204
  [15,]  0.2236068 -0.5 -0.6708204
  [16,]  0.6708204  0.5  0.2236068
  [17,] -0.6708204  0.5 -0.2236068
  [18,] -0.2236068 -0.5  0.6708204
  [19,]  0.2236068 -0.5 -0.6708204
  [20,]  0.6708204  0.5  0.2236068

Notice this is the same model.matrix() that is used when we model the ordered factor with lm(). (The column of ones is for the intercept.)


lm(rate ~ powerF, data = etch) |> model.matrix()


     (Intercept)   powerF.L powerF.Q   powerF.C
  1            1 -0.6708204      0.5 -0.2236068
  2            1 -0.2236068     -0.5  0.6708204
  3            1  0.2236068     -0.5 -0.6708204
  4            1  0.6708204      0.5  0.2236068
  5            1 -0.6708204      0.5 -0.2236068
  6            1 -0.2236068     -0.5  0.6708204
  7            1  0.2236068     -0.5 -0.6708204
  8            1  0.6708204      0.5  0.2236068
  9            1 -0.6708204      0.5 -0.2236068
  10           1 -0.2236068     -0.5  0.6708204
  11           1  0.2236068     -0.5 -0.6708204
  12           1  0.6708204      0.5  0.2236068
  13           1 -0.6708204      0.5 -0.2236068
  14           1 -0.2236068     -0.5  0.6708204
  15           1  0.2236068     -0.5 -0.6708204
  16           1  0.6708204      0.5  0.2236068
  17           1 -0.6708204      0.5 -0.2236068
  18           1 -0.2236068     -0.5  0.6708204
  19           1  0.2236068     -0.5 -0.6708204
  20           1  0.6708204      0.5  0.2236068
  attr(,"assign")
  [1] 0 1 1 1
  attr(,"contrasts")
  attr(,"contrasts")$powerF
  [1] "contr.poly"

The latter two models that explicitly fit a 3-degree polynomial did not use a contrast. Instead, poly() generated its own set of orthogonal polynomial columns from the numeric power values. (By default poly() returns orthogonal polynomials, not the raw squared and cubed values; set raw = TRUE if you want those.) Because our power levels are equally spaced and balanced, these columns are the same as the contrast columns up to scaling, which is why the coefficients and standard errors change while the t-statistics and p-values do not.
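
We can verify the rescaling directly. Each contrast column repeats 5 times across the 20 rows, so it has length sqrt(5), while poly() scales its columns to length 1. Dividing by sqrt(5) makes the two sets of columns match, and the coefficients scale accordingly (113.011 * sqrt(5) = 252.700). A quick sketch, reusing the Xpower matrix from above:


# poly()'s orthogonal columns equal the contrast columns
# rescaled to unit length over all 20 rows
P <- poly(etch$power, 3)
max(abs(P - Xpower %*% contr.poly(4) / sqrt(5)))  # effectively zero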

Is there an advantage to using one over the other? Maybe we should have just used polynomial regression instead of creating an ordered factor. Fox and Weisberg (2019) tell us contr.poly() is appropriate when the levels of your ordered factor are equally spaced and balanced. That is the case for the power variable. Each level of power is separated by 20 units, and we have five observations for each level. If we have an ordered factor that does not have equal spacing and/or is not balanced, we’re better off performing polynomial regression via the poly() function.
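
It’s also worth knowing that contr.poly() accepts a scores argument for the unequal-spacing case, though it does nothing to address imbalance. Below is a hypothetical sketch in which a made-up 240 setting stands in for 220:


# Polynomial contrast for four unevenly spaced levels (hypothetical values)
contr.poly(4, scores = c(160, 180, 200, 240))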

It’s worth noting that even though we have power saved as an ordered factor and modeled with a polynomial contrast, we can still perform follow-up analyses usually associated with standard unordered categorical variables. For example, using the emmeans package we can make pairwise comparisons.


library(emmeans)
lm(rate ~ powerF, data = etch) |> 
  emmeans(specs = "powerF") |>
  pairs()


   contrast  estimate   SE df t.ratio p.value
   160 - 180    -36.2 11.6 16  -3.133 0.0294 
   160 - 200    -74.2 11.6 16  -6.422 <.0001 
   160 - 220   -155.8 11.6 16 -13.485 <.0001 
   180 - 200    -38.0 11.6 16  -3.289 0.0216 
   180 - 220   -119.6 11.6 16 -10.352 <.0001 
   200 - 220    -81.6 11.6 16  -7.063 <.0001 
  
  P value adjustment: tukey method for comparing a family of 4 estimates

Hopefully you now have a better understanding of what’s happening in R when you set a factor as ordered and use it in a linear model.


References

Fox, J., & Weisberg, S. (2019). An R Companion to Applied Regression (3rd ed.). Sage.

Montgomery, D. C. (2009). Design and Analysis of Experiments (7th ed.). Wiley.

Clay Ford
Statistical Research Consultant
University of Virginia Library
July 01, 2021


For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.