In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.

One method for estimating sample sizes for categorical outcomes is Whitehead (1993). This is the method implemented in the {Hmisc} package. Let’s work the first example presented in Whitehead (page 2260).

Consider an experiment where subjects are randomly assigned to two treatment groups—control or experimental treatment—and are assessed by a doctor at the end of the trial with an ordered outcome of *very good*, *good*, *moderate*, or *poor*. To estimate a sample size for this experiment, assuming we will analyze it with a proportional odds model, we need the following information:

- **Power**: the desired probability that we correctly reject the null hypothesis of no treatment effect, assuming the experimental treatment really is efficacious.
- **Significance level**: the type I error we're willing to assume.
- **Marginal probabilities**: the probability that a subject will be in each response level, averaged over the two treatment groups.
- **Reference improvement**: the odds ratio that expresses the improvement of the experimental treatment over control.
- **Fraction**: the fraction of subjects allocated to the control group.

The power and significance level are the easiest to specify since we get to pick them. This example uses a power of 0.9 and a significance level of 0.05.

The marginal probabilities are the most difficult to specify. These are *projected* probabilities based on previous experience and/or pilot studies. The Whitehead example proposes the following probabilities for the control group, which we enter below as a named vector:

```
control <- c("very good" = 0.2, "good" = 0.5, "moderate" = 0.2, "poor" = 0.1)
control
```

```
very good good moderate poor
0.2 0.5 0.2 0.1
```

Taking the cumulative sum of this vector yields projected *cumulative probabilities* of having a particular outcome or *better*. For example, the probability of being assessed as good or better is 0.7. (It is certain that a subject will be assessed as poor or better, because the subject is guaranteed to be placed into one of the categories.)

```
C_control <- cumsum(control)
C_control
```

```
very good good moderate poor
0.2 0.7 0.9 1.0
```

Now let’s say it’s reasonable and achievable that the experimental treatment will increase the probability of good or better from 0.7 to 0.85. This produces an odds ratio of about 2.42, which we calculate below using the *cross-product rule*.

```
OR <- (0.85*(1-0.7))/(0.7*(1-0.85))
OR
```

```
[1] 2.428571
```

Since we plan to analyze this data as if the odds ratio is constant when comparing cumulative probabilities (i.e., the proportional odds assumption), we can use this odds ratio to derive the other cumulative probabilities for the experimental treatment group. Since the formula for the odds ratio is

\[OR = \frac{p_e(1 - p_c)}{p_c(1 - p_e)} \]

where \(p_e\) is the cumulative probability for the experimental treatment group and \(p_c\) is the cumulative probability for the control group, we can solve for \(p_e\) and obtain

\[p_e = \frac{ORp_c}{1 - p_c + ORp_c} \]

We can code this as a function and apply it to the projected cumulative control probabilities to obtain projected cumulative treatment probabilities.

```
f <- function(or, pc){
  (or * pc)/(1 - pc + or * pc)
}
C_experimental <- sapply(C_control, f, or = OR)
C_experimental
```

```
very good good moderate poor
0.3777778 0.8500000 0.9562500 1.0000000
```

And now we can derive the category-specific probabilities using the `diff()` function. The `diff()` function, when used with its default settings, calculates the lagged differences of adjacent elements in a vector. For example, `diff(c(5,8,4))` returns 3 and -4 (i.e., 8 - 5 = 3 and 4 - 8 = -4). Notice we need to prepend 0 to the vector to ensure `diff()` returns a vector the same length as `C_experimental`.

```
treatment <- diff(c(0,C_experimental))
treatment
```

```
very good good moderate poor
0.3777778 0.4722222 0.1062500 0.0437500
```

Let’s look at both sets of probabilities in a table:

```
probs <- rbind(control, treatment)
probs
```

```
very good good moderate poor
control 0.2000000 0.5000000 0.20000 0.10000
treatment 0.3777778 0.4722222 0.10625 0.04375
```

Recall that to calculate the sample size we need marginal probabilities. The marginal probabilities of this table are the column averages.

```
marg_probs <- apply(probs, 2, mean)
marg_probs
```

```
very good good moderate poor
0.2888889 0.4861111 0.1531250 0.0718750
```

The last item to consider is the fraction of subjects allocated to the control group. The easiest option is 0.5. In other words, assign an equal number of subjects to each group. This is the allocation specified in the example.

Now we have everything we need to estimate the sample size for this experiment.

- **Power**: 0.9
- **Significance level**: 0.05
- **Marginal probabilities**: 0.2888889, 0.4861111, 0.153125, 0.071875
- **Reference improvement**: 2.4285714
- **Fraction**: 0.5

We’ll use the `posamsize()` function in the {Hmisc} package to make the calculations. If you don’t have this package, you’ll need to install it by running `install.packages("Hmisc")`. The `posamsize()` function has an argument for each of the quantities listed above. The `p` argument is for the marginal probabilities. The `odds.ratio` argument is for the reference improvement.

```
library(Hmisc)
posamsize(p = marg_probs, odds.ratio = OR, fraction = 0.5,
          alpha = 0.05, power = 0.9)
```

```
Total sample size: 186.9
Efficiency of design compared with continuous response: 0.857
```

The result is 186.9, which we round up to 187. Assuming our projected probabilities are correct, we need 187/2 = 93.5, or 94 subjects in each group. The reported *efficiency* compares our design using an ordered categorical response with four levels to a design using a continuous response. In this case, an ordered outcome with four levels is about 86% efficient relative to a continuous outcome.
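The rounding above can be sketched in a couple of lines, starting from the 186.9 total that `posamsize()` reported:

```
# round the total sample size up, then split evenly between the two groups
n_total <- ceiling(186.9)            # 187
n_per_group <- ceiling(n_total / 2)  # 94
```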

If we are constrained to a sample size and need to estimate the power of the experiment, we can use the `popower()` function, also in the {Hmisc} package. Below we confirm that 94 subjects in each group yields an estimated power of 0.9 using our projected probabilities.

```
popower(p = marg_probs, odds.ratio = OR, n1 = 94, n2 = 94,
        alpha = 0.05)
```

```
Power: 0.899
Efficiency of design compared with continuous response: 0.857
Approximate standard error of log odds ratio: 0.2744
```

It’s not always reasonable to assume that we can assign equal numbers to each group. For example, due to excessive cost or limited supply of the experimental treatment, we may need to allocate more subjects to the control group. Below we set `fraction = 2/3` to indicate that 2/3 of subjects will be assigned to the control group, implying the control group will have twice as many subjects as the treatment group. This means we have an *allocation ratio* of 2, or A = 2.
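In general, an allocation ratio of A corresponds to a control fraction of A/(A + 1). A quick sketch of the conversion (the helper name is ours, not part of {Hmisc}):

```
# convert an allocation ratio (control:treatment) to the control fraction
ratio_to_fraction <- function(A) A / (A + 1)
ratio_to_fraction(2)  # 2/3
ratio_to_fraction(3)  # 3/4
```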

```
posamsize(p = marg_probs, odds.ratio = OR, fraction = 2/3,
          alpha = 0.05, power = 0.9)
```

```
Total sample size: 210.2
Efficiency of design compared with continuous response: 0.857
```

Notice the sample size jumps to 210.2, which we round up to 211. The number of subjects assigned to the treatment group is \(211 \times \frac{1}{A + 1} = 211 \times \frac{1}{3} \approx 71\).

Setting the fraction to 3/4 so that the control group has 3 times as many subjects as the treatment group increases the sample size even further. This is an allocation ratio of A = 3.

```
posamsize(p = marg_probs, odds.ratio = OR, fraction = 3/4,
          alpha = 0.05, power = 0.9)
```

```
Total sample size: 249.2
Efficiency of design compared with continuous response: 0.857
```

Rounding up to 250, the number of subjects assigned to the treatment group is \(250 \times \frac{1}{A + 1} = 250 \times \frac{1}{4} \approx 63\).

Whitehead shows that as the allocation ratio increases, the number of subjects in the treatment group approaches a lower limit while the number of subjects in the control group continues to grow. Because of this, Whitehead states that “allocation ratios in excess of 4 would seldom be justified” (page 2264).
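We can see the treatment group shrinking using the two rounded-up totals computed above, 211 and 250, since the treatment group receives 1/(A + 1) of the total:

```
# treatment-group sizes for the A = 2 and A = 3 designs above
totals <- c(211, 250)
A <- c(2, 3)
ceiling(totals / (A + 1))  # 71 and 63 subjects, respectively
```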

Often we’ll want to include multiple predictors in our proportional odds models. For example, in addition to a treatment group we may want to adjust for sex, body weight, blood pressure, or other prognostic factors. This complicates power and sample size calculations and exceeds the functionality of the {Hmisc} functions. One way to proceed is using simulation. This is where we simulate many data sets of a certain size from a proportional odds model with hypothesized parameters, fit the correct model to the data sets, and calculate the proportion of models that correctly reject the null of no effect.

Let’s demonstrate using the single treatment predictor first to ensure it agrees with the calculations from above.

The proportional odds model with a single predictor is expressed as

\[\text{logit}(P(Y \le j)) = \alpha_j - \beta_1 x \]

The \(\alpha_j\) represent the intercepts. In this case, these are the projected cumulative probabilities of the control group on the logit, or log odds, scale. Since the cumulative probability of the “poor” group is 1, we only have three intercepts. In general, if you have *J* levels of an ordered categorical variable, you will have *J-1* intercepts.

The \(\beta_1\) is the treatment effect. This is the odds ratio we calculated above, converted to the log odds scale. It’s important to notice the minus sign in front of the coefficient. It implies positive coefficients become negative and produce *smaller* cumulative probabilities, while negative coefficients become positive and produce *larger* cumulative probabilities. We hypothesized that our treatment effect should produce larger cumulative probabilities; therefore, we need to express the log odds ratio for the treatment effect as a negative value.

The left-hand side is the cumulative probability of being at or below level *j*. For example, since our levels are ordered *very good*, *good*, *moderate*, and *poor*, \(P(Y \le 2)\) corresponds to the probability of *good* or *very good*.
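For instance, \(P(Y \le 2)\) is just the sum of the first two category probabilities. (The `control` vector defined earlier is repeated here so the snippet stands alone.)

```
control <- c("very good" = 0.2, "good" = 0.5, "moderate" = 0.2, "poor" = 0.1)
sum(control[1:2])  # P(Y <= 2) = 0.7, i.e., good or better
```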

Let’s assign our alpha and beta values. Notice we use the `qlogis()` function to convert the cumulative probabilities to log odds and the `log()` function to convert the odds ratio to log odds. Also notice we only want the first three levels of our cumulative probabilities.

```
J <- 4
alpha <- qlogis(C_control[1:(J-1)])
beta_1 <- -log(OR) # make negative
```

Our predictor, treatment level, takes two values: 0 for the control group and 1 for the treatment group. We can plug 0 or 1 into our model and get the original cumulative probabilities after we take the inverse logit using `plogis()`. Notice how the cumulative probabilities *increase* for the treated group.

```
# control
plogis(alpha - beta_1 * 0)
```

```
very good good moderate
0.2 0.7 0.9
```

```
# treated
plogis(alpha - beta_1 * 1)
```

```
very good good moderate
0.3777778 0.8500000 0.9562500
```

Once we have the cumulative probabilities, we need to derive the category-specific probabilities. To do that we need to add the poor group, which has a cumulative probability of 1, and then take the differences. Once again we need to prepend 0 to the cumulative probabilities to ensure we get a vector of length four.

```
tmp <- c(plogis(alpha - beta_1 * 0), "poor" = 1)
diff(c(0, tmp))
```

```
very good good moderate poor
0.2 0.5 0.2 0.1
```

We need to do this for all 188 subjects (94 in each group). Below we generate a treatment variable for all subjects, apply the model to each subject, convert the result to a matrix using `do.call()`, add the cumulative probability for the *poor* group (which is 1), and then undo the cumulative probabilities to get the category-specific probabilities. It’s more work than necessary in a simple case like this, but it will come in handy when we include multiple predictors.

```
trt <- rep(0:1, each = 94)
X <- lapply(trt, function(x) plogis(alpha - beta_1 * x)) |>
  do.call("rbind", args = _) |>
  cbind("poor" = 1) |>
  apply(1, function(x) diff(c(0, x))) |>
  t() # transpose the matrix
head(X)
```

```
very good good moderate poor
[1,] 0.2 0.5 0.2 0.1
[2,] 0.2 0.5 0.2 0.1
[3,] 0.2 0.5 0.2 0.1
[4,] 0.2 0.5 0.2 0.1
[5,] 0.2 0.5 0.2 0.1
[6,] 0.2 0.5 0.2 0.1
```

Next we generate the ordered response using the `sample()` function. For each subject, we sample a response according to the category-specific probabilities. After that, we combine our treatment variable and response into a data frame and set the response as an ordered factor.

```
y <- apply(X, 1, function(x) sample(x = c("very good", "good",
                                          "moderate", "poor"),
                                    size = 1,
                                    prob = x))
d <- data.frame(y, trt)
d$y <- factor(d$y, levels = c("very good", "good", "moderate", "poor"),
              ordered = TRUE)
```

With our data simulated, we can run the analysis. For this we use the `polr()` function in the {MASS} package. To assess whether the null hypothesis of no treatment effect is rejected, we use the `Anova()` function from the {car} package, since the `summary()` method for polr objects does not return p-values.

```
library(MASS)
library(car)
m <- polr(y ~ trt, data = d)
out <- Anova(m)
out$`Pr(>Chisq)` < 0.05
```

```
[1] TRUE
```

Now let’s repeat this process 500 times and see how often we reject the null hypothesis of no treatment effect. We can do this using the `replicate()` function. The result will be a vector of TRUE/FALSE values, and the proportion of TRUE values is an estimate of power.

```
rout <- replicate(n = 500, expr = {
  y <- apply(X, 1, function(x) sample(x = c("very good", "good",
                                            "moderate", "poor"),
                                      size = 1,
                                      prob = x))
  d <- data.frame(y, trt)
  d$y <- factor(d$y, levels = c("very good", "good", "moderate", "poor"),
                ordered = TRUE)
  m <- polr(y ~ trt, data = d)
  out <- Anova(m)
  out$`Pr(>Chisq)` < 0.05
})
mean(rout)
```

```
[1] 0.89
```

The proportion of 0.89 is very close to 0.899, the estimated power we calculated above.
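Simulation-based power estimates carry Monte Carlo error. A rough sketch of the standard error of a proportion based on 500 replicates, using the 0.89 estimate above:

```
# approximate Monte Carlo standard error of the simulated power estimate
p_hat <- 0.89
n_sim <- 500
sqrt(p_hat * (1 - p_hat) / n_sim)  # about 0.014
```

This suggests the simulated power is accurate to within a couple of percentage points; increasing `n` in `replicate()` would tighten it further.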

Now let’s include a second predictor in our model. This model can be stated as

\[\text{logit}(P(Y \le j)) = \alpha_j - \beta_1 x_1 - \beta_2 x_2\]

We’ll add a simple numeric predictor by sampling values from a standard normal distribution, and we’ll set the \(\beta_2\) coefficient to the log of an odds ratio of 1.3. Since this results in a positive coefficient, higher values of \(x_2\) will produce *smaller* cumulative probabilities.

```
x2 <- rnorm(94 * 2)
beta_2 <- log(1.3)
```

For example, when a person in the treated group has \(x_2 = 1\), their cumulative probability for an assessment of “good or better” is

```
p1 <- plogis(alpha[2] - beta_1*1 - beta_2*1)
p1
```

```
good
0.8133971
```

But when a person in the treated group has \(x_2 = 2\), their cumulative probability for an assessment of “good or better” drops to

```
p2 <- plogis(alpha[2] - beta_1*1 - beta_2*2)
p2
```

```
good
0.7702764
```

This decreases the odds of “good or better” by about 23% (i.e., 1 - 0.77 = 0.23):

```
(p2/(1 - p2))/
(p1/(1 - p1))
```

```
good
0.7692308
```

This is also what we get when we exponentiate the \(\beta_2\) coefficient. (Don’t forget the minus sign. Remember, positive coefficients become negative.)

```
exp(-beta_2)
```

```
[1] 0.7692308
```

Now we’re ready to simulate data sets and fit proportional odds models. The main difference below is the use of the `mapply()` function, since we now have multiple predictors to apply our model to. We set `SIMPLIFY = FALSE` to ensure it returns a list that we can convert to a matrix. We also extract the first p-value from the call to `Anova()`, which corresponds to the treatment effect.

```
rout <- replicate(n = 500, expr = {
  x2 <- rnorm(94 * 2)
  X <- mapply(function(x1, x2) plogis(alpha - beta_1 * x1 - beta_2 * x2),
              trt, x2, SIMPLIFY = FALSE) |>
    do.call("rbind", args = _) |>
    cbind("poor" = 1) |>
    apply(1, function(x) diff(c(0, x))) |>
    t()
  y <- apply(X, 1, function(x) sample(x = c("very good", "good",
                                            "moderate", "poor"),
                                      size = 1,
                                      prob = x))
  d <- data.frame(y, trt, x2)
  d$y <- factor(d$y, levels = c("very good", "good", "moderate", "poor"),
                ordered = TRUE)
  m <- polr(y ~ trt + x2, data = d)
  out <- Anova(m)
  out$`Pr(>Chisq)`[1] < 0.05
})
mean(rout)
```

```
[1] 0.9
```

In this case the estimated power of 0.9 is about the same as the original power we calculated.

## References

- Fox J, Weisberg S (2019). *An R Companion to Applied Regression*, Third edition. Sage, Thousand Oaks, CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
- Harrell Jr F (2024). *Hmisc: Harrell Miscellaneous*. R package version 5.1-3. https://CRAN.R-project.org/package=Hmisc.
- R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Venables WN, Ripley BD (2002). *Modern Applied Statistics with S*, Fourth edition. Springer, New York. ISBN 0-387-95457-0.
- Whitehead J (1993). Sample size calculations for ordered categorical data. *Statistics in Medicine*, 12:2257–2271.

*Clay Ford*

*Research Data Scientist*

*University of Virginia Library*

*June 7, 2024*

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**
