In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.

One method for estimating sample sizes for categorical outcomes is Whitehead (1993). This is the method implemented in the {Hmisc} package. Let’s work the first example presented in Whitehead (page 2260).

Consider an experiment where subjects are randomly assigned to two treatment groups—control or experimental treatment—and are assessed by a doctor at the end of the trial with an ordered outcome of *very good*, *good*, *moderate*, or *poor*. To estimate a sample size for this experiment, assuming we will analyze it with a proportional odds model, we need the following information:

- **Power**: the desired probability that we correctly reject the null hypothesis of no treatment effect, assuming the experimental treatment really is efficacious.
- **Significance level**: the type I error we're willing to assume.
- **Marginal probabilities**: the probability that a subject will be in each response level, averaged over the two treatment groups.
- **Reference improvement**: the odds ratio that expresses the improvement of the experimental treatment over control.
- **Fraction**: the fraction of subjects allocated to the control group.

The power and significance level are the easiest to specify since we get to pick them. This example uses a power of 0.9 and a significance level of 0.05.

The marginal probabilities are the most difficult to specify. These are *projected* probabilities based on previous experience and/or pilot studies. The Whitehead example proposes the following probabilities for the control group, which we enter below as a named vector:

```
control <- c("very good" = 0.2, "good" = 0.5, "moderate" = 0.2, "poor" = 0.1)
control
```

```
very good good moderate poor
0.2 0.5 0.2 0.1
```

Taking the cumulative sum of this vector yields projected *cumulative probabilities* of having a particular outcome or *better*. For example, the probability of being assessed as good or better is 0.7. (It is certain that a subject will be assessed as poor or better, because the subject is guaranteed to be placed into one of the categories.)

```
C_control <- cumsum(control)
C_control
```

```
very good good moderate poor
0.2 0.7 0.9 1.0
```

Now let’s say it’s reasonable and achievable that the experimental treatment will increase the probability of good or better from 0.7 to 0.85. This produces an odds ratio of about 2.42, which we calculate below using the *cross-product rule*.

```
OR <- (0.85*(1-0.7))/(0.7*(1-0.85))
OR
```

```
[1] 2.428571
```

Since we plan to analyze this data as if the odds ratio is constant when comparing cumulative probabilities (i.e., the proportional odds assumption), we can use this odds ratio to derive the other cumulative probabilities for the experimental treatment group. Since the formula for the odds ratio is

\[OR = \frac{p_e(1 - p_c)}{p_c(1 - p_e)} \]

where \(p_e\) is the cumulative probability for the experimental treatment group and \(p_c\) is the cumulative probability for the control group, we can solve for \(p_e\) and obtain

\[p_e = \frac{ORp_c}{1 - p_c + ORp_c} \]

We can code this as a function and apply it to the projected cumulative control probabilities to obtain projected cumulative treatment probabilities.

```
f <- function(or, pc){
  (or * pc)/(1 - pc + or * pc)
}
C_experimental <- sapply(C_control, f, or = OR)
C_experimental
```

```
very good good moderate poor
0.3777778 0.8500000 0.9562500 1.0000000
```

And now we can derive the category-specific probabilities using the `diff()` function. The `diff()` function, when used with its default settings, calculates the lagged differences of adjacent elements in a vector. For example, `diff(c(5,8,4))` returns 3 and -4 (i.e., 8 - 5 = 3 and 4 - 8 = -4). Notice we need to prepend 0 to the vector to ensure `diff()` returns a vector the same length as `C_experimental`.

```
treatment <- diff(c(0,C_experimental))
treatment
```

```
very good good moderate poor
0.3777778 0.4722222 0.1062500 0.0437500
```

Let’s look at both sets of probabilities in a table:

```
probs <- rbind(control, treatment)
probs
```

```
very good good moderate poor
control 0.2000000 0.5000000 0.20000 0.10000
treatment 0.3777778 0.4722222 0.10625 0.04375
```

Recall that to calculate the sample size we need marginal probabilities. The marginal probabilities of this table are the column averages.

```
marg_probs <- apply(probs, 2, mean)
marg_probs
```

```
very good good moderate poor
0.2888889 0.4861111 0.1531250 0.0718750
```

The last item to consider is the fraction of subjects allocated to the control group. The easiest option is 0.5. In other words, assign an equal number of subjects to each group. This is the allocation specified in the example.

Now we have everything we need to estimate the sample size for this experiment.

- **Power**: 0.9
- **Significance level**: 0.05
- **Marginal probabilities**: 0.2888889, 0.4861111, 0.153125, 0.071875
- **Reference improvement**: 2.4285714
- **Fraction**: 0.5

We’ll use the `posamsize()` function in the {Hmisc} package to make the calculations. If you don’t have this package, you’ll need to install it by running `install.packages("Hmisc")`. The `posamsize()` function has an argument for each of the quantities listed above. The `p` argument is for the marginal probabilities. The `odds.ratio` argument is for the reference improvement.

```
library(Hmisc)
posamsize(p = marg_probs, odds.ratio = OR, fraction = 0.5,
          alpha = 0.05, power = 0.9)
```

```
Total sample size: 186.9
Efficiency of design compared with continuous response: 0.857
```

The result is 186.9, which we round up to 187. Assuming our projected probabilities are correct, we need 187/2 = 93.5, or 94 subjects in each group. The reported *efficiency* compares our design using an ordered categorical response with four levels to a design using a continuous response. In this case, an ordered outcome with four levels is about 86% efficient relative to a continuous outcome.
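The rounding above can be sketched in a couple of lines, starting from the 186.9 total that `posamsize()` reported:

```
# round the total sample size up, then split evenly between the two groups
n_total <- ceiling(186.9)            # 187
n_per_group <- ceiling(n_total / 2)  # 94
```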

If we are constrained to a sample size and need to estimate the power of the experiment, we can use the `popower()` function, also in the {Hmisc} package. Below we confirm that 94 subjects in each group yields an estimated power of 0.9 using our projected probabilities.

```
popower(p = marg_probs, odds.ratio = OR, n1 = 94, n2 = 94,
        alpha = 0.05)
```

```
Power: 0.899
Efficiency of design compared with continuous response: 0.857
Approximate standard error of log odds ratio: 0.2744
```

It’s not always reasonable to assume that we can assign equal numbers to each group. For example, due to excessive cost or limited supply of the experimental treatment, we may need to allocate more subjects to the control group. Below we set `fraction = 2/3` to indicate that 2/3 of subjects will be assigned to the control group, implying the control group will have twice as many subjects as the treatment group. This means we have an *allocation ratio* of 2, or A = 2.
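In general, an allocation ratio of A corresponds to a control fraction of A/(A + 1). A quick sketch of the conversion (the helper name is ours, not part of {Hmisc}):

```
# convert an allocation ratio (control:treatment) to the control fraction
ratio_to_fraction <- function(A) A / (A + 1)
ratio_to_fraction(2)  # 2/3
ratio_to_fraction(3)  # 3/4
```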

```
posamsize(p = marg_probs, odds.ratio = OR, fraction = 2/3,
          alpha = 0.05, power = 0.9)
```

```
Total sample size: 210.2
Efficiency of design compared with continuous response: 0.857
```

Notice the sample size jumps to 210.2, which we round up to 211. The number of subjects assigned to the treatment group is \(211 \times \frac{1}{A + 1} = 211 \times \frac{1}{3} \approx 71\).

Setting the fraction to 3/4 so that the control group has 3 times as many subjects as the treatment group increases the sample size even further. This is an allocation ratio of A = 3.

```
posamsize(p = marg_probs, odds.ratio = OR, fraction = 3/4,
          alpha = 0.05, power = 0.9)
```

```
Total sample size: 249.2
Efficiency of design compared with continuous response: 0.857
```

Rounding up to 250, the number of subjects assigned to the treatment group is \(250 \times \frac{1}{A + 1} = 250 \times \frac{1}{4} \approx 63\).

Whitehead shows that as the allocation ratio increases, the number of subjects in the treatment group approaches a lower limit while the number of subjects in the control group continues to grow. Because of this, Whitehead states that “allocation ratios in excess of 4 would seldom be justified” (page 2264).
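We can see the treatment group shrinking using the two rounded-up totals computed above, 211 and 250, since the treatment group receives 1/(A + 1) of the total:

```
# treatment-group sizes for the A = 2 and A = 3 designs above
totals <- c(211, 250)
A <- c(2, 3)
ceiling(totals / (A + 1))  # 71 and 63 subjects, respectively
```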

Often we’ll want to include multiple predictors in our proportional odds models. For example, in addition to a treatment group we may want to adjust for sex, body weight, blood pressure, or other prognostic factors. This complicates power and sample size calculations and exceeds the functionality of the {Hmisc} functions. One way to proceed is using simulation. This is where we simulate many data sets of a certain size from a proportional odds model with hypothesized parameters, fit the correct model to the data sets, and calculate the proportion of models that correctly reject the null of no effect.

Let’s demonstrate using the single treatment predictor first to ensure it agrees with the calculations from above.

The proportional odds model with a single predictor is expressed as

\[\text{logit}(P(Y \le j)) = \alpha_j - \beta_1 x \]

The \(\alpha_j\) represent the intercepts. In this case, these are the projected cumulative probabilities of the control group on the logit, or log odds, scale. Since the cumulative probability of the “poor” group is 1, we only have three intercepts. In general, if you have *J* levels of an ordered categorical variable, you will have *J-1* intercepts.

The \(\beta_1\) is the treatment effect. This is the odds ratio we calculated above, converted to the log odds scale. It’s important to notice the minus sign in front of the coefficient. It implies positive coefficients become negative and produce *smaller* cumulative probabilities, while negative coefficients become positive and produce *larger* cumulative probabilities. We hypothesized that our treatment effect should produce larger cumulative probabilities; therefore, we need to express the log odds ratio for the treatment effect as a negative value.

The left-hand side is the cumulative probability of being at or below level *j*. For example, since our levels are ordered *very good*, *good*, *moderate*, and *poor*, \(P(Y \le 2)\) corresponds to the probability of *good* or *very good*.
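For instance, \(P(Y \le 2)\) is just the sum of the first two category probabilities. (The `control` vector defined earlier is repeated here so the snippet stands alone.)

```
control <- c("very good" = 0.2, "good" = 0.5, "moderate" = 0.2, "poor" = 0.1)
sum(control[1:2])  # P(Y <= 2) = 0.7, i.e., good or better
```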

Let’s assign our alpha and beta values. Notice we use the `qlogis()` function to convert the cumulative probabilities to log odds and the `log()` function to convert the odds ratio to log odds. Also notice we only want the first three levels of our cumulative probabilities.

```
J <- 4
alpha <- qlogis(C_control[1:(J-1)])
beta_1 <- -log(OR) # make negative
```

Our predictor, treatment level, takes two values: 0 for the control group and 1 for the treatment group. We can plug 0 or 1 into our model and get the original cumulative probabilities after we take the inverse logit using `plogis()`. Notice how the cumulative probabilities *increase* for the treated group.

```
# control
plogis(alpha - beta_1 * 0)
```

```
very good good moderate
0.2 0.7 0.9
```

```
# treated
plogis(alpha - beta_1 * 1)
```

```
very good good moderate
0.3777778 0.8500000 0.9562500
```

Once we have the cumulative probabilities, we need to derive the category-specific probabilities. To do that we need to add the poor group, which has a cumulative probability of 1, and then take the differences. Once again we need to prepend 0 to the cumulative probabilities to ensure we get a vector of length four.

```
tmp <- c(plogis(alpha - beta_1 * 0), "poor" = 1)
diff(c(0, tmp))
```

```
very good good moderate poor
0.2 0.5 0.2 0.1
```

We need to do this for all 188 subjects (94 in each group). Below we generate a treatment variable for all subjects, apply the model to each subject, convert the result to a matrix using `do.call()`, add the cumulative probability for the *poor* group (which is 1), and then undo the cumulative probabilities to get the category-specific probabilities. It’s more work than necessary in a simple case like this, but it will come in handy when we include multiple predictors.

```
trt <- rep(0:1, each = 94)
X <- lapply(trt, function(x) plogis(alpha - beta_1 * x)) |>
  do.call("rbind", args = _) |>
  cbind("poor" = 1) |>
  apply(1, function(x) diff(c(0, x))) |>
  t() # transpose the matrix
head(X)
```

```
very good good moderate poor
[1,] 0.2 0.5 0.2 0.1
[2,] 0.2 0.5 0.2 0.1
[3,] 0.2 0.5 0.2 0.1
[4,] 0.2 0.5 0.2 0.1
[5,] 0.2 0.5 0.2 0.1
[6,] 0.2 0.5 0.2 0.1
```

Next we generate the ordered response using the `sample()` function. For each subject, we sample a response according to the category-specific probabilities. After that, we combine our treatment variable and response into a data frame and set the response as an ordered factor.

```
y <- apply(X, 1, function(x) sample(x = c("very good", "good",
                                          "moderate", "poor"),
                                    size = 1,
                                    prob = x))
d <- data.frame(y, trt)
d$y <- factor(d$y, levels = c("very good", "good", "moderate", "poor"),
              ordered = TRUE)
```

With our data simulated, we can run the analysis. For this we use the `polr()` function in the {MASS} package. To assess whether the null hypothesis of no treatment effect is rejected, we use the `Anova()` function from the {car} package, since the `summary()` method for polr objects does not return p-values.

```
library(MASS)
library(car)
m <- polr(y ~ trt, data = d)
out <- Anova(m)
out$`Pr(>Chisq)` < 0.05
```

```
[1] TRUE
```

Now let’s repeat this process 500 times and see how often we reject the null hypothesis of no treatment effect. We can do this using the `replicate()` function. The result will be a vector of TRUE/FALSE values, and the proportion of TRUE values is an estimate of power.

```
rout <- replicate(n = 500, expr = {
  y <- apply(X, 1, function(x) sample(x = c("very good", "good",
                                            "moderate", "poor"),
                                      size = 1,
                                      prob = x))
  d <- data.frame(y, trt)
  d$y <- factor(d$y, levels = c("very good", "good", "moderate", "poor"),
                ordered = TRUE)
  m <- polr(y ~ trt, data = d)
  out <- Anova(m)
  out$`Pr(>Chisq)` < 0.05
})
mean(rout)
```

```
[1] 0.89
```

The proportion of 0.89 is very close to 0.899, the estimated power we calculated above.
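Simulation-based power estimates carry Monte Carlo error. A rough sketch of the standard error of a proportion based on 500 replicates, using the 0.89 estimate above:

```
# approximate Monte Carlo standard error of the simulated power estimate
p_hat <- 0.89
n_sim <- 500
sqrt(p_hat * (1 - p_hat) / n_sim)  # about 0.014
```

This suggests the simulated power is accurate to within a couple of percentage points; increasing `n` in `replicate()` would tighten it further.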

Now let’s include a second predictor in our model. This model can be stated as

\[\text{logit}(P(Y \le j)) = \alpha_j - \beta_1 x_1 - \beta_2 x_2\]

We’ll add a simple numeric predictor by sampling values from a standard normal distribution, and we’ll set the \(\beta_2\) coefficient to the log of an odds ratio of 1.3. Since this results in a positive coefficient, higher values of \(x_2\) will produce *smaller* cumulative probabilities.

```
x2 <- rnorm(94 * 2)
beta_2 <- log(1.3)
```

For example, when a person in the treated group has \(x_2 = 1\), their cumulative probability for an assessment of “good or better” is

```
p1 <- plogis(alpha[2] - beta_1*1 - beta_2*1)
p1
```

```
good
0.8133971
```

But when a person in the treated group has \(x_2 = 2\), their cumulative probability for an assessment of “good or better” drops to

```
p2 <- plogis(alpha[2] - beta_1*1 - beta_2*2)
p2
```

```
good
0.7702764
```

This decreases the odds of “good or better” by about 23% (i.e., 1 - 0.77 = 0.23):

```
(p2/(1 - p2))/
(p1/(1 - p1))
```

```
good
0.7692308
```

This is also what we get when we exponentiate the \(\beta_2\) coefficient. (Don’t forget the minus sign. Remember, positive coefficients become negative.)

```
exp(-beta_2)
```

```
[1] 0.7692308
```

Now we’re ready to simulate data sets and fit proportional odds models. The main difference below is the use of the `mapply()` function, since we now have multiple predictors to apply our model to. We set `SIMPLIFY = FALSE` to ensure it returns a list that we can convert to a matrix. We also extract the first p-value from the call to `Anova()`, which corresponds to the treatment effect.

```
rout <- replicate(n = 500, expr = {
  x2 <- rnorm(94 * 2)
  X <- mapply(function(x1, x2) plogis(alpha - beta_1 * x1 - beta_2 * x2),
              trt, x2, SIMPLIFY = FALSE) |>
    do.call("rbind", args = _) |>
    cbind("poor" = 1) |>
    apply(1, function(x) diff(c(0, x))) |>
    t()
  y <- apply(X, 1, function(x) sample(x = c("very good", "good",
                                            "moderate", "poor"),
                                      size = 1,
                                      prob = x))
  d <- data.frame(y, trt, x2)
  d$y <- factor(d$y, levels = c("very good", "good", "moderate", "poor"),
                ordered = TRUE)
  m <- polr(y ~ trt + x2, data = d)
  out <- Anova(m)
  out$`Pr(>Chisq)`[1] < 0.05
})
mean(rout)
```

```
[1] 0.9
```

In this case the estimated power of 0.9 is about the same as the original power we calculated.

## References

- Fox J, Weisberg S (2019). *An R Companion to Applied Regression*, Third edition. Sage, Thousand Oaks, CA. https://socialsciences.mcmaster.ca/jfox/Books/Companion/.
- Harrell Jr F (2024). *Hmisc: Harrell Miscellaneous*. R package version 5.1-3. https://CRAN.R-project.org/package=Hmisc.
- R Core Team (2024). *R: A Language and Environment for Statistical Computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
- Venables WN, Ripley BD (2002). *Modern Applied Statistics with S*, Fourth edition. Springer, New York. ISBN 0-387-95457-0.
- Whitehead J (1993). Sample size calculations for ordered categorical data. *Statistics in Medicine*, 12:2257–2271.

*Clay Ford*

*Research Data Scientist*

*University of Virginia Library*

*June 7, 2024*

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**
