The power of a test is the probability of correctly rejecting a false null hypothesis. For example, let’s say we suspect a coin is not fair and lands heads 65% of the time.

- The null hypothesis is that the coin is not biased to land heads; it lands heads 50% of the time.
- The alternative hypothesis is that the coin is biased to land heads; it lands heads more than 50% of the time.

We plan to flip the coin 50 times and calculate the proportion of heads. We’ll analyze the data using a one-sample proportion test. We’ll reject the null hypothesis if our *p*-value is below 0.05. Assuming our hunch is correct, that the coin lands heads 65% of the time, what is the probability of correctly rejecting the null hypothesis? That’s power. Ideally we want power to be high, say, greater than 0.90.
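
For this particular setup, power can even be computed directly from the binomial distribution with no packages at all. Below is a minimal sketch, assuming we analyze the flips with a one-sided exact binomial test; because the exact test is discrete (and conservative), its power will differ somewhat from the approximate methods used in this article.

```r
# Exact power of a one-sided binomial test: a sketch assuming
# n = 50 flips, alpha = 0.05, and a true heads probability of 0.65
n <- 50
alpha <- 0.05

# smallest number of heads k such that P(X >= k | p = 0.5) <= alpha
crit <- qbinom(1 - alpha, size = n, prob = 0.5) + 1

# power: probability of seeing at least `crit` heads when p = 0.65
pbinom(crit - 1, size = n, prob = 0.65, lower.tail = FALSE)
```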

In simple experiments such as this, it is relatively easy to calculate power. In R we can use the `pwr.p.test()` function in the pwr package. The `h` argument is for effect size. We use the `ES.h()` function to calculate the effect size of 0.65 versus 0.50. We also specify our sample size, our significance level, and the direction of our alternative hypothesis.

```
# install.packages("pwr")
library(pwr)
pwr.p.test(h = ES.h(p1 = 0.65, p2 = 0.50),
           n = 50, sig.level = 0.05,
           alternative = "greater")
```

```
     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.3046927
              n = 50
      sig.level = 0.05
          power = 0.6948512
    alternative = greater
```
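
The effect size `h` reported here is the difference between the two proportions on the arcsine-transformed scale. A quick check of that calculation using only base R:

```r
# ES.h() computes h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2)),
# the arcsine-transformed difference in proportions
h <- 2*asin(sqrt(0.65)) - 2*asin(sqrt(0.50))
h
# matches the h = 0.3046927 reported above
```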

The power is computed to be about 0.70. There’s a decent, but not great, chance we’ll reject the null hypothesis (assuming the coin really is biased to land heads 65% of the time). How many times should we flip the coin to have a power of 0.90? We can rerun the same code, but this time specify `power` and leave out `n`.

```
pwr.p.test(h = ES.h(p1 = 0.65, p2 = 0.50),
           power = 0.90, sig.level = 0.05,
           alternative = "greater")
```

```
     proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.3046927
              n = 92.24545
      sig.level = 0.05
          power = 0.9
    alternative = greater
```

It looks like we need to flip our coin at least 93 times to have a 90% chance of rejecting the null hypothesis. Once again, this assumes the coin really is loaded to land heads 65% of the time.

We could also calculate power and sample size for this experiment using simulation. The basic idea is to simulate running the experiment many times and calculate the proportion of times we reject the null hypothesis. This proportion provides an estimate of power. Let’s try that.

Below we simulate flipping a biased coin 50 times using the `sample()` function. Notice we use the `prob` argument to specify that “H” is sampled 65% of the time. Next we run a one-sample proportion test to see if we have evidence of a coin biased to land heads. If the *p*-value is below 0.05, then we reject the null hypothesis.

```
d <- sample(c("H","T"), size = 50, replace = TRUE, prob = c(0.65, 0.35))
test <- prop.test(x = sum(d == "H"), n = 50, alternative = "greater")
test$p.value < 0.05
```

```
[1] TRUE
```

We can replicate this code many times using the `replicate()` function. Simply wrap the code in curly brackets and specify how many times you want to replicate the code. Below we run our code 1000 times and save the result as `sim`. The result is a vector of `TRUE`/`FALSE` values. Taking the mean of `TRUE`/`FALSE` values returns the proportion of `TRUE`s. In this case, the proportion of `TRUE`s will be an estimate of power.

```
sim <- replicate(n = 1000, expr = {
  d <- sample(c("H","T"), size = 50,
              replace = TRUE, prob = c(0.65, 0.35))
  test <- prop.test(x = sum(d == "H"), n = 50,
                    alternative = "greater")
  test$p.value < 0.05
})
mean(sim)
```

```
[1] 0.625
```

The estimate is about 0.63. The reason that’s lower than what we calculated before is that the `prop.test()` function uses Yates’s continuity correction by default. If we turn that off by setting `correct = FALSE`, we get a result similar to before.

```
sim <- replicate(n = 1000, expr = {
  d <- sample(c("H","T"), size = 50,
              replace = TRUE, prob = c(0.65, 0.35))
  test <- prop.test(x = sum(d == "H"), n = 50,
                    alternative = "greater",
                    correct = FALSE)
  test$p.value < 0.05
})
mean(sim)
```

```
[1] 0.717
```

To try different sample sizes, we can create a function that allows us to set the sample size, and then run the function with various sample sizes to see how power changes. Below we create a function called `sim_power()` that is basically a copy of our previous code. We simply replace `50` with `n`.

```
sim_power <- function(n){
  sim <- replicate(n = 1000, expr = {
    d <- sample(c("H","T"), size = n,
                replace = TRUE, prob = c(0.65, 0.35))
    test <- prop.test(x = sum(d == "H"), n = n,
                      alternative = "greater",
                      correct = FALSE)
    test$p.value < 0.05
  })
  mean(sim)
}
```

Now we can “apply” this function to various sample sizes and see how power changes. The `sapply()` function simplifies the output to a vector.

```
sample_size <- c(50, 75, 100, 125)
power <- sapply(sample_size, sim_power)
data.frame(sample_size, power)
```

```
  sample_size power
1          50 0.746
2          75 0.842
3         100 0.914
4         125 0.963
```

It looks like we would want to do around 100 flips for about 90% power, which agrees with our previous result using the `pwr.p.test()` function. If you run this code, you’ll get slightly different answers because we’re using simulated data.

For a simple experiment such as coin flipping, simulation isn’t necessary to determine power and sample size. But simulation can be very helpful when your experiment is more complicated.

Let’s say we want to investigate the effects of different weights of paper and different designs of paper airplanes on the distance flown in inches. Perhaps we have two designs (a and b) and two weights of paper (20 lb and 28 lb). That’s 2 x 2 = 4 combinations of paper airplanes. What power do we have to detect a difference of 12 inches for design “b” with a significance level of 0.01, assuming a sample size of 10 per group and a common standard deviation of 10 inches between the four groups?

To begin, let’s just simulate one data set and one analysis.

```
n <- 10 # per group
design <- rep(c("a","b"), each = n*2)
weight <- rep(c("20 lb","28 lb"), times = n*2)
table(design, weight)
```

```
      weight
design 20 lb 28 lb
     a    10    10
     b    10    10
```

Next we simulate our response according to the following assumptions about expected flight distance. Perhaps these are based on a small pilot study.

- design a, 20 lb paper: 72 inches of flight
- design b, 20 lb paper: 72 + 12
- design a, 28 lb paper: 72 + -4
- design b, 28 lb paper: 72 + 12 + -4

These are known as *additive* effects. The effect of design b is to add 12 inches to expected flight distance. The effect of 28 lb paper is to add -4 inches to expected flight distance.

The `rnorm()` function at the end adds “noise” to the distance and is analogous to setting the standard deviation for each design/weight combination to 10.

```
dist <- 72 +
  (design == "b")*12 +
  (weight == "28 lb")*-4 +
  rnorm(n*4, mean = 0, sd = 10)
```

And finally we run the analysis using the `lm()` function.

```
m <- lm(dist ~ design + weight)
```

We can extract the hypothesis test for the `designb` coefficient and check whether its *p*-value is less than 0.01 as follows:

```
coef(summary(m))['designb','Pr(>|t|)'] < 0.01
```

```
[1] TRUE
```

Now repeat the simulation and analysis 1000 times.

```
results <- replicate(1000, expr = {
  dist <- 72 + (design == "b")*12 +
    (weight == "28 lb")*-4 +
    rnorm(n*4, mean = 0, sd = 10)
  m <- lm(dist ~ design + weight)
  coef(summary(m))['designb','Pr(>|t|)'] < 0.01
})
mean(results)
```

```
[1] 0.859
```

That’s fairly high power. If our assumptions are correct, 10 observations of each of the four different airplane designs provide a high probability of correctly rejecting the null of no design effect at a significance level of 0.01.

What if we wanted to detect an *interaction* between design and weight? Interactions imply the effects are *not* additive. The effect of design depends on weight and vice versa.

Let’s say it would be of interest if the interaction was 6 inches. For example, the effect of design b is 12 inches for 20 lb paper, but 12 + 6 = 18 inches for 28 lb paper.

We might simulate our response according to the following:

- design a, 20 lb paper: 72
- design b, 20 lb paper: 72 + 12
- design a, 28 lb paper: 72 + -4
- design b, 28 lb paper: 72 + 12 + -4 + 6

To specify the interaction in `lm()`, we use the `:` operator.

```
dist <- 72 +
  (design == "b")*12 +
  (weight == "28 lb")*-4 +
  (design == "b")*(weight == "28 lb")*6 +
  rnorm(n*4, mean = 0, sd = 10)
m <- lm(dist ~ design + weight + design:weight)
coef(summary(m))['designb:weight28 lb','Pr(>|t|)'] < 0.01
```

```
[1] FALSE
```

Let’s run it 1000 times and see what proportion of times we reject the null hypothesis of no interaction:

```
results <- replicate(1000, expr = {
  dist <- 72 + (design == "b")*12 + (weight == "28 lb")*-4 +
    (design == "b")*(weight == "28 lb")*6 +
    rnorm(n*4, mean = 0, sd = 10)
  m <- lm(dist ~ design + weight + design:weight)
  coef(summary(m))['designb:weight28 lb','Pr(>|t|)'] < 0.01
})
mean(results)
```

```
[1] 0.045
```

The estimated power is very low. We need a much bigger sample size to detect this interaction.

Let’s turn this into a function so we can easily try different sample sizes.

```
sim_model <- function(n){
  results <- replicate(n = 1000, expr = {
    design <- rep(c("a","b"), each = n*2)
    weight <- rep(c("20 lb","28 lb"), times = n*2)
    dist <- 72 + (design == "b")*12 + (weight == "28 lb")*-4 +
      (design == "b")*(weight == "28 lb")*6 +
      rnorm(n*4, mean = 0, sd = 10)
    m <- lm(dist ~ design + weight + design:weight)
    coef(summary(m))['designb:weight28 lb','Pr(>|t|)'] < 0.01
  })
  mean(results)
}
# try sample sizes of 50, 100, 150
sample_size <- c(50, 100, 150)
power <- sapply(sample_size, sim_model)
data.frame(sample_size, power)
```

```
  sample_size power
1          50 0.279
2         100 0.671
3         150 0.867
```

It appears we need about 15 times the sample size (150 per group versus 10 per group) to detect the hypothesized interaction with decent power! (See this blog post for more insight on sample sizes for interactions.)

Again, this is all one big thought experiment based on our assumptions. It doesn’t guarantee we’ll find large effects and get “significant” results. It simply gives us an approximate sample size to aim for given what we believe might be true about our paper airplane designs.

If we wanted, we could build in more assumptions, such as unequal variances between groups, unequal sample sizes per group, block effects to account for different people throwing the airplanes, and so forth. This would make our simulation and subsequent analysis more complicated, but it may be necessary to get accurate estimates of power and sample size. Simulating data forces us to think about how we’re going to analyze our data, which in turn makes us confront our assumptions about how the data are generated.
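
For instance, allowing unequal variances requires only a small change to the simulation. Here is a minimal sketch of one simulated data set, assuming (hypothetically) that design b’s flight distances are more variable than design a’s; the standard deviations of 8 and 14 are made-up values for illustration.

```r
# Sketch: one simulated data set with a different sd per design.
# The sds of 8 (design a) and 14 (design b) are hypothetical values.
n <- 10  # per group
design <- rep(c("a","b"), each = n*2)
weight <- rep(c("20 lb","28 lb"), times = n*2)
sds <- ifelse(design == "b", 14, 8)  # rnorm() accepts a vector of sds
dist <- 72 + (design == "b")*12 + (weight == "28 lb")*-4 +
  (design == "b")*(weight == "28 lb")*6 +
  rnorm(n*4, mean = 0, sd = sds)
m <- lm(dist ~ design + weight + design:weight)
```

With unequal variances, the ordinary `lm()` standard errors are no longer strictly correct, so a more careful analysis might model the variances directly; the simulation idea, though, is unchanged.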

## References

- Champely, S. (2020). *pwr: Basic functions for power analysis* (Version 1.3-0). https://CRAN.R-project.org/package=pwr
- Gelman, A., & Hill, J. (2007). *Data analysis using regression and multilevel/hierarchical models*. Cambridge University Press. Chapter 20.
- R Core Team. (2021). *R: A language and environment for statistical computing*. R Foundation for Statistical Computing. https://www.R-project.org/

Clay Ford

Statistical Research Consultant

University of Virginia Library

October 01, 2021

**For questions or clarifications regarding this article, contact statlab@virginia.edu.**
