When designing an experiment, it’s good practice to estimate the number of subjects or observations we’ll need. If we recruit or collect too few, our analysis may be too uncertain or misleading. If we collect too many, we waste time and money on diminishing returns. The optimal sample size provides enough information to answer our research questions with confidence. The traditional approach to sample size estimation is based on hypothesis tests. In this article we present sample size estimation based on confidence intervals.
To motivate the discussion, we’ll generate fake data for an imaginary experiment. Let’s say we ask a female person and a male person to approach randomly chosen strangers entering a grocery store and invite them to take a brief survey on their shopping habits. Of interest is whether the sex of the person administering the survey is associated with the proportion of people who agree to take it. We’ll assume females get people to take the survey 40% of the time, while males get people to take the survey 30% of the time. This is our hypothesized effect size. We’re not sure it’s true, but if it is, we would like our experiment to detect it.
Below we use the rbinom() function to simulate data for this experiment. It generates two random sequences of zeroes and ones from a binomial distribution, where a one indicates an event occurred and a zero indicates it did not. The prob argument allows us to specify the probability an event occurs. We set n = 100 to generate 100 observations for each person. Finally, we take the mean of each sequence to get the proportion of ones. Recall that taking the mean of 0/1 data is equivalent to calculating the proportion of ones.
set.seed(999) # run this line to generate the same "random" data
g1 <- rbinom(n = 100, size = 1, prob = 0.4)
g2 <- rbinom(n = 100, size = 1, prob = 0.3)
c(mean(g1), mean(g2))
[1] 0.39 0.31
The difference in our observed proportions is 0.08.
mean(g1) - mean(g2)
[1] 0.08
Of course the “true” difference is 0.4 - 0.3 = 0.1.
Now let’s analyze the data using the prop.test() function, which performs a two-sample proportion test. The null hypothesis is that the proportions for the two groups are equal. Since we simulated the data, we already know the proportions differ. What does prop.test() think? To use prop.test(), we supply the number of “successes” (i.e., the number of ones) to the x argument and the number of trials to the n argument. We set correct = FALSE to suppress the continuity correction.
prop.test(x = c(sum(g1), sum(g2)), n = c(100, 100),
correct = FALSE)
2-sample test for equality of proportions without continuity correction
data: c(sum(g1), sum(g2)) out of c(100, 100)
X-squared = 1.4066, df = 1, p-value = 0.2356
alternative hypothesis: two.sided
95 percent confidence interval:
-0.05174108 0.21174108
sample estimates:
prop 1 prop 2
0.39 0.31
The result of the hypothesis test is a p-value of 0.2356. This says that, if there were truly no difference in the proportions, the probability of observing a difference this large or larger would be about 0.24. That probability is high compared to traditional significance levels such as 0.01 or 0.05, so we fail to reject the null hypothesis.
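As a quick check on our understanding (a verification sketch, not part of the original analysis; the names p_pool and z_stat are ours), we can reproduce the test statistic by hand. The X-squared value that prop.test() reports is the square of a z statistic based on the pooled proportion.
# reproduce the X-squared statistic and p-value by hand
p_pool <- (sum(g1) + sum(g2)) / 200   # pooled proportion of ones
z_stat <- (mean(g1) - mean(g2)) /
  sqrt(p_pool * (1 - p_pool) * (1/100 + 1/100))
z_stat^2                  # matches X-squared = 1.4066
2 * pnorm(-abs(z_stat))   # matches p-value = 0.2356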
The prop.test() function also calculates a 95% confidence interval on the difference in proportions using a normal approximation, sometimes referred to as the “Wald” method. The reported confidence interval of [-0.05, 0.21] overlaps 0, which tells us we can’t be too certain that one group has a higher proportion than the other.
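To make the “Wald” label concrete, here is the same interval computed by hand (the names p1_hat, p2_hat, and se are ours). Note that the interval uses an unpooled standard error, unlike the test statistic above.
# the Wald interval by hand, using the unpooled standard error
p1_hat <- mean(g1)
p2_hat <- mean(g2)
se <- sqrt(p1_hat * (1 - p1_hat)/100 + p2_hat * (1 - p2_hat)/100)
(p1_hat - p2_hat) + c(-1, 1) * qnorm(0.975) * se   # matches [-0.0517, 0.2117]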
The p-value and confidence interval are misleading here: we know the population proportions are different, but our sample size of 100 per group is too small for this difference to consistently reveal itself. So how large a sample should we collect? A common approach is to determine the sample size that would reliably reject the null hypothesis at a stated significance level, such as 0.05. The power.prop.test() function allows us to do this if we plan to compare the proportions using a two-sample proportion test. Below we plug in our hypothesized proportions, the desired significance level of 0.05, and a power of 0.9.
power.prop.test(p1 = 0.4, p2 = 0.3, sig.level = 0.05, power = 0.9)
Two-sample comparison of proportions power calculation
n = 476.0072
p1 = 0.4
p2 = 0.3
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
Assuming the stated difference in proportions is real, a sample size of 477 per group gives us a probability of 0.9 of rejecting the null at a 0.05 significance level. (Estimated sample sizes are always rounded up.)
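If we like, we can check this claim with a quick simulation sketch: draw many samples of size 477 per group under the hypothesized proportions and see how often the test rejects at the 0.05 level. The rejection rate will vary from run to run, but it should land near 0.9.
# estimate the power of the test with n = 477 per group via simulation
pvals <- replicate(n = 2000, expr = {
  g1 <- rbinom(n = 477, size = 1, prob = 0.4)
  g2 <- rbinom(n = 477, size = 1, prob = 0.3)
  prop.test(x = c(sum(g1), sum(g2)), n = c(477, 477),
            correct = FALSE)$p.value
})
mean(pvals < 0.05)   # proportion of rejections; should be close to 0.9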
But what if instead of simply rejecting the null hypothesis we wanted to estimate the difference in proportions with a desired precision? In our example above, the width of the 95% confidence interval on the difference in proportions was pretty wide, about 0.26. We can calculate that as follows:
prop_out <- prop.test(x = c(sum(g1), sum(g2)), n = c(100, 100),
correct = FALSE)
diff(prop_out$conf.int)
[1] 0.2634822
What sample size do we need to achieve a tighter confidence interval width, such as 0.1? A precise estimate is almost certainly more valuable than simply declaring two proportions different: it would allow us to draw conclusions about the direction and magnitude of the difference.
Let’s look at the formula for the confidence interval on a difference in proportions with equal sample sizes.
\[p_1 - p_2 \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{n}}\] The margin of error is the second half of the formula, which we’ll call \(\epsilon\). The term \(z_{\alpha/2}\) is the upper \(\alpha/2\) quantile of the standard normal distribution; for a 95% confidence interval this is about 1.96. The square root term is the standard error of the difference.
\[\epsilon = z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n} + \frac{p_2(1-p_2)}{n}} \] Squaring both sides and solving for \(n\), we obtain the following:
\[n = \frac{z_{\alpha/2}^2[p_1(1-p_1) + p_2(1-p_2)]}{\epsilon^2} \]
Let’s say we wanted a confidence width of about 0.1. This implies a margin of error of 0.1/2 = 0.05. The necessary sample size for hypothesized proportions of 0.3 and 0.4 would be about 692.
p1 <- 0.3
p2 <- 0.4
e <- 0.1/2
z <- qnorm(p = 0.975) # about 1.96
(z^2) * (p1 * (1 - p1) + p2 * (1 - p2))/e^2
[1] 691.4626
Notice this is much bigger than the power-based sample size estimate. This is because we’re planning to estimate the difference in proportions with precision, not just reject the null hypothesis at an arbitrary significance level.
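Another way to see the relationship is to plug the precision-based sample size back into power.prop.test() and ask what power it implies. (This is a side check under the same hypothesized proportions; the power should come out to roughly 0.97, well above the 0.9 we asked for earlier.)
# what power does the precision-based n imply for the hypothesis test?
power.prop.test(n = 692, p1 = 0.4, p2 = 0.3, sig.level = 0.05)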
The presize package (Haynes et al. 2021) provides the prec_riskdiff() function to do this calculation for us. (The term “risk difference” is often used to describe a difference in proportions.) We simply provide the hypothesized proportions and the desired width of the confidence interval. We also specify method = "wald" to replicate our calculations above.
library(presize)
prec_riskdiff(p1 = 0.3, p2 = 0.4, conf.width = 0.1, method = "wald")
sample size for a risk difference with wald confidence interval
p1 p2 n1 n2 ntot r delta lwr upr conf.width conf.level
1 0.3 0.4 691.4626 691.4626 1382.925 1 -0.1 -0.15 -0.05 0.1 0.95
The n1 and n2 columns report the required sample size. The r column is the allocation ratio. The default is 1, which means equal group sizes.
It’s worth noting that \(p(1-p)\) will never exceed 0.25, no matter the value of \(p\). Therefore we could modify the formula to estimate a sample size for a given precision assuming no knowledge of the proportions:
\[n = \frac{z_{\alpha/2}^2[0.25 + 0.25]}{\epsilon^2} = \frac{z_{\alpha/2}^2}{2\epsilon^2} \]
The worst-case formula produces a larger sample size estimate of about 769:
(z^2) / (2 * e^2)
[1] 768.2918
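(As a quick sanity check on the claim that \(p(1-p)\) never exceeds 0.25, we can evaluate it over a grid of proportions; the name p_grid is ours.)
# p * (1 - p) is maximized at p = 0.5
p_grid <- seq(0, 1, by = 0.01)
max(p_grid * (1 - p_grid))   # 0.25, attained at p = 0.5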
We can get the same conservative estimate using prec_riskdiff() by setting both p1 and p2 equal to 0.5:
prec_riskdiff(p1 = 0.5, p2 = 0.5, conf.width = 0.1, method = "wald")
sample size for a risk difference with wald confidence interval
p1 p2 n1 n2 ntot r delta lwr upr conf.width conf.level
1 0.5 0.5 768.2918 768.2918 1536.584 1 0 -0.05 0.05 0.1 0.95
Let’s simulate data with 692 observations per group and see whether the width of the confidence interval is within 0.1.
g1 <- rbinom(n = 692, size = 1, prob = 0.4)
g2 <- rbinom(n = 692, size = 1, prob = 0.3)
ci <- prop.test(x = c(sum(g1), sum(g2)), n = c(692, 692),
correct = FALSE)$conf.int
diff(ci)
[1] 0.0997548
Using the replicate() function, we can do this, say, 2000 times and see how often the width of the confidence interval is less than or equal to 0.1.
widths <- replicate(n = 2000, expr = {
g1 <- rbinom(n = 692, size = 1, prob = 0.4)
g2 <- rbinom(n = 692, size = 1, prob = 0.3)
ci <- prop.test(x = c(sum(g1), sum(g2)), n = c(692, 692),
correct = FALSE)$conf.int
diff(ci)
})
mean(widths <= 0.1)
[1] 0.5405
summary(widths)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09720 0.09933 0.09990 0.09989 0.10048 0.10230
It’s not a guarantee: the width was at or below 0.1 in only about 54% of the simulated samples. But more often than not the width of the confidence interval is less than 0.1, and it is never much larger.
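If we want a firmer guarantee, we can fall back on the conservative sample size of 769 from the worst-case formula. Because \(p(1-p)\) can never exceed 0.25, a Wald interval based on 769 observations per group can never be wider than 0.1, no matter what proportions we observe. A repeat of the simulation (a sketch; the object name widths769 is ours) bears this out:
# repeat the simulation with the conservative n = 769 per group
widths769 <- replicate(n = 2000, expr = {
  g1 <- rbinom(n = 769, size = 1, prob = 0.4)
  g2 <- rbinom(n = 769, size = 1, prob = 0.3)
  ci <- prop.test(x = c(sum(g1), sum(g2)), n = c(769, 769),
                  correct = FALSE)$conf.int
  diff(ci)
})
mean(widths769 <= 0.1)   # should be exactly 1: the width is capped below 0.1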
The presize package offers over a dozen functions for estimating precision-based sample sizes for various statistics. Each function is well documented and includes helpful examples.
For more information on the motivation behind the presize package and precision-based sample size estimates in general, see Bland (2009).
References
- Bland, J. M. (2009). The tyranny of power: Is there a better way to calculate sample size? BMJ, 339:b3985. https://www.bmj.com/content/339/bmj.b3985
- Haynes et al. (2021). presize: An R-package for precision-based sample size calculation in clinical research. Journal of Open Source Software, 6(60), 3118. https://doi.org/10.21105/joss.03118
- Hogg, R., & Tanis, E. (2006). Probability and Statistical Inference (7th ed.). Pearson. (Ch. 6)
- R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Clay Ford
Statistical Research Consultant
University of Virginia Library
February 20, 2023
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.