Understanding Dunnett’s Test

Multiple comparison procedures are fundamental in experimental research. Dunnett’s test, which compares multiple treatments to a single control, is particularly common in laboratory studies. When multiple comparisons are made, proper statistical methods are essential to control false positives. This article demonstrates how the number of comparisons affects p-values in Dunnett’s test, implements the procedure in R, and discusses strategies to improve statistical power.

The puzzle

Students in a chemistry lab came to us with a question about post hoc tests. They tested 20-30 compounds and repeated the experiment three times on different days. The data were analyzed using a one-way ANOVA, and then t-tests between each compound and the control to determine whether the differences were significant—that is, 20-30 comparisons against the control. Later, they learned that individual t-tests were not appropriate, so they used Dunnett’s test instead. However, they were puzzled: after adding just one more compound to the analysis, a result that had been significant became non-significant. They couldn’t understand why, since each compound was tested independently and they assumed adding another compound shouldn’t affect the others’ results. Even more confusing, the same effect size was significant in another lab’s analysis but not in theirs. What could be the issue here?

To make sense of these confusing statistical results, we need to think about the treatment effect and its statistical inference separately. The treatment effect is the actual difference in outcomes between a compound and the control. When the true value of the treatment effect is unknown, researchers try to estimate it with an experiment. Statistical inference determines whether the observed treatment effect is statistically significant (indicated by a small p-value). But even if a treatment effect truly exists, statistical inference doesn’t always confirm its existence. This is because statistical significance depends on sample size and variability within groups. Increasing the sample size or reducing variability makes the test statistic larger, which leads to smaller p-values and detection of the effect (Wasserman, 2004, ch. 10). Likewise, smaller sample sizes or increased variability lead to bigger p-values and potentially to concluding no effect exists (even if it really does). But there is yet another factor affecting statistical inference: the number of comparison groups in a multiple-comparison setting. This is what occurred in the chemistry lab example.
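To see the sample-size effect numerically, base R’s power.t.test() computes the power of a two-sample t-test. The effect size and standard deviation below (delta = 5, sd = 5) are made-up illustrative values, not taken from the lab data.

```r
# Power of a two-sample t-test for a hypothetical effect (delta = 5, sd = 5)
# at two per-group sample sizes. These numbers are illustrative only.
p_n10 <- power.t.test(n = 10, delta = 5, sd = 5, sig.level = 0.05)$power
p_n30 <- power.t.test(n = 30, delta = 5, sd = 5, sig.level = 0.05)$power

round(c(n10 = p_n10, n30 = p_n30), 2)  # power rises with sample size
```

With the same effect and variability, the larger sample size yields substantially higher power to detect the effect.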

To understand this puzzle, we start by examining how multiple comparisons inflate Type I error rates.

Multiple comparisons and the risk of inflated Type I error

When multiple hypotheses are tested on the same set of samples, for example, evaluating 20 compounds in one experiment and comparing each compound to a control, the chance of obtaining false-positive results is much higher than when making a single comparison. Think about flipping a fair coin five times. Getting five heads in a single sequence is unlikely, but if you repeat this experiment many times, the chance of observing at least one sequence of five heads increases. The same logic applies to multiple comparisons: when you test many hypotheses, the chance of rejecting at least one true null hypothesis by chance increases rapidly. This StatLab article nicely demonstrates how multiple comparisons inflate false positives (Type I errors). Here we have an important question: how should we account for this increased risk of Type I errors in our statistical inference?

General methods for controlling Type I error inflation

There are two main approaches to multiplicity control: controlling Family-Wise Error Rate (FWER) and False Discovery Rate (FDR).

The family-wise error rate (FWER) is the probability of making at least one false rejection among all tests. For example, when we test a null hypothesis (that truly is null) at \(\alpha\) = 0.05, there is a 5% chance that we reject the hypothesis by chance. However, if we test 20 null hypotheses (all truly null) at the same \(\alpha\) level, the chance increases to about 64%.1 To control FWER, we can use p-value adjustment methods such as Bonferroni, Holm, Hochberg, and Hommel (Lee & Lee, 2018; Chen et al., 2017). Alternatively, specialized procedures like Tukey’s HSD (for all pairwise comparisons) and Dunnett’s test (for comparisons to control) have built-in FWER control.
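As a quick numerical sketch, the 64% figure can be computed directly, and the generic FWER adjustments mentioned above are available through base R’s p.adjust(). The raw p-values below are hypothetical, chosen only to show how the adjustments behave.

```r
# Probability of at least one false positive among 20 independent tests at alpha = 0.05
1 - 0.95^20  # about 0.64

# FWER adjustments with base R's p.adjust(); raw p-values are hypothetical
p_raw <- c(0.001, 0.010, 0.030, 0.040)
p.adjust(p_raw, method = "bonferroni")  # 0.004 0.040 0.120 0.160
p.adjust(p_raw, method = "holm")        # 0.004 0.030 0.060 0.060
```

Note that Holm’s step-down adjustment is never larger than Bonferroni’s, which is why it is generally preferred when a simple p-value adjustment is needed.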

False discovery rate (FDR) is the proportion of false positives among significant results when you test many hypotheses at one time. For example, if you declare 20 results significant and FDR = 0.05, then 5% of the 20 results are false positives. This means that on average, 1 result out of 20 significant results (0.05 × 20 = 1) is expected to be a false positive. Controlling FDR is less conservative and gives higher statistical power than controlling the family-wise error rate, and thus this approach is more commonly used in exploratory screening. The Benjamini–Hochberg procedure, the Benjamini–Yekutieli procedure, and q-values are commonly used methods. For more information about the FDR, see the resources provided by Columbia Public Health – False Discovery Rate.
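The Benjamini–Hochberg and Benjamini–Yekutieli procedures are also implemented in base R’s p.adjust(). A small sketch with made-up p-values:

```r
# FDR control; raw p-values are hypothetical
p_raw <- c(0.001, 0.010, 0.030, 0.040)
p.adjust(p_raw, method = "BH")  # 0.004 0.020 0.040 0.040
p.adjust(p_raw, method = "BY")  # larger: BY also allows arbitrary dependence
```

At FDR level 0.05, all four of these hypothetical results would be declared significant under BH, illustrating why FDR control retains more power than FWER control in large screens.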

Why Dunnett’s test?

Which multiplicity adjustment is appropriate depends on the structure of the scientific question. In the chemistry lab example, the goal was to compare each compound against a single shared control. Suppose there are three compounds A, B, C and one control group. A full pairwise comparison would test all possible pairs—not only A-Control, B-Control, and C-Control, but also A-B, A-C, and B-C. However, Dunnett’s test compares only each compound with the control.

Dunnett’s test statistic is computed the same way as a regular t-statistic. What differs is how we determine significance. A regular t-test uses the univariate t-distribution to calculate p-values, whereas Dunnett’s test uses the joint multivariate t-distribution of the treatment–control contrasts (Bretz et al., 2010). This distribution depends on the number of treatment groups being compared. As the number of groups increases, the critical value corresponding to \(\alpha\) also increases. This explains why the same effect size and standard error can be significant with fewer comparisons but non-significant with more comparisons.

How the number of comparisons changes P-values in Dunnett’s test: Visualizing the maximum-t framework

Now let’s visually demonstrate how the same t-statistic can yield different p-values as the number of comparisons increases, explaining why Dunnett’s critical values must increase to maintain the family-wise error rate (FWER) at 5%.

We simulate three scenarios following the maximum-t framework described by Bretz et al. (2010). The scenarios differ only in the number of treatment-versus-control comparisons: 1 comparison (regular t-test), 5 comparisons, and 10 comparisons. All other conditions are fixed: each group has a sample size of 30, and the correlation among the treatment–control t-statistics is 0.5.2
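The 0.5 correlation can be checked with a small side simulation (a sketch, separate from the main analysis below): under the null hypothesis, two treatment-versus-control t-statistics that share one control group are correlated at about 0.5 when all group sizes are equal.

```r
# Estimate the correlation between two t-statistics that share a control group
set.seed(2024)
n <- 30
t_stats <- replicate(5000, {
  ctrl <- rnorm(n); trt1 <- rnorm(n); trt2 <- rnorm(n)
  s <- sqrt((var(ctrl) + var(trt1) + var(trt2)) / 3)  # pooled SD under the null
  c((mean(trt1) - mean(ctrl)) / (s * sqrt(2 / n)),
    (mean(trt2) - mean(ctrl)) / (s * sqrt(2 / n)))
})
round(cor(t_stats[1, ], t_stats[2, ]), 2)  # close to 0.5
```

The correlation arises because both statistics contain the same control mean in their numerators; chance fluctuation in the control moves both comparisons together.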

For each scenario, we generate 10,000 samples from the multivariate t-distribution using the rmvt() function from the mvtnorm package (Genz & Bretz, 2009), and from each sample we extract the maximum t-statistic. For example, suppose we compare five groups against a control and obtain the following test statistics; then the maximum statistic is 3.5.

\[\begin{aligned} t_{\text{A-Control}} &= {\small 2.0} \\ t_{\text{B-Control}} &= {\small 3.5} \\ t_{\text{C-Control}} &= {\small 1.8} \\ t_{\text{D-Control}} &= {\small 2.3} \\ t_{\text{E-Control}} &= {\small 1.5} \end{aligned}\]

Repeating this 10,000 times produces 10,000 maximum t-statistics, forming the empirical distribution. Why do we use the maximum t-statistic distribution? That is because the family-wise error rate (FWER) is the probability of at least one false rejection. This refers to the probability that the largest t-statistic exceeds a chosen critical value purely by chance. By choosing the critical value such that \(P(\max T \ge c)= \alpha\), we ensure that the probability of at least one Type I error does not exceed \(\alpha\). For any observed test statistic, the adjusted p-value is computed as \(p_j^{\text{adj}} = P(\max T \ge T_j^{\text{obs}})\), where the probability is evaluated under the global null hypothesis3 using the distribution of the maximum \(t\)-statistic. This adjusted p-value can then be directly compared with the family-wise error rate (FWER) level.


library(mvtnorm)
library(ggplot2) 

set.seed(1984)

n <- 30 # sample size
n_sim <- 10000 # number of simulations
alpha <- 0.05 # target FWER
rho <- 0.5  # correlation due to shared control (equal n Dunnett case)

df_for_k <- function(k, n) (k + 1) * (n - 1) # k=number of treatment vs control comparisons

simulate_max_t <- function(k, n, n_sim, rho = 0.5) {
  df <- df_for_k(k, n)
  cor_matrix <- matrix(rho, k, k)
  diag(cor_matrix) <- 1
  
  # Generate samples directly from multivariate t-distribution
  samples <- rmvt(n = n_sim, sigma = cor_matrix, df = df)
  
  # Extract maximum from each row (one-sided test)
  # For a two-sided test, use max(abs())
  apply(samples, 1, max) 
}

# Simulate maximum t for each scenario
max_t_1  <- simulate_max_t(k = 1,  n, n_sim, rho)
max_t_5  <- simulate_max_t(k = 5,  n, n_sim, rho)
max_t_10 <- simulate_max_t(k = 10, n, n_sim, rho)

# Empirical critical values (simultaneous; one-sided): P(max T >= c) = alpha

cv_1  <- unname(quantile(max_t_1,  1 - alpha))
cv_5  <- unname(quantile(max_t_5,  1 - alpha))
cv_10 <- unname(quantile(max_t_10, 1 - alpha))  

# critical values increase with the number of comparisons (k)
round(c(cv_1 = cv_1, cv_5 = cv_5, cv_10 = cv_10),2)

 cv_1  cv_5 cv_10 
 1.68  2.28  2.49 

The critical values indicate how large the observed t-statistic must be in order to declare significance while controlling the family-wise error rate at 5%. We see that the critical value increases as the number of comparisons increases. To reject the null hypothesis, the t-statistic must be larger than 1.68 for a single comparison, larger than 2.28 for five comparisons, and larger than 2.49 for ten comparisons.
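The simulated critical values can be cross-checked against exact quantiles from mvtnorm’s qmvt() function. This is a sketch assuming the same equal-n setup (so rho = 0.5); the single-comparison case reduces to an ordinary univariate t quantile.

```r
library(mvtnorm)

# Exact one-sided critical value for k treatment-vs-control comparisons
dunnett_cv <- function(k, n, alpha = 0.05, rho = 0.5) {
  df <- (k + 1) * (n - 1)
  if (k == 1) return(qt(1 - alpha, df))  # univariate t quantile
  corr <- matrix(rho, k, k); diag(corr) <- 1
  qmvt(1 - alpha, tail = "lower.tail", df = df, corr = corr)$quantile
}

set.seed(1)
round(c(cv_1 = dunnett_cv(1, 30), cv_5 = dunnett_cv(5, 30), cv_10 = dunnett_cv(10, 30)), 2)
# these should land close to the simulated 1.68, 2.28, and 2.49
```

Small discrepancies from the simulated values reflect Monte Carlo error in the 10,000-sample empirical quantiles.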

Now let’s evaluate how the adjusted p-value changes as the number of comparisons increases for a fixed t-statistic of 2.1. Recall that the p-value is the probability of observing a test statistic as extreme as or more extreme than the observed value under the null hypothesis. We estimate this empirically by calculating the proportion of simulations where the maximum t-statistic is \(\ge\) 2.1, which gives us the adjusted p-value.


t_obs <- 2.1

p_adj_1  <- mean(max_t_1  >= t_obs) 
p_adj_5  <- mean(max_t_5  >= t_obs)
p_adj_10 <- mean(max_t_10 >= t_obs)

round(c(p_adj_1 = p_adj_1, p_adj_5 = p_adj_5, p_adj_10 = p_adj_10),2)

 p_adj_1  p_adj_5 p_adj_10 
    0.02     0.08     0.12 

We see that the adjusted p-value also changes with the number of comparisons. For a single comparison, the adjusted p-value is 0.02.4 However, it increases to 0.08 and 0.12 when 5 and 10 group comparisons are included, respectively.
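These simulation-based estimates can be checked against exact tail probabilities from mvtnorm’s pmvt() function, again assuming the same equal-n, rho = 0.5 setup:

```r
library(mvtnorm)

# Exact adjusted p-value P(max T >= t_obs) under the global null
p_adj_exact <- function(t_obs, k, n, rho = 0.5) {
  df <- (k + 1) * (n - 1)
  if (k == 1) return(1 - pt(t_obs, df))  # ordinary one-sided t-test p-value
  corr <- matrix(rho, k, k); diag(corr) <- 1
  1 - pmvt(lower = rep(-Inf, k), upper = rep(t_obs, k), df = df, corr = corr)[1]
}

set.seed(1)
round(c(k1  = p_adj_exact(2.1, 1, 30),
        k5  = p_adj_exact(2.1, 5, 30),
        k10 = p_adj_exact(2.1, 10, 30)), 2)
# these should be close to the simulated 0.02, 0.08, and 0.12
```

The exact values confirm that the same observed statistic of 2.1 becomes progressively less impressive as the number of comparisons grows.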

To further illustrate this relationship, let’s visualize the distribution of the maximum t-statistic across different numbers of comparisons.


# data
sim_data <- data.frame(
  value = c(max_t_1, max_t_5, max_t_10),
  scenario = factor(rep(c("1 comparison", "5 comparisons", "10 comparisons"),
                        each = n_sim),
                    levels = c("1 comparison", "5 comparisons", "10 comparisons"))
)

# Plot
ggplot(sim_data, aes(x = value, fill = scenario)) +
  geom_density(alpha = 0.5) +

  # dashed line at the single-comparison critical value (cv_1)
  geom_vline(xintercept = cv_1,
             linetype = "dashed",
             color = "#0072B2",
             linewidth = 0.5) +

  
  # Add probability annotations (these are P(max T ≥ t_obs))
  annotate("text",
           x = -3, y = 0.43,
           label = paste0("                p = ", round(p_adj_1, 2)),
           color = "#0072B2",
           size = 4,
           fontface = "bold") +
  annotate("text",
           x = -3, y = 0.40,
           label = paste0("adjusted p = ", round(p_adj_5, 2)),
           color = "#D55E00",
           size = 4,
           fontface = "bold") +
  annotate("text",
           x = -3, y = 0.37,
           label = paste0("adjusted p = ", round(p_adj_10, 2)),
           color = "#009E73",
           size = 4,
           fontface = "bold") +

  xlim(-5, 5) +
  labs(title = "Multivariate t-Distribution of Maximum Statistic",
       subtitle = "The same t-statistic yields larger p-values as the number of comparisons increases (one-sided)",
       x = "Maximum t-statistic",
       y = "Density") +
  theme_minimal() +
  scale_fill_manual(values = c("#0072B2", "#D55E00", "#009E73")) +
  theme(legend.position = "right") +
  ylim(0,0.6)
Multivariate t-distribution of maximum statistic.

As the number of comparisons increases from 1 to 10, the distribution of the maximum t shifts to the right, and thus the critical values and adjusted p-values increase. The vertical dashed line indicates the critical value of 1.7 for a single comparison. We see how this falls in different regions of each distribution: in the far right tail for 1 comparison, but much closer to the center for 10 comparisons.5 This explains why the chemistry lab’s result became non-significant after adding an additional compound: increasing the number of comparisons shifts the reference distribution of the maximum statistic, raising the significance threshold. If the other lab tested fewer compounds (e.g., five), the corresponding critical value would be smaller. We now turn to practical implementation.

Implementing Dunnett’s test in R

We will demonstrate Dunnett’s test using mock chemistry lab data with 1 control and 10 compounds, each tested on 3 separate days (n = 3 per group). The outcome is reaction time (seconds). Let’s assume that we are looking for compounds with shorter reaction times than the control. We randomly generated the data from normal distributions with different means and equal variance. We already know that some compounds’ reaction times, such as those of T8, T9, and T10, are shorter than the control’s. To ensure all comparisons are made against the control, we set the control as the reference group using the ref argument.


set.seed(1984)

# The levels are ordered from T1 to T10, with C as the reference.
compound <- relevel(
  factor(rep(c("C", paste0("T", 1:10)), each = 3),
         levels = c("C", paste0("T", 1:10))),
  ref = "C"
)

reaction_time <- c(
  rnorm(3, mean = 100, sd = 5),   # Control
  rnorm(3, mean = 99,  sd = 5),   # T1
  rnorm(3, mean = 92,  sd = 5),   # T2
  rnorm(3, mean = 98,  sd = 5),   # T3
  rnorm(3, mean = 90,  sd = 5),   # T4
  rnorm(3, mean = 97,  sd = 5),   # T5
  rnorm(3, mean = 93,  sd = 5),   # T6
  rnorm(3, mean = 86,  sd = 5),   # T7
  rnorm(3, mean = 81,  sd = 5),   # T8
  rnorm(3, mean = 74,  sd = 5),   # T9
  rnorm(3, mean = 79,  sd = 5)    # T10
)

The first few rows of the generated data look like this.


chem_data <- data.frame(compound, reaction_time)
head(chem_data)

  compound reaction_time
1        C     102.04602
2        C      98.38488
3        C     103.17926
4       T1      89.76936
5       T1     103.76824
6       T1     104.94245

Before proceeding with the analysis, we first examined the mean and standard deviation and created boxplots of the data.


library(dplyr)

summary_table <- chem_data |>
  group_by(compound) |>
  summarise(
    mean = round(mean(reaction_time), 2),
    sd   = round(sd(reaction_time), 2)
  )

summary_table

# A tibble: 11 × 3
   compound  mean    sd
   <fct>    <dbl> <dbl>
 1 C        101.   2.51
 2 T1        99.5  8.44
 3 T2        90.6  3.61
 4 T3       101.   3.26
 5 T4        93.1  2.68
 6 T5        97.1  8.76
 7 T6        92.4  7.46
 8 T7        87.1  3.98
 9 T8        81.5  0.66
10 T9        76.1  4.36
11 T10       75.7  5.54

ggplot(chem_data, aes(x = compound, y = reaction_time)) +
  geom_boxplot() +
  labs(title = "Boxplot of Reaction Time by Compound",
       x="Compound",
       y="Reaction Time")
Boxplot of Reaction Time by Compound.

The observed group means are close to the prespecified means. Although all data were generated with the same variability (SD = 5), the observed variability differs across compounds due to the small sample size (n = 3 per group).

To test whether the compound reduces reaction time compared with the control, we begin with an ANOVA to test for overall group differences. ANOVA models can be fitted using the lm() function. Use the anova() function to view the ANOVA table.


model_anova <- lm(reaction_time ~ compound, data = chem_data)
anova(model_anova)

Analysis of Variance Table

Response: reaction_time
          Df Sum Sq Mean Sq F value    Pr(>F)    
compound  10 2654.9 265.495  9.5191 6.557e-06 ***
Residuals 22  613.6  27.891                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA test shows that at least one group differs from the others (F = 9.52, p < 0.001). However, we do not yet know which specific compounds differ from the control. We will use a post hoc multiple comparison test to determine this.

For multiple comparison analysis, we can use the multcomp package (Hothorn et al., 2008). The glht() function performs multiple comparisons on a fitted ANOVA model. The mcp() function specifies which groups to compare; here we use the predefined Dunnett contrast for many-to-one comparisons. Since we previously set the control as the reference group, each compound is automatically compared to the control. When the direction of the effect is known, a one-sided test gives greater power. Because we expect the compounds’ reaction times to be shorter than the control’s, we specify alternative = "less" for the test.


library(multcomp)

Dunnett_res <- glht(model_anova, linfct = mcp(compound = "Dunnett"), alternative = "less")
summary(Dunnett_res)


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts


Fit: lm(formula = reaction_time ~ compound, data = chem_data)

Linear Hypotheses:
             Estimate Std. Error t value Pr(<t)    
T1 - C >= 0   -1.7100     4.3121  -0.397 0.8010    
T2 - C >= 0  -10.5642     4.3121  -2.450 0.0695 .  
T3 - C >= 0    0.1065     4.3121   0.025 0.9140    
T4 - C >= 0   -8.0722     4.3121  -1.872 0.1886    
T5 - C >= 0   -4.0881     4.3121  -0.948 0.5689    
T6 - C >= 0   -8.7795     4.3121  -2.036 0.1449    
T7 - C >= 0  -14.0625     4.3121  -3.261 0.0128 *  
T8 - C >= 0  -19.6600     4.3121  -4.559 <0.001 ***
T9 - C >= 0  -25.1301     4.3121  -5.828 <0.001 ***
T10 - C >= 0 -25.4903     4.3121  -5.911 <0.001 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- single-step method)

The summary() function gives us the results of the multiple comparisons. The reaction times of T7 to T10 are significantly reduced compared to control.

Confidence interval and plot

A simultaneous confidence interval describes the range of mean differences between each compound and the control at a specified confidence level, here 95%. Because this is a one-sided test, the lower bound is not of interest and is therefore unbounded. A negative upper bound of the confidence interval indicates that the compound has a shorter reaction time than the control.


ci <- confint(Dunnett_res, level = 0.95)
ci


     Simultaneous Confidence Intervals

Multiple Comparisons of Means: Dunnett Contrasts


Fit: lm(formula = reaction_time ~ compound, data = chem_data)

Quantile = 2.6189
95% family-wise confidence level
 

Linear Hypotheses:
             Estimate lwr      upr     
T1 - C >= 0   -1.7100     -Inf   9.5829
T2 - C >= 0  -10.5642     -Inf   0.7287
T3 - C >= 0    0.1065     -Inf  11.3994
T4 - C >= 0   -8.0722     -Inf   3.2207
T5 - C >= 0   -4.0881     -Inf   7.2048
T6 - C >= 0   -8.7795     -Inf   2.5134
T7 - C >= 0  -14.0625     -Inf  -2.7696
T8 - C >= 0  -19.6600     -Inf  -8.3671
T9 - C >= 0  -25.1301     -Inf -13.8372
T10 - C >= 0 -25.4903     -Inf -14.1974

The simultaneous confidence intervals are consistent with the p-values; as shown in the plot, the upper bounds for T7-T10 are all negative.


plot(ci,
     main = "One-sided 95% Simultaneous Dunnett Confidence Intervals",
     xlab = "Difference in mean reaction time (Compound − Control)")
abline(v = 0, lty = 2, col = "red")
One-sided 95% simultaneous Dunnett confidence intervals.

Improving power in Dunnett’s test

So far, we have discussed the risk of Type I error inflation and how to adjust p-values to control it when performing multiple comparisons. However, efforts to reduce false positives typically come at the cost of reduced power to detect real effects. Fortunately, there are several ways to improve power in Dunnett’s test.

  • As our simulation demonstrated, each additional comparison increases the critical value, making it harder to detect effects. To maintain power, the number of comparisons should be limited to those essential for the research question. Power analysis can help determine either the maximum number of comparisons possible for a target power level or the minimum sample size needed for a planned set of comparisons.
  • If the analysis is exploratory rather than confirmatory, controlling the FDR, which is less conservative than FWER, is more appropriate and provides greater power.
  • Neuhäuser et al. (2021) showed that allocating \(\sqrt{k}\) times as many observations to the control group as to each treatment group (where \(k\) is the number of treatments) is more powerful than equal allocation across all groups. For example, with 5 treatments and \(n\) = 10 per treatment, the square-root allocation rule suggests allocating about 22 observations (\(\sqrt{5} \times 10 \approx 22\)) to the control.
  • Lastly, the step-down Dunnett test is more powerful than the single-step procedure (the default adjustment in summary()). The step-down algorithm is available via adjusted(type = "free") in the summary() function. See Bretz et al. (2010) for more details on the step-down algorithm.

summary(Dunnett_res, test=adjusted(type="free"))


     Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts


Fit: lm(formula = reaction_time ~ compound, data = chem_data)

Linear Hypotheses:
             Estimate Std. Error t value   Pr(<t)    
T1 - C >= 0   -1.7100     4.3121  -0.397 0.500633    
T2 - C >= 0  -10.5642     4.3121  -2.450 0.049208 *  
T3 - C >= 0    0.1065     4.3121   0.025 0.509740    
T4 - C >= 0   -8.0722     4.3121  -1.872 0.108352    
T5 - C >= 0   -4.0881     4.3121  -0.948 0.349867    
T6 - C >= 0   -8.7795     4.3121  -2.036 0.094556 .  
T7 - C >= 0  -14.0625     4.3121  -3.261 0.009832 ** 
T8 - C >= 0  -19.6600     4.3121  -4.559 0.000558 ***
T9 - C >= 0  -25.1301     4.3121  -5.828 2.59e-05 ***
T10 - C >= 0 -25.4903     4.3121  -5.911 2.59e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- free method)

Compared to the single-step result, the step-down procedure increased power: T2 became significant (originally adjusted p-value = 0.07), and T7–T10 also have smaller p-values (though they were already significant).
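As a small aside, the square-root allocation rule from the list above is simple arithmetic:

```r
# Square-root allocation: the control group gets sqrt(k) times the per-treatment n
k <- 5        # number of treatments
n_trt <- 10   # observations per treatment group
n_ctrl <- round(sqrt(k) * n_trt)
n_ctrl  # 22
```

With a fixed total budget of observations, shifting some of them from the treatment groups to the shared control in this ratio improves the precision of every treatment-versus-control contrast at once.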

R session details

All analyses were performed in R (v4.5.1; R Core Team, 2025) on macOS Tahoe 26.2 (Apple M1 Pro architecture), using the packages mvtnorm (v1.3.3), multcomp (v1.4.29), and ggplot2 (v3.5.2).

Acknowledgement

Generative AI (Claude Sonnet) assisted with editing for clarity and grammar. The R simulation code was initially drafted with AI assistance based on the maximum-t framework described in Bretz et al. (2010) and was subsequently reviewed, tested, and refined by the author.

References

  • Bretz, F., Hothorn, T., & Westfall, P. (2010). Multiple Comparisons Using R. CRC Press.
  • Chen, S. Y., Feng, Z., & Yi, X. (2017). A general introduction to adjustment for multiple comparisons. Journal of Thoracic Disease, 9(6), 1725–1729. https://doi.org/10.21037/jtd.2017.05.34
  • Dean, A., & Voss, D. (1999). Design and Analysis of Experiments. Springer.
  • Genz, A., & Bretz, F. (2009). Computation of Multivariate Normal and t Probabilities. Lecture Notes in Statistics. Springer.
  • Hothorn, T., Bretz, F., & Westfall, P. (2008). Simultaneous inference in general parametric models. Biometrical Journal, 50(3), 346–363.
  • Lee, S., & Lee, D. K. (2018). What is the proper way to apply the multiple comparison test? Korean Journal of Anesthesiology, 71(5), 353–360. https://doi.org/10.4097/kja.d.18.00242
  • Neuhäuser, M., Mackowiak, M. M., & Ruxton, G. D. (2021). Unequal sample sizes according to the square-root allocation rule are useful when comparing several treatments with a control. Ethology, 127(12), 1094–1100. https://doi.org/10.1111/eth.13230
  • Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

Hyeseon Seo
Statistical Research Consultant
University of Virginia Library
March 11, 2026


  1. If each test is conducted at \(\alpha\) = 0.05, the probability of not making a Type I error in a single test is 0.95; assuming independence, the probability that none of 20 tests produces a false positive is \(0.95^{20}\) \(\approx\) 0.36, so the probability of at least one false positive is \(1 - 0.95^{20} \approx\) 0.64 (64%).↩︎
  2. The correlation in Dunnett’s test arises because all treatments share the same control group: if the control mean is unusually high or low by chance, all comparisons are affected simultaneously. When the same number of observations is taken for the treatment and control groups, the correlation is approximately 0.5 (Dean & Voss, 1999, p. 87). By accounting for this correlation, Dunnett’s test is more powerful than general methods like Bonferroni that assume independence.↩︎
  3. The global null hypothesis means that there is no difference between any compound and the control: \(H_0: \mu_j = \mu_0\) for \(j=1,\ldots,k\) ↩︎
  4. The standard one-sided p-value is computed as 1 - pt(2.1, df = 30+30-2), which is about 0.02. This is similar to the result obtained from a single-group comparison in the simulation.↩︎
  5. Note that this is a one-sided test, so we observe only a rightward shift. For a two-sided test, we would take the maximum of the absolute values (using max(abs())), and extreme statistics in both directions would contribute to the adjusted p-value.↩︎

For questions or clarifications regarding this article, contact statlab@virginia.edu.

View the entire collection of UVA Library StatLab articles, or learn how to cite.