Assessing Type S and Type M Errors

The paper "Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors" by Andrew Gelman and John Carlin introduces the idea of performing design calculations to help prevent researchers from being misled by statistically significant results in studies with small samples and/or noisy measurements. The main idea is that researchers often overestimate effect sizes in power calculations and collect noisy (i.e., highly variable) data. Under those conditions, the estimates that happen to reach statistical significance can be greatly exaggerated and even in the wrong direction. We suggest reading the paper if you haven't already. It's very accessible and highly instructive. What we want to do in this article is supplement some of the examples with a bit more explanation and R code for those who need a little extra help with the statistics.

When Gelman and Carlin speak of the "design of the study" and "design calculations," they're referring to the relationship between a true effect size and what can be learned from the data. For example, let's say you want to test the hypothesis that allowing students to chew gum while they take a test helps them perform better than those who do not. If in fact this hypothesis is true, what is the effect of chewing gum? Will it increase performance by an average of 10 points on a test? Probably not. If it helps at all, the effect will likely be very slight. There are many other more important factors that contribute to academic performance. Nevertheless, you create an experiment where 100 students chew gum while testing and 100 do not. Surely that's a big enough sample size to safely detect the effect of chewing gum, right? And that's where design calculations come in. They allow us to explore what we can learn from our data given various effect sizes. So if the true effect of chewing gum is a 0.01 point increase in test performance, then we won't learn much from a sample of 200 students. The sample is way too small. And not only that, but there's a chance that we'll overestimate the effect of chewing gum or come up with an estimate that's in the wrong direction.

Now you may be thinking, "I thought that's what power and sample size analysis was for. Wouldn't that tell me to use more than 200 people?" It would if you specified the true effect size as well as a reasonable standard deviation of test scores. But power and sample size analyses are focused on statistical significance, i.e., correctly rejecting some null hypothesis. They don't inform us of the potential for overestimating effects or estimating effects with the opposite sign.

Let's do a power and sample size analysis for our hypothetical study of chewing gum during tests and then follow it up with some design calculations to get a better understanding of what each does.

A power analysis tells us the probability that our experiment will correctly reject the null hypothesis, assuming a particular true effect size. We typically want at least 80% power. Let's say we decide we'll use a two-sample t-test to analyze our study on chewing gum. We'll get the mean test score from the group that chews gum and the mean test score from the group that does not. Then we'll compare the means. We'll reject the null hypothesis of no effect if we get a p-value below 0.05. The power.t.test() function in R allows us to quickly estimate power. We plug in n, the number of subjects per group. Then we plug in delta, the hypothesized true difference in means (i.e., the effect size), and sd, the hypothesized standard deviation for each of our groups. (A common approach to determining standard deviation for data yet to be collected is to divide the range of plausible values by 4. For example, assuming we score on a 100 point scale and no one scores below 50, (100 - 50)/4 = 12.5.) By default a significance level of 0.05 is assumed. Suppose we assume a standard deviation of 12.5 and an effect size of 5.


power.t.test(delta = 5, sd = 12.5, n = 100)


     Two-sample t test power calculation 

              n = 100
          delta = 5
             sd = 12.5
      sig.level = 0.05
          power = 0.8036466
    alternative = two.sided

NOTE: n is number in *each* group

This tells us we have 80% power. If the true effect of chewing gum really is to increase (or decrease) mean test scores by 5 points, then we have an 80% chance of correctly rejecting the null hypothesis with 100 subjects in each group. Obviously this power estimate is completely contingent upon our proposed effect size being true. Most people would be skeptical of the idea that simply chewing gum can raise (or lower) mean test scores by 5 points. But effect sizes are not always proposed based on subject matter expertise or common sense. They're often determined as the minimum effect that would be substantively important. In other words, they are not proposed because they're realistic but because they represent some minimum effect that one hopes to detect.
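Because the power estimate hinges on the assumed effect size, it can help to see how quickly power changes as that assumption shrinks. The following is a minimal sketch that reruns power.t.test() for a handful of effect sizes. The values in deltas and the name power_by_delta are illustrative choices of ours, not something taken from the paper.

```r
# Effect sizes to try (illustrative values; only 5 comes from the example above)
deltas <- c(0.5, 1, 2, 5)

# Power of a two-sample t-test with 100 subjects per group and sd = 12.5
power_by_delta <- sapply(deltas, function(d) {
  power.t.test(n = 100, delta = d, sd = 12.5)$power
})

# One row per assumed effect size
data.frame(delta = deltas, power = round(power_by_delta, 3))
```

The last row reproduces the 80% figure above. The other rows show what the same design offers if the true effect is smaller than hoped.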

Let's now proceed to design calculations. The idea here is to replicate, or simulate, many studies with an identical design. In this case we could randomly generate 10,000 estimates from a normal distribution whose mean is our selected effect size and whose standard deviation is the standard error of the estimate, 1.77. Why 1.77? That's the standard error of the difference in means for a t-test when both populations have a standard deviation of 12.5 and 100 subjects are sampled from each group.

$$ 12.5\sqrt{\frac{1}{100} + \frac{1}{100}} \approx 1.77 $$

We want to simulate estimated effects from many two-sample t-tests using an effect size of our choice and the assumed standard error of 1.77. (We could also sample from a t distribution, but in this case it doesn't really make a difference.)

Now as much as we might hope that chewing gum could add an expected 5 points to a test score, we should consider the consequences of the effect being much smaller. For example, what if the effect is actually about 0.5? Let's replicate the study 10,000 times assuming that effect size and our standard error.


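set.seed(12)
se <- sqrt(12.5^2 * 2 / 100) # standard error of the difference in means, about 1.77
# 10,000 replicated estimates: true effect 0.5, standard error rounded to 1.77
r.out <- rnorm(n = 1e4, mean = 0.5, sd = 1.77)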

r.out stores the results of our 10,000 replicated studies. We can think of each value in r.out as the estimated effect of chewing gum on test scores if we replicated the study 10,000 times with an assumed effect of 0.5 and a standard error of 1.77.
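If drawing the estimates directly from a normal distribution feels too abstract, one way to check the idea is to simulate the raw test scores instead. The sketch below generates two groups of 100 scores many times and records the difference in means each time. The control-group mean of 75 (base_mean), the helper name one_study(), and the choice of 5,000 replications are assumptions made purely for illustration.

```r
set.seed(1)
base_mean <- 75   # hypothetical mean score without gum (assumption)
effect <- 0.5     # assumed true effect of chewing gum
n <- 100          # subjects per group
sigma <- 12.5     # assumed standard deviation of test scores

# Difference in sample means from one simulated study
one_study <- function() {
  gum <- rnorm(n, mean = base_mean + effect, sd = sigma)
  no_gum <- rnorm(n, mean = base_mean, sd = sigma)
  mean(gum) - mean(no_gum)
}

diffs <- replicate(5000, one_study())

mean(diffs)                 # should be close to the assumed effect
sd(diffs)                   # should be close to the formula value below
sigma * sqrt(1/n + 1/n)     # standard error from the formula, about 1.77
```

The spread of the simulated differences should land near the standard error from the formula, which is why sampling the estimates directly is a reasonable shortcut.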

The first consequence we want to check is the probability that a statistically significant estimate has the wrong sign. Gelman and Carlin call this Type S error. Below we select the estimates that are more than 1.96 standard errors away from 0 (i.e., statistically significant at the 0.05 level) and find the proportion of those that are less than 0 (i.e., have the wrong sign).


mean(r.out[r.out < -se*1.96 | r.out > se*1.96] < 0)


[1] 0.2070707

If the true effect size is 0.5 and our study finds a significant effect, there's about a 21% chance that the estimated effect will have the wrong sign.

The next consequence to check is the ratio of the expected significant effect to the true effect size. Gelman and Carlin call this Type M error, or the exaggeration ratio. To calculate this we take the mean of the absolute value of the estimated effect and divide it by the hypothesized effect size. Again we only do this for those replications that achieve "significance."


mean(abs(r.out[r.out < -se*1.96 | r.out > se*1.96]))/0.5


[1] 8.298854

If the true effect size is 0.5 and our study finds a significant effect, we can expect the absolute value of the estimated effect to be about 8 times too high. Clearly there are serious consequences to consider if we overestimate effect sizes when planning a study.
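To get a feel for how the exaggeration ratio depends on the true effect, the same simulation can be repeated for several effect sizes. This is a rough sketch under the same assumptions as above (standard error of about 1.77, significance at 1.96 standard errors). The values in effects are illustrative choices of ours.

```r
set.seed(12)
se <- sqrt(12.5^2 * 2 / 100)   # standard error, about 1.77
effects <- c(0.5, 1, 2, 5)     # illustrative true effect sizes

# Exaggeration ratio among "significant" replications for each true effect
exaggeration <- sapply(effects, function(A) {
  est <- rnorm(n = 1e4, mean = A, sd = se)   # 10,000 replicated estimates
  sig <- abs(est) > 1.96 * se                # which ones reach significance
  mean(abs(est[sig])) / A
})

data.frame(effect = effects, exaggeration = round(exaggeration, 2))
```

The ratio should shrink toward 1 as the true effect approaches the value assumed in the original power analysis, since well-powered designs leave little room for exaggeration.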

We can also use our simulated data to estimate the power of our study if the effect size is 0.5. Simply find the proportion of values that are more than 1.96 standard errors away from 0.


mean(abs(r.out) > se*1.96)


[1] 0.0594

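About 6%. This is close to the power we would have estimated if we had assumed such a small effect size in the first place. The power.t.test() result is slightly lower because, by default (strict = FALSE), it counts only rejections in the direction of the true effect, whereas our simulation counts significant results in either direction: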


power.t.test(n = 100, delta = 0.5, sd = 12.5)


     Two-sample t test power calculation 

              n = 100
          delta = 0.5
             sd = 12.5
      sig.level = 0.05
          power = 0.04662574
    alternative = two.sided

NOTE: n is number in *each* group

But such small effects are not always considered in power and sample size calculations. As we mentioned earlier, the usual suggestion is to select the smallest effect that would be substantively important. But "substantively important" doesn't necessarily translate to a realistic effect size. Again the idea behind design calculations is not to estimate power in another way but to explore the potential for error in a study design if the true effect size is actually much smaller than expected.

In their paper, Gelman and Carlin provide a simple function called retrodesign() to estimate Type M and Type S error. Submitting the following code chunk will load the function into your R session. The A argument is for the true effect size and the s argument is for the standard error of the estimate. alpha=.05 is the default significance level, df=Inf is the default degrees of freedom for the t distribution, and n.sims=10000 is the default number of simulations.


retrodesign <- function(A, s, alpha=.05, df=Inf, n.sims=10000){
  z <- qt(1-alpha/2, df)
  p.hi <- 1 - pt(z-A/s, df)
  p.lo <- pt(-z-A/s, df)
  power <- p.hi + p.lo
  typeS <- p.lo/power
  estimate <- A + s*rt(n.sims,df)
  significant <- abs(estimate) > s*z
  exaggeration <- mean(abs(estimate)[significant])/A
  return(list(power=power, typeS=typeS, exaggeration=exaggeration))
}

Here are the results retrodesign() returned for our example study. We set df=198 since a two-sample t-test for two groups of 100 subjects has 198 degrees of freedom.


retrodesign(A = 0.5, s = se, df = 198)


$power
[1] 0.05899831

$typeS
[1] 0.2138758

$exaggeration
[1] 8.359884

The 2nd and 3rd elements are the Type S and Type M results, respectively, which basically match what we calculated "by hand." The first element is "power," which is the probability a replication achieves statistical significance. Since Type M error is based on simulated replicates, your result will differ slightly.

Let's walk through the function code. The first line returns the quantile, or the point on the x axis, below which lies 0.975 of the area under a t distribution with df degrees of freedom (when alpha = 0.05). When df=Inf, this is a fancy way to get 1.96. The next two lines calculate the probability that a replicated study is statistically significant in the positive direction (p.hi) and in the negative direction (p.lo). Let's say the estimated effect divided by s is our test statistic; then recall we declare significance at the 0.05 level if

$$ \left|\frac{\text{estimate}}{s}\right| > z $$ where z = 1.96.

If the true effect is A, the test statistic is distributed as A/s plus a draw from a t distribution centered at 0. The test statistic therefore exceeds z when that draw exceeds z - A/s, and falls below -z when that draw falls below -z - A/s. In other words, subtracting A/s from z and from -z gives us the adjusted critical values for our expected test statistic. The total probability of exceeding these critical values in either direction is our estimated power.
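The two tail probabilities can be computed directly for our chewing gum example. The sketch below uses the normal distribution, which corresponds to df=Inf in retrodesign(), so the results will differ slightly from a calculation with df=198. The variable names mirror those inside the function.

```r
A <- 0.5                       # assumed true effect
s <- sqrt(12.5^2 * 2 / 100)    # standard error, about 1.77
z <- qnorm(0.975)              # critical value, about 1.96

# estimate/s is distributed as A/s plus a standard normal draw Z, so
#   P(estimate/s >  z) = P(Z >  z - A/s)
#   P(estimate/s < -z) = P(Z < -z - A/s)
p.hi <- 1 - pnorm(z - A/s)     # significant with the correct (positive) sign
p.lo <- pnorm(-z - A/s)        # significant with the wrong (negative) sign

c(power = p.hi + p.lo, typeS = p.lo / (p.hi + p.lo))
```

Both numbers should be close to the simulation-based values we calculated "by hand" earlier.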

The next line calculates Type S error by dividing the probability of getting a significant negative test statistic (p.lo) by the total probability of rejecting the null hypothesis in either direction (power). Obviously the function assumes we'll be working with a positive effect size! The next line samples n.sims values (10,000 by default) from a t distribution. This is where we replicate the design. The function scales the values by multiplying them by our assumed standard error, s, and then shifts them A units (since a t distribution is centered at 0). The following line creates a vector of TRUE/FALSE called significant that identifies those replicates that exceed s*z in absolute value. The significant vector is used in the next line to extract the significant replicates and calculate the exaggeration ratio. Finally the desired calculations are returned in a list.
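If your hypothesized effect happens to be negative, one simple workaround is to pass its absolute value, since the calculations are symmetric around zero. The wrapper below is a hypothetical helper of ours, named retrodesign_abs() for illustration. It is not part of Gelman and Carlin's paper. The retrodesign() definition is repeated only so the example runs on its own.

```r
# retrodesign() as given in Gelman and Carlin (2014), repeated so this
# example is self-contained
retrodesign <- function(A, s, alpha=.05, df=Inf, n.sims=10000){
  z <- qt(1-alpha/2, df)
  p.hi <- 1 - pt(z-A/s, df)
  p.lo <- pt(-z-A/s, df)
  power <- p.hi + p.lo
  typeS <- p.lo/power
  estimate <- A + s*rt(n.sims,df)
  significant <- abs(estimate) > s*z
  exaggeration <- mean(abs(estimate)[significant])/A
  return(list(power=power, typeS=typeS, exaggeration=exaggeration))
}

# Hypothetical wrapper (our assumption, not from the paper): use the size of
# the effect so that a negative A gives the same answers as a positive one
retrodesign_abs <- function(A, ...) {
  stopifnot(A != 0)   # the exaggeration ratio is undefined for a zero effect
  retrodesign(A = abs(A), ...)
}

set.seed(12)
res <- retrodesign_abs(A = -0.5, s = sqrt(12.5^2 * 2 / 100), df = 198)
res
```

With a negative effect, Type S error should be read as the probability that a significant estimate is positive.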

The name of their function, "retrodesign," suggests retroactively performing a design analysis. Gelman and Carlin recommend running a design analysis "in particular when apparently strong evidence for nonnull effects has been found." In their paper they work through two such examples from published research. They note this is not the same as ill-advised post-hoc power analyses for two reasons: (1) They focus on the sign and direction of effects (rather than statistical significance), and (2) they advocate using effect sizes that are "determined from literature review or other information external to the data at hand."

We hope you'll read Gelman and Carlin's paper and consider their observations and suggestions. This blog post is only intended to address some of the statistical concepts assumed in the paper.

References

Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641–651. https://doi.org/10.1177/1745691614551642
R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Clay Ford
Statistical Research Consultant
University of Virginia Library
October 31, 2018

For questions or clarifications regarding this article, contact statlab@virginia.edu.
