StatLab Articles

Understanding Dunnett’s Test

Multiple comparison procedures are fundamental in experimental research. Dunnett’s test, which compares multiple treatments to a single control, is particularly common in laboratory studies. When multiple comparisons are made, proper statistical methods are essential to control false positives. This article demonstrates how the number of comparisons affects p-values in Dunnett’s test, implements the procedure in R, and discusses strategies to improve statistical power.
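Base R has no built-in Dunnett procedure (the article's implementation presumably relies on a package such as multcomp or DescTools), but the core phenomenon — the same raw p-value growing as the number of comparisons grows — can be sketched with the simpler, more conservative Bonferroni adjustment in base R:

```r
# Illustration only: Bonferroni is more conservative than Dunnett's
# adjustment, but shows the same effect of the number of comparisons.
p_raw <- 0.02

# Same raw p-value, adjusted for 2 vs. 5 treatment-to-control comparisons
p.adjust(p_raw, method = "bonferroni", n = 2)  # 0.04
p.adjust(p_raw, method = "bonferroni", n = 5)  # 0.10
```

With five comparisons the adjusted p-value crosses the conventional 0.05 threshold even though the raw p-value did not change — exactly the kind of power loss the article's strategies aim to mitigate.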

R, statistical methods, Dunnett's Test, multiple comparisons, FWER, Hyeseon Seo

Mixed Effect versus Fixed Effect Models

When faced with analyzing clustered or repeated measures data, some researchers and analysts turn to mixed effect modeling. Others, faced with the same situation, turn to fixed effect modeling. Which one you choose is usually dictated by your field of study and statistical education. Those coming from fields like Psychology, Ecology, and Education often choose mixed effect modeling, while those coming from fields like Economics and Political Science typically choose fixed effect modeling.
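The fixed effect approach can be sketched in base R by including a dummy variable for each cluster; the mixed effect alternative, noted in a comment below, would instead come from a package such as lme4 (an assumption — the article's exact code may differ):

```r
# Simulated clustered data: 20 clusters of 5 observations,
# each cluster shifted by its own intercept
set.seed(1)
id <- rep(1:20, each = 5)
x  <- rnorm(100)
y  <- 1 + 0.5 * x + rep(rnorm(20), each = 5) + rnorm(100)

# Fixed effect approach: cluster indicators absorb cluster-level heterogeneity
fe <- lm(y ~ x + factor(id))
coef(fe)["x"]   # slope estimate, purged of between-cluster differences

# A mixed effect approach would model the cluster intercepts as random
# draws instead, e.g. lme4::lmer(y ~ x + (1 | id)) -- not run here.
```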

R, statistical methods, mixed effect models, fixed effect models, GLS, Clay Ford

Making Maps with Raster Data in R

To work with raster data, we will be using a few different packages. If you do not have one or more of the packages, you can install them using install.packages(). After installing packages, you can load them using library().
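The install-then-load pattern described above looks like this; the package names here are placeholders, since the teaser does not name the packages the article uses:

```r
# Hypothetical package list -- substitute the packages the article uses
pkgs <- c("terra", "sf")

# Install any package that is missing, then load each one
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
  library(p, character.only = TRUE)
}
```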

R, visualization, spatial data, GIS, Lauren Brideau

Distribution-Free Confidence Intervals for Percentiles

Percentiles are order statistics. This means they’re determined by ordering observations from smallest to largest and then finding the value below which some percentage of the data lie. The most common percentile is the median. It’s simply the middle value (or the average of the two middle values if there are an even number of observations). Fifty percent of the data lie below the median. Other percentiles frequently of interest are the 25th and 75th percentiles. These are the data values below which lie 25 and 75 percent of the data, respectively.
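The definitions above map directly onto base R's `median()` and `quantile()` functions. A minimal sketch:

```r
# Nine observations; ordered they are 3, 5, 7, 8, 9, 11, 12, 15, 20
x <- c(12, 5, 8, 20, 3, 15, 9, 11, 7)

median(x)                           # 50th percentile: the middle value, 9
quantile(x, probs = c(0.25, 0.75))  # 25th and 75th percentiles: 7 and 12

# With an even number of observations, the median averages the middle two
median(c(1, 2, 3, 4))               # (2 + 3) / 2 = 2.5
```

Note that `quantile()` offers several interpolation rules for values that fall between observations; the results above use R's default (type 7).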

R, statistical methods, confidence intervals, bootstrap, Clay Ford

Getting Started with Multiple Imputation for Longitudinal Data

Multiple Imputation (MI) is a method for dealing with missing data in a statistical analysis. The general idea of MI is to simulate values for missing data points using the data we have on hand, generating multiple new sets of complete data. We then run our proposed analysis on all the complete data sets and combine the results to obtain overall estimates. The end product is an analysis with proper standard errors and unbiased estimates.
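The combining step — Rubin's rules — can be sketched in base R. The numbers below are toy values standing in for a coefficient estimate and its variance from each of m imputed-data fits:

```r
# Toy values: the same model fit on m = 5 imputed data sets,
# saving one coefficient estimate and its squared standard error per fit
est <- c(1.9, 2.1, 2.0, 2.2, 1.8)       # estimate per imputation
u   <- c(0.04, 0.05, 0.04, 0.06, 0.05)  # within-imputation variance per fit
m   <- length(est)

q_bar <- mean(est)            # pooled estimate
w     <- mean(u)              # average within-imputation variance
b     <- var(est)             # between-imputation variance
t_var <- w + (1 + 1/m) * b    # total variance of the pooled estimate

c(estimate = q_bar, se = sqrt(t_var))
```

The between-imputation term is what inflates the standard error to reflect the uncertainty introduced by the missing data; in practice a package such as mice automates both the imputation and this pooling.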

multiple imputation, simulation, mixed effect models, R, statistical methods, Clay Ford

Addressing Multicollinearity

When a linear model has two or more highly correlated predictor variables, it is often said to suffer from multicollinearity. The danger of multicollinearity is that estimated regression coefficients can be highly uncertain and possibly nonsensical (e.g., getting a negative coefficient that common sense dictates should be positive). Multicollinearity is usually detected using variance inflation factors (VIF).
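A VIF is simply 1/(1 - R²) from regressing one predictor on all the others, so it can be computed from scratch in base R (in practice `car::vif()` does this for you):

```r
# Simulate two highly correlated predictors plus one independent predictor
set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(200)

# VIF for a predictor: 1 / (1 - R^2) from regressing it on the others
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)
vif_x3 <- 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)

vif_x1  # very large: x1 is nearly determined by x2
vif_x3  # close to 1: x3 is unrelated to the other predictors
```

A common rule of thumb flags VIFs above 5 or 10 as signs of problematic multicollinearity.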

R, statistical methods, multicollinearity, ridge regression, PCA, Clay Ford

Correlation: Pearson, Spearman, and Kendall's tau

Correlation is a widely used method that helps us explore how two variables change together, providing insight into whether a relationship exists between them. For example, imagine we want to understand if there is an association between time spent studying and exam scores. Or, maybe we think that people who eat more cookies are happier. Or, we want to see if people who live near a park hear more birds singing in the morning. Correlation is a valuable tool for understanding the extent to which variables are associated.
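All three coefficients named in the title are available through base R's `cor()`. A quick sketch with a perfectly monotonic but nonlinear relationship shows how they differ:

```r
# y increases with x, but not linearly
x <- 1:10
y <- x^3

cor(x, y, method = "pearson")   # less than 1: the relationship is not linear
cor(x, y, method = "spearman")  # 1: the ranks agree perfectly
cor(x, y, method = "kendall")   # 1: every pair of points is concordant
```

Pearson measures linear association, while Spearman and Kendall measure monotonic association through ranks — which is why the latter two return exactly 1 here.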

R, correlation, statistical methods, spearman correlation, kendall tau, Lauren Brideau

Testing for Significance with Permutation-based Methods

When we perform statistical tests, we often want to obtain a p-value, which describes the probability of obtaining test results at least as extreme as the observed result, assuming that the null hypothesis is true. In other words, if the null is true, how likely would it be to observe an effect as large as (or larger than) the observed effect by chance and chance alone? Common statistical approaches such as t-tests, ANOVAs, and linear regression make assumptions about the data or the errors.
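A permutation test sidesteps those distributional assumptions by building the null distribution directly from the data. A minimal base-R sketch for a difference in group means:

```r
# Two small groups with a true mean difference of 1
set.seed(2)
g1 <- rnorm(15, mean = 0)
g2 <- rnorm(15, mean = 1)
obs <- mean(g2) - mean(g1)

# Under the null, group labels are exchangeable: shuffle them many times
# and recompute the mean difference each time
pooled <- c(g1, g2)
perm <- replicate(5000, {
  idx <- sample(30, 15)
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided p-value: share of shuffled differences at least as extreme
# as the observed difference
p_val <- mean(abs(perm) >= abs(obs))
p_val
```

No normality assumption is needed; the p-value comes entirely from the empirical distribution of label shuffles.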

R, permutation, statistical methods, Ethan Kadiyala

Getting Started with Tweedie Models

Tweedie models are a special Generalized Linear Model (GLM) that can be useful when we want to model an outcome that sometimes equals 0 but is otherwise positive and continuous. Some examples include daily precipitation data and annual income. Data like this can have zeroes, often lots of zeroes, in addition to positive values. When modeling data of this nature, we may want to ensure our model does not predict negative values. We may also want to log-transform this data without dropping the zeroes. Tweedie models allow us to do both.
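For power parameters between 1 and 2, a Tweedie random variable is a compound Poisson-gamma: a Poisson number of gamma-distributed "events" summed together. Simulating that mechanism in base R shows how exact zeros and positive continuous values arise from one distribution:

```r
# Zero events give an exact zero; otherwise the outcome is the sum of
# gamma draws, so it is positive and continuous
set.seed(3)
n <- 1000
events <- rpois(n, lambda = 1.5)   # e.g., number of rain events in a day
y <- sapply(events, function(k) {
  if (k == 0) 0 else sum(rgamma(k, shape = 2, rate = 1))
})

mean(y == 0)   # proportion of exact zeros: about exp(-1.5), roughly 22%
range(y)       # zeros plus strictly positive continuous values
```

Fitting an actual Tweedie GLM requires a package (for example, statmod supplies a Tweedie family for `glm()`); the simulation above only illustrates the data-generating story.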

R, statistical methods, tweedie, simulation, zero-inflated models, Clay Ford

Getting Started with Multilevel Regression and Poststratification

Multilevel Regression and Poststratification (MRP) is a method of adjusting model estimates for non-response. By “non-response” we mean under-sampled groups in a population. For example, imagine conducting a phone survey to estimate the percentage of a population that approves of an elected official. It’s likely that certain age groups in the population will be under-sampled because they’re less likely to answer a call from an unfamiliar number. MRP allows us to analyze the data and adjust the estimate by taking the under-sampled groups into account.
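The poststratification step can be sketched in base R with hypothetical numbers. In full MRP the group-level estimates would come from a multilevel model (e.g., fit with lme4 or a Bayesian package), but the reweighting itself is just a population-share-weighted average:

```r
# Hypothetical survey: approval estimated within three age groups, where
# the youngest group was under-sampled relative to its population share
approval_by_group <- c(young = 0.60, middle = 0.50, old = 0.40)
sample_share      <- c(young = 0.10, middle = 0.45, old = 0.45)
pop_share         <- c(young = 0.30, middle = 0.40, old = 0.30)

# Raw estimate, weighted by who actually responded
raw_est  <- sum(approval_by_group * sample_share)  # 0.465
# Poststratified estimate, reweighted to known population shares
post_est <- sum(approval_by_group * pop_share)     # 0.50

c(raw = raw_est, poststratified = post_est)
```

Because the under-sampled young group approves at a higher rate, reweighting to the true population shares pulls the estimate upward.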

R, statistical methods, mixed effect models, Bayesian methods, simulation, Clay Ford