StatLab Articles

Correlation: Pearson, Spearman, and Kendall's tau

Correlation is a widely used method that helps us explore how two variables change together, providing insight into whether a relationship exists between them. For example, imagine we want to understand if there is an association between time spent studying and exam scores. Or, maybe we think that people who eat more cookies are happier. Or, we want to see if people who live near a park hear more birds singing in the morning. Correlation is a valuable tool for understanding the extent to which variables are associated.
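To make the three measures in the title concrete, here is a minimal sketch in plain Python (standard library only; the article itself is tagged R). The `hours` and `scores` values are made-up illustration data, and the helper names `pearson`, `ranks`, `spearman`, and `kendall_tau_a` are our own. The Kendall version shown is tau-a, which does not correct for ties.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / all pairs."""
    s = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
        pairs += 1
    return s / pairs


# Hypothetical data: hours studied and exam scores for six students.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 58, 71, 80, 95]

print("Pearson: ", pearson(hours, scores))
print("Spearman:", spearman(hours, scores))
print("Kendall: ", kendall_tau_a(hours, scores))
```

Pearson measures linear association, while Spearman and Kendall depend only on the ordering of the values, so they are less sensitive to outliers and to monotonic but non-linear relationships.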

R, correlation, statistical methods, spearman correlation, kendall tau, Lauren Brideau

Testing for Significance with Permutation-based Methods

When we perform statistical tests, we often want to obtain a p-value, which describes the probability of obtaining test results at least as extreme as the observed result, assuming that the null hypothesis is true. In other words, if the null is true, how likely would it be to observe an effect as large as (or larger than) the observed effect from chance alone? Common statistical approaches such as t-tests, ANOVAs, and linear regression make assumptions about the data or the errors.
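As a rough illustration of the permutation idea named in the title, the sketch below estimates a two-sided p-value for a difference in group means by repeatedly shuffling the group labels. It is written in plain Python rather than the article's R, the function name `permutation_p_value` and the example data are hypothetical, and the `+ 1` in the numerator and denominator is one common convention that keeps the estimate from being exactly zero.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=1):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point noise in ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups.
control = [4.1, 5.0, 4.7, 5.3, 4.9, 5.1]
treated = [5.6, 6.2, 5.9, 6.4, 5.8, 6.0]
print(permutation_p_value(control, treated))
```

The only assumption this sketch relies on is that the observations are exchangeable between groups under the null hypothesis; no distributional form is assumed for the data or the errors.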

R, permutation, statistical methods, Ethan Kadiyala

Getting Started with Tweedie Models

Tweedie models are a special type of Generalized Linear Model (GLM) that can be useful when we want to model an outcome that sometimes equals 0 but is otherwise positive and continuous. Some examples include daily precipitation data and annual income. Data like this can have zeroes, often lots of zeroes, in addition to positive values. When modeling data of this nature, we may want to ensure our model does not predict negative values. We may also want to log-transform this data without dropping the zeroes. Tweedie models allow us to do both.
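To see what such data can look like, here is a small simulation sketch in plain Python (the article itself uses R). It relies on the fact that a Tweedie distribution with power parameter between 1 and 2 is a compound Poisson-gamma distribution: a Poisson number of events, each contributing a gamma-distributed amount. The function names and parameter values are our own illustrative choices.

```python
import math
import random


def rpoisson(lam, rng):
    """Draw one Poisson(lam) count (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def rcompound_poisson_gamma(lam, shape, scale, rng):
    """Sum of a Poisson number of gamma amounts; exactly 0 when no events occur."""
    n_events = rpoisson(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_events))


rng = random.Random(42)
# Hypothetical "daily precipitation": on average one rain event per day.
draws = [rcompound_poisson_gamma(1.0, 2.0, 3.0, rng) for _ in range(5000)]

share_zero = sum(d == 0 for d in draws) / len(draws)
print("share of exact zeroes:", share_zero)
print("smallest positive value:", min(d for d in draws if d > 0))
```

With an event rate of 1, the probability of an exact zero is exp(-1), roughly 0.37, so the simulated outcome has a large spike at zero alongside a continuous positive part and no negative values.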

R, statistical methods, tweedie, simulation, zero-inflated models, Clay Ford

Getting Started with Multilevel Regression and Poststratification

Multilevel Regression and Poststratification (MRP) is a method of adjusting model estimates for non-response. By “non-response” we mean under-sampled groups in a population. For example, imagine conducting a phone survey to estimate the percentage of a population that approves of an elected official. It’s likely that certain age groups in the population will be under-sampled because they’re less likely to answer a call from an unfamiliar number. MRP allows us to analyze the data and adjust the estimate by taking the under-sampled groups into account.
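The poststratification half of MRP is a population-weighted average of group-level estimates. The sketch below, in plain Python rather than the article's R, shows only that weighting step; in a full MRP analysis the group estimates would come from a multilevel regression model, whereas here every number (approval rates, sample sizes, population counts) and the function name `poststratify` are made up for illustration.

```python
def poststratify(cell_estimates, cell_counts):
    """Weighted average of per-group estimates, weighted by group counts."""
    total = sum(cell_counts.values())
    return sum(cell_estimates[g] * cell_counts[g] for g in cell_counts) / total


# Hypothetical approval rate estimated within each age group.
approval = {"18-34": 0.60, "35-64": 0.50, "65+": 0.40}

# Hypothetical survey: younger respondents are under-sampled...
sample_counts = {"18-34": 50, "35-64": 250, "65+": 200}
# ...relative to their share of the actual population.
population_counts = {"18-34": 3000, "35-64": 5000, "65+": 2000}

raw_estimate = poststratify(approval, sample_counts)
adjusted_estimate = poststratify(approval, population_counts)

print("weighted by sample:    ", round(raw_estimate, 3))
print("weighted by population:", round(adjusted_estimate, 3))
```

In this made-up example the sample-weighted estimate is 0.47 while the population-weighted estimate is 0.51, because the under-sampled youngest group is also the most approving.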

R, statistical methods, mixed effect models, Bayesian methods, simulation, Clay Ford

Using Wavelets to Analyze Time Series Data

Time series data can contain a lot of information. Often, it is difficult to visually detect patterns in a time series and even harder to quantify patterns. How can we analyze our time series data to understand its underlying signals and how these signals are changing through time?

R, time series analysis, statistical methods, wavelets, Ethan Kadiyala

Understanding t-tests, ANOVA, and MANOVA

Imagine you love baking cookies and invite your friends over for a cookie party. You want to know how many cookies you should make so you ask your friends about how many cookies they think they will each eat. They respond:

Francesca: 5 cookies
Sydney: 3 cookies
Noelle: 1 cookie
James: 7 cookies
Brooke: 2 cookies

We take these numbers and add all of them together to estimate that about 18 cookies will be eaten in total at our party.

\[ 5 + 3 + 1 + 7 + 2 = 18 \text{ cookies total} \]

R, ANOVA, MANOVA, t-test, statistical methods, Lauren Brideau

Understanding Polychoric Correlation

Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent a latent continuous variable, where the two latent variables have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that they are linearly correlated.

R, simulation, statistical methods, polychoric correlation, maximum likelihood, Clay Ford

Power and Sample Size Calculations for Ordered Categorical Data

In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.

R, statistical methods, simulation, power analysis, Clay Ford

Bootstrapped Association Rule Mining in R

Association rule mining is a machine learning technique designed to uncover associations between categorical variables in data. It produces association rules which reflect how frequently categories are found together. For instance, a common application of association rule mining is the “Frequently Bought Together” section of the checkout page from an online store. Through mining across transactions from that store, items that are frequently bought together have been identified (e.g., shaving cream and razors).

R, association rule mining, bootstrap, statistical methods, Laura Jamison

Spatial R: Using the sf package

The spread of disease, politics, the movement of animals, regions vulnerable to earthquakes and where people are most likely to buy frosted flakes are all informed by spatial data. Spatial data links information to specific positions on earth and can tell us about patterns that play out from location to location. We can use spatial data to uncover processes over space and tackle complex problems.

R, visualization, spatial data, Lauren Brideau
