StatLab Articles

Interpreting Log Transformations in a Linear Model

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the right due to Alaska, California, Texas and a few others.
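
As a rough illustration of that "spreading out and bringing together" effect, here is a minimal Python sketch. The values are hypothetical and are not the state areas from the article; the article itself works in R.

```python
import math

# Hypothetical right-skewed values (NOT the US state areas from the article):
# most are small and close together, a few are very large.
values = [1, 10, 100, 1000, 10000]

# Base-10 log of each value.
logged = [math.log10(v) for v in values]

# Gaps between neighbouring values, before and after the transformation.
gaps_raw = [b - a for a, b in zip(values, values[1:])]
gaps_log = [b - a for a, b in zip(logged, logged[1:])]

print(gaps_raw)  # gaps grow by a factor of 10 at each step on the raw scale
print(gaps_log)  # gaps are (approximately) equal on the log scale
```

On the raw scale the small values are clumped and the large ones are far apart; on the log scale the points are evenly spaced.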

R, linear regression, statistical methods, log transformations, diagnostic plots, Clay Ford

Getting Started with Matching Methods

Note: This article demonstrates how to use propensity scores for matching data. However, propensity scores have come under fire in recent years. In their 2019 article, Why Propensity Scores Should Not Be Used for Matching, King and Nielsen argue that propensity scores increase imbalance, inefficiency, model dependence, and bias. Others argue that while propensity scores may be sub-optimal, they can be useful in certain situations.

R, statistical methods, matching, propensity scores, Clay Ford

Getting Started with Moderated Mediation

In a previous post we demonstrated how to perform a basic mediation analysis. In this post we look at performing a moderated mediation analysis. The basic idea is that the effect of a mediator may depend on another variable called a "moderator". For example, in our mediation analysis post we hypothesized that self-esteem mediated the effect of student grades on student happiness. We illustrate this below with a path diagram.

R, statistical methods, mediation, Clay Ford

Getting started with Multivariate Multiple Regression

Multivariate multiple regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does.

R, statistical methods, MANOVA, multivariate multiple regression, Clay Ford

Visualizing the Effects of Proportional-Odds Logistic Regression

Proportional-odds logistic regression is often used to model an ordered categorical response. By "ordered", we mean categories that have a natural ordering, such as "Disagree", "Neutral", "Agree", or "Every day", "Some days", "Rarely", "Never". For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model.

R, effect plots, statistical methods, visualization, proportional odds logistic regression, Clay Ford

Getting Started with the purrr Package in R

If you're wondering what exactly the purrr package does, then this blog post is for you.

R, data wrangling, computational methods, purrr, Clay Ford

Working with Dates and Times in R Using the lubridate Package

Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on February 12, 2008, and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap year; February had 29 days.) What was the median time-in-study for all subjects?
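
The time-in-study question can be sketched with Python's standard library; the article itself uses R and lubridate, so this is only an analogous illustration. The first subject's dates come from the text above. The other two subjects are hypothetical, added only so that a median can be computed.

```python
from datetime import date
from statistics import median

# Subject from the text: entered February 12, 2008, exited November 4, 2009.
entry = date(2008, 2, 12)
exit_ = date(2009, 11, 4)

# Subtracting two dates gives a timedelta; the calendar arithmetic
# accounts for the 2008 leap day automatically.
days_in_study = (exit_ - entry).days
print(days_in_study)  # 631

# Hypothetical additional subjects (assumed for illustration only).
subjects = [
    (date(2008, 2, 12), date(2009, 11, 4)),
    (date(2008, 3, 1), date(2008, 12, 31)),
    (date(2008, 6, 15), date(2010, 1, 15)),
]
durations = [(stop - start).days for start, stop in subjects]
median_days = median(durations)
print(median_days)
```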

R, data wrangling, lubridate, Clay Ford

The Wilcoxon Rank Sum Test

The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as "is your data normal?" A "no" branch off this question will recommend a Wilcoxon test if you're comparing two groups of continuous measures.

So what is this Wilcoxon test? What makes it non-parametric? What does that even mean? And how do we implement it and interpret it? Those are some of the questions we aim to address in this post.

R, statistical methods, nonparametric statistics, bootstrap, Wilcoxon test, Clay Ford

Pairwise comparisons of proportions

Pairwise comparison means comparing all pairs of something. If I have three items, A, B and C, that means comparing A to B, A to C, and B to C. Given n items, I can determine the number of possible pairs using the binomial coefficient: $$ \frac{n!}{2!(n - 2)!} = \binom {n}{2}$$ Using the R statistical computing environment, we can use the choose() function to quickly calculate this.
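
For readers without R at hand, the same count can be sketched in Python, where `math.comb(n, 2)` plays the role of R's `choose(n, 2)`. The item labels A, B and C follow the example in the text; listing the pairs with `itertools.combinations` is an addition for illustration.

```python
from itertools import combinations
from math import comb

items = ["A", "B", "C"]

# Number of distinct pairs: n! / (2! * (n - 2)!)
n_pairs = comb(len(items), 2)

# The pairs themselves, in the order A-B, A-C, B-C.
pairs = list(combinations(items, 2))

print(n_pairs)  # 3
print(pairs)    # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The count grows quickly with n: ten items already give `comb(10, 2)`, or 45 pairs, which is why pairwise comparisons usually call for a multiple-comparison adjustment.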

R, statistical methods, multiple comparisons, Clay Ford

Stata Basics: foreach and forvalues

There are times we need to do some repetitive tasks in the process of data preparation, analysis, or presentation. For instance, we may need to compute a set of variables in the same manner, rename or create a series of variables, or repetitively recode values of a number of variables. In this post, we show a few simple example "loops" using the Stata commands foreach, local and forvalues to handle some common repetitive tasks.

Stata, data management, data wrangling, Yun Tai