StatLab Articles

Understanding Polychoric Correlation

Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent latent continuous variables that have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that they have linear correlation.
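To make the latent-variable idea concrete, here is a minimal base R sketch. It assumes a latent correlation of 0.6 and arbitrary cut points (both chosen purely for illustration), simulates two standard normal variables, and bins them into ordered categories. It does not estimate the polychoric correlation itself; it only shows that a Pearson correlation computed on the category codes tends to fall short of the latent correlation, which is the gap a polychoric estimator (for example, `polychor()` in the {polycor} package) is meant to close.

```r
set.seed(1)
n   <- 5000
rho <- 0.6  # assumed latent correlation (illustrative value)

# Two standard normal variables with correlation rho
z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Observe only ordered categories (cut points are arbitrary assumptions)
x <- cut(z1, breaks = c(-Inf, -1, 0, 1, Inf), labels = FALSE)
y <- cut(z2, breaks = c(-Inf, -0.5, 0.5, Inf), labels = FALSE)

table(x, y)   # the contingency table an analyst would actually see
cor(z1, z2)   # correlation of the latent variables, close to rho
cor(x, y)     # Pearson correlation of the category codes, attenuated
```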

R, simulation, statistical methods, polychoric correlation, maximum likelihood, Clay Ford

Power and Sample Size Calculations for Ordered Categorical Data

In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.
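As a rough illustration of this kind of calculation, the sketch below implements a widely cited closed-form approximation for proportional odds comparisons of two equally sized groups, usually attributed to Whitehead (1993). The helper name `po_total_n`, the category probabilities, and the odds ratio are all hypothetical; this is not the {Hmisc} implementation, though `posamsize()` in that package is documented as following the same approach.

```r
# Hypothetical helper: approximate total sample size (both groups combined)
# for a two-group proportional odds comparison with equal allocation.
# p_avg: category probabilities averaged over the two groups.
po_total_n <- function(p_avg, odds_ratio, alpha = 0.05, power = 0.80) {
  stopifnot(abs(sum(p_avg) - 1) < 1e-8, odds_ratio > 0, odds_ratio != 1)
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  12 * z^2 / (log(odds_ratio)^2 * (1 - sum(p_avg^3)))
}

p_avg   <- c(0.1, 0.2, 0.4, 0.3)          # made-up outcome distribution
n_total <- po_total_n(p_avg, odds_ratio = 2)
ceiling(n_total)                          # round up to whole subjects
```

The approximation assumes the proportional odds model holds and that the odds ratio is not too far from 1, so a simulation check is a sensible complement.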

R, statistical methods, simulation, power analysis, Clay Ford

Bootstrapped Association Rule Mining in R

Association rule mining is a machine learning technique designed to uncover associations between categorical variables in data. It produces association rules which reflect how frequently categories are found together. For instance, a common application of association rule mining is the “Frequently Bought Together” section of the checkout page from an online store. Through mining across transactions from that store, items that are frequently bought together have been identified (e.g., shaving cream and razors).
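The usual rule metrics can be computed by hand on a toy example. The sketch below uses five made-up transactions and base R only (no rule-mining package) to evaluate the single rule {shaving cream} => {razor}; the helper `has()` is a hypothetical convenience function.

```r
# Toy transactions (made-up data for illustration)
transactions <- list(
  c("razor", "shaving cream"),
  c("razor", "shaving cream", "soap"),
  c("soap"),
  c("razor", "soap"),
  c("shaving cream", "razor")
)

# Hypothetical helper: does each transaction contain the item?
has <- function(item) {
  vapply(transactions, function(basket) item %in% basket, logical(1))
}

lhs <- has("shaving cream")
rhs <- has("razor")

support    <- mean(lhs & rhs)        # share of baskets with both items
confidence <- support / mean(lhs)    # P(razor | shaving cream)
lift       <- confidence / mean(rhs) # confidence relative to razor's base rate

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 indicates the two items appear together more often than their individual frequencies would suggest.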

R, association rule mining, bootstrap, statistical methods, Laura Jamison

Spatial R: Using the sf package

The spread of disease, politics, the movement of animals, regions vulnerable to earthquakes and where people are most likely to buy frosted flakes are all informed by spatial data. Spatial data links information to specific positions on earth and can tell us about patterns that play out from location to location. We can use spatial data to uncover processes over space and tackle complex problems.

R, visualization, spatial data, Lauren Brideau

Assessing Model Assumptions with Lineup Plots

When fitting a linear model we make two assumptions about the distribution of residuals:

statistical methods, diagnostic plots, qqplot, visualization, R, Clay Ford

Bootstrapping Residuals for Linear Models with Heteroskedastic Errors Invites Trouble

Bootstrapping—resampling data with replacement and recomputing quantities of interest—lets analysts approximate sampling distributions for complex estimators and frees them of the reliably unmet assumptions of traditional, parametric inferential statistics. It’s an elegant, intuitive approach in which an analyst exploits the (often) parallel resample-to-sample and sample-to-population relationships to understand uncertainty in an estimate.
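The resample-and-recompute loop is short to write in base R. The sketch below is a case-resampling (pairs) bootstrap for a regression slope, not a residual bootstrap; the simulated data, with error variance that grows with `x`, and the number of resamples are assumptions made for illustration.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
# Simulated heteroskedastic errors: the error SD grows with x (assumed form)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * (1 + x))
dat <- data.frame(x, y)

B <- 1000  # number of bootstrap resamples
boot_slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)           # resample rows with replacement
  coef(lm(y ~ x, data = dat[idx, ]))[["x"]]  # recompute the slope
})

boot_se <- sd(boot_slopes)                          # bootstrap standard error
boot_ci <- quantile(boot_slopes, c(0.025, 0.975))   # percentile interval
boot_se
boot_ci
```

Resampling whole rows keeps each response paired with its own predictor value, so the relationship between `x` and the error variance is carried into every resample.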

R, statistical methods, simulation, bootstrap, nonparametric statistics, Jacob Goldstein-Greenwood

Power and Sample Size Estimation for Logistic Regression

In this article we demonstrate how to use simulation in R to estimate power and sample size for proposed logistic regression models that feature two binary predictors and their interaction.

Recall that logistic regression attempts to model the probability of an event conditional on the values of predictor variables. If we have a binary response, y, and two predictors, x and z, that interact, we specify the logistic regression model as follows:
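A model of this kind is commonly written on the log-odds scale as an intercept plus a term for x, a term for z, and a term for their product. The following is a minimal simulation sketch under that assumption: the function name `sim_power`, the coefficient values, the 50/50 predictor prevalences, and the sample size are all hypothetical choices, and power is estimated for the interaction term only.

```r
set.seed(2024)

# Hypothetical helper: estimated power to detect the x:z interaction
sim_power <- function(n, b0, b1, b2, b3, n_sims = 500, alpha = 0.05) {
  p_vals <- replicate(n_sims, {
    x <- rbinom(n, 1, 0.5)                          # binary predictor x
    z <- rbinom(n, 1, 0.5)                          # binary predictor z
    p <- plogis(b0 + b1 * x + b2 * z + b3 * x * z)  # inverse logit
    y <- rbinom(n, 1, p)                            # binary response
    fit <- glm(y ~ x * z, family = binomial)
    coef(summary(fit))["x:z", "Pr(>|z|)"]           # Wald p-value
  })
  mean(p_vals < alpha)  # proportion of simulations rejecting the null
}

power_est <- sim_power(n = 400, b0 = -1, b1 = 0.3, b2 = 0.3, b3 = 1)
power_est
```

Rerunning `sim_power()` over a grid of `n` values gives a rough power curve from which a sample size can be read off.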

R, logistic regression, power analysis, simulation, statistical methods, Clay Ford

Understanding Somers' D

When it comes to summarizing the association between two numeric variables, we can use Pearson or Spearman correlation. When accompanied by a scatterplot, they allow us to quantify association on a scale from -1 to 1. But what if we have two ordered categorical variables with just a few levels? How can we summarize their association? One approach is to calculate Somers’ Delta, or Somers’ D for short.
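For small data sets, Somers’ D can be computed directly from pair counts. The sketch below assumes the usual asymmetric definition, D of y with respect to x: concordant pairs minus discordant pairs, divided by the number of pairs not tied on x. The function name `somers_d` and the data are made up for illustration.

```r
# Hypothetical helper: Somers' D of y with respect to x
somers_d <- function(x, y) {
  dx <- sign(outer(x, x, "-"))     # pairwise ordering on x
  dy <- sign(outer(y, y, "-"))     # pairwise ordering on y
  conc_minus_disc <- sum(dx * dy)  # +1 concordant, -1 discordant, 0 tied
  not_tied_on_x   <- sum(dx != 0)  # pairs that differ on x
  # Both sums count every pair twice, so the factor of 2 cancels
  conc_minus_disc / not_tied_on_x
}

x <- c(1, 1, 2, 2, 3, 3)  # ordered categories coded as integers
y <- c(1, 2, 1, 3, 2, 3)
somers_d(x, y)
```

Here there are 7 concordant and 2 discordant pairs among the 12 pairs that differ on `x`, so the result is 5/12. The `outer()` approach builds n-by-n matrices, so it is only practical for modest sample sizes.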

R, statistical methods, AUC, logistic regression, Clay Ford

Starting with Non-Metric Multidimensional Scaling (NMDS)

Real-world problems and data are complex, and there are often situations where we want to look at relationships among more than one variable simultaneously. We call these analyses multivariate statistics. Non-metric multidimensional scaling, or NMDS, is one multivariate technique that allows us to visualize these complex relationships in fewer dimensions. In other words, NMDS takes complex, multivariate data and represents the relationships in a way that is easier to interpret.

R, NMDS, statistical methods, Non-Metric Multidimensional Scaling, Lauren Brideau

Graphical Linearity Assessment for One- and Two-Predictor Logistic Regressions

Logistic regression is a flexible tool for modeling binary outcomes. A logistic regression describes the log odds of the probability, \(P\), of 1/“yes”/“success” (versus 0/“no”/“failure”) as a linear combination of predictors:

\[\log\left(\frac{P}{1-P}\right) = B_0 + B_1X_1 + B_2X_2 + \dots + B_kX_k\]
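The relationship between the linear predictor and \(P\) can be checked numerically. In this small sketch the coefficients and predictor values are made up; `plogis()` is base R's inverse logit, which maps log odds back to probabilities.

```r
# Made-up coefficients and predictor values for a two-predictor model
b0 <- -1
b1 <- 0.8
b2 <- -0.5
X1 <- c(0, 1, 2)
X2 <- c(1, 0, 3)

log_odds <- b0 + b1 * X1 + b2 * X2  # linear on the log-odds scale
P <- plogis(log_odds)               # inverse logit: 1 / (1 + exp(-log_odds))

P                                      # probabilities, always between 0 and 1
all.equal(log(P / (1 - P)), log_odds)  # recovering the log odds from P
```

The model is linear in the log odds, not in \(P\) itself, which is why linearity is assessed on the log-odds scale.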

R, logistic regression, simulation, model assessment, statistical methods, Jacob Goldstein-Greenwood