StatLab Articles

Understanding Somers' D

When it comes to summarizing the association between two numeric variables, we can use Pearson or Spearman correlation. Accompanied by a scatterplot, they allow us to quantify association on a scale from -1 to 1. But what if we have two ordered categorical variables with just a few levels? How can we summarize their association? One approach is to calculate Somers’ Delta, or Somers’ D for short.
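As a rough illustration of the idea, Somers’ D of y given x can be computed by counting concordant and discordant pairs and dividing their difference by the number of pairs that are not tied on x. The sketch below uses Python and made-up ordinal codes (`dose` and `response` are hypothetical names and values, not data from the article).

```python
from itertools import combinations

def somers_d(x, y):
    """Somers' D of y given x: (concordant - discordant) / pairs not tied on x."""
    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x do not enter the denominator
        untied_x += 1
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1   # both variables move in the same direction
        elif product < 0:
            discordant += 1   # the variables move in opposite directions
        # product == 0 means a tie on y only: counted in the denominator alone
    return (concordant - discordant) / untied_x

# Hypothetical ordinal codes (1 = low, 2 = medium, 3 = high)
dose = [1, 1, 2, 2, 3, 3]
response = [1, 2, 2, 3, 3, 3]
print(somers_d(dose, response))  # 0.75
```

The function assumes at least one pair differs on x; otherwise the denominator is zero.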

R, statistical methods, AUC, logistic regression, Clay Ford

Starting with Non-Metric Multidimensional Scaling (NMDS)

Real-world problems and data are complex, and there are often situations where we want to simultaneously look at relationships between more than one variable. We call these analyses multivariate statistics. Non-metric multidimensional scaling, or NMDS, is one multivariate technique that allows us to visualize these complex relationships in fewer dimensions. In other words, NMDS takes complex, multivariate data and represents the relationships in a way that is easier to interpret.

R, NMDS, statistical methods, Non-Metric Multidimensional Scaling, Lauren Brideau

Graphical Linearity Assessment for One- and Two-Predictor Logistic Regressions

Logistic regression is a flexible tool for modeling binary outcomes. A logistic regression models the probability, \(P\), of 1/“yes”/“success” (versus 0/“no”/“failure”) by expressing the log odds as a linear combination of predictors:

\[\log\left(\frac{P}{1-P}\right) = B_0 + B_1X_1 + B_2X_2 + \cdots + B_kX_k\]
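To make the formula concrete, here is a minimal Python sketch that computes the log odds as a linear combination and inverts the logit to recover \(P\). The coefficient and predictor values are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def predicted_probability(intercept, coefs, xs):
    """Invert the logit: P = 1 / (1 + exp(-(B_0 + B_1*X_1 + ... + B_k*X_k)))."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, xs))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical values: B_0 = -1, B_1 = 0.5, B_2 = 0.25, X_1 = 2, X_2 = 0
# log odds = -1 + 0.5*2 + 0.25*0 = 0, so P = 0.5
p = predicted_probability(-1.0, [0.5, 0.25], [2.0, 0.0])
print(p)  # 0.5
```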

R, logistic regression, simulation, model assessment, Jacob Goldstein-Greenwood

Why Preallocate Memory in R Loops?

In R, “growing” an object—extending an atomic vector one element at a time; adding elements one by one to the end of a list; etc.—is an easy way to elicit a mild admonishment from someone reviewing or revising your code. Growing most frequently occurs in the context of for loops: A loop computes a value (or set of values) on each iteration, and it then appends the value(s) to an existing object.

R, simulation, preallocation, optimization, Jacob Goldstein-Greenwood

Regression to the Mean and Change Score Analysis

Regression to the mean refers to a phenomenon where natural variation within an individual can mistakenly appear as meaningful change over time. To illustrate, imagine a patient who comes in for a regular check-up and is found to have high blood sugar levels. This may be cause for concern, and the doctor recommends several dietary adjustments and schedules a follow-up for the next week. During the follow-up visit, the patient’s blood sugar levels have seemingly returned to a normal range.
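The scenario above can be sketched with a small simulation. The Python example below assumes each patient has a stable true level and that each visit adds independent measurement noise; the means, standard deviations, and the cutoff of 120 are all hypothetical. No treatment effect is simulated, yet patients selected for a high first reading tend to have a lower second reading.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

n = 10_000
# Assumed model: stable true level per patient, plus independent noise per visit
true_level = [random.gauss(100, 10) for _ in range(n)]
visit1 = [t + random.gauss(0, 10) for t in true_level]
visit2 = [t + random.gauss(0, 10) for t in true_level]

# Select patients whose first reading exceeded a hypothetical cutoff
high = [i for i in range(n) if visit1[i] > 120]
baseline_mean = mean(visit1[i] for i in high)
followup_mean = mean(visit2[i] for i in high)

print(f"selected patients: {len(high)}")
print(f"mean at visit 1:   {baseline_mean:.1f}")
print(f"mean at visit 2:   {followup_mean:.1f}")
```

The drop between visits in the selected group arises purely from selecting on a noisy measurement.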

statistical methods, R, simulation, Laura Jamison

Simulating Multilevel Data

The Structure of Multilevel Data

The term “multilevel data” refers to data organized in a hierarchical structure, where units of analysis are grouped into clusters. For example, in a cross-sectional study, multilevel data could be made up of individual measurements of students from different schools, where students are nested within schools. In a longitudinal study, multilevel data could be made up of multiple time point measurements of individuals, where time points are nested within individuals.
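As a minimal sketch of the cross-sectional case, the Python example below generates students nested within schools using a random-intercept structure: every student in a school shares that school's effect and adds individual variation. All names and parameter values (`grand_mean`, `school_sd`, `student_sd`, and the counts) are hypothetical.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

n_schools = 5
students_per_school = 4
grand_mean = 70.0   # assumed overall average score
school_sd = 5.0     # assumed between-school variation
student_sd = 10.0   # assumed within-school variation

rows = []
for school in range(1, n_schools + 1):
    # One draw per cluster: shared by every student in the school
    school_effect = random.gauss(0, school_sd)
    for student in range(1, students_per_school + 1):
        score = grand_mean + school_effect + random.gauss(0, student_sd)
        rows.append({"school": school, "student": student, "score": score})

for row in rows[:4]:
    print(row)
```

Each row is one student, and the `school` column records the cluster it belongs to, which is the long format that mixed-effect modeling tools typically expect.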

mixed effect models, simulation, R, lme4, statistical methods, Laura Jamison

How to Use Docker for Study Reproducibility with R Markdown

Docker is a software product that allows for the efficient building, packaging, and deployment of applications. It uses containers, which are isolated environments that bundle software and its dependencies. A container can run an application on any other computer with the same software, dependencies, and settings as the original machine, without affecting the host system. In this regard, Docker differs from a virtual machine in that it does not require a guest operating system.

R, R Markdown, Docker, reproducibility, Laura Jamison

Theil-Sen Regression: Programming and Understanding an Outlier-Resistant Alternative to Least Squares

Least squares is so frequently the method by which linear regressions are estimated that in many write-ups of analyses, explicit mention of the method is omitted. Authors save the ink or pixels otherwise consumed by “least squares” and let it simply be inferred. This is an understandable elision: You could make good money repeatedly betting that when someone says that they fit a linear regression, they did so via least squares. But alternative estimation methods are on offer—and are sometimes preferable.

R, simulation, statistical methods, nonparametric statistics, Theil-Sen regression, Jacob Goldstein-Greenwood

Getting Started with Analysis of Covariance

The Analysis of Covariance, or ANCOVA, is a regression model that includes both categorical and numeric predictors, often just one of each. It is commonly used to analyze a follow-up numeric response after exposure to various treatments, controlling for a baseline measure of that same response. For example, given two subjects with the same baseline value of the study outcome, one in a treated group and the other in a control group, will the subjects have different follow-up outcomes on average?

R, effect plots, power analysis, statistical methods, ANCOVA, ANOVA, Clay Ford

Bootstrap Estimates of Confidence Intervals

Bootstrapping is a statistical procedure that resamples a sample, with replacement, to infer properties of a wider population.
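A minimal sketch of the idea in Python, using only the standard library: resample the data with replacement many times, compute the statistic on each resample, and take percentiles of those estimates as an approximate confidence interval. The function name `bootstrap_ci`, its defaults, and the data values are hypothetical, and the percentile interval is just one of several bootstrap interval methods.

```python
import random
from statistics import mean

def bootstrap_ci(sample, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Approximate percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample draws n values from the sample, with replacement
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lower = estimates[round((alpha / 2) * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements
data = [4.1, 5.3, 3.8, 6.0, 5.5, 4.9, 5.1, 4.4, 6.2, 5.0]
lower, upper = bootstrap_ci(data)
print(f"sample mean: {mean(data):.2f}")
print(f"95% bootstrap CI: ({lower:.2f}, {upper:.2f})")
```

Passing a different function as `stat` (for example `statistics.median`) gives an interval for that statistic instead.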

Python, statistical methods, confidence intervals, bootstrap, Samantha Lomuscio