StatLab Articles

Assessing Model Assumptions with Lineup Plots

When fitting a linear model we make two assumptions about the distribution of residuals:

statistical methods, diagnostic plots, qqplot, visualization, R, Clay Ford

Bootstrapping Residuals for Linear Models with Heteroskedastic Errors Invites Trouble

Bootstrapping—resampling data with replacement and recomputing quantities of interest—lets analysts approximate sampling distributions for complex estimators and frees them of the reliably unmet assumptions of traditional, parametric inferential statistics. It’s an elegant, intuitive approach in which an analyst exploits the (often) parallel resample-to-sample and sample-to-population relationships to understand uncertainty in an estimate.

R, statistical methods, simulation, bootstrap, nonparametric statistics, Jacob Goldstein-Greenwood

Power and Sample Size Estimation for Logistic Regression

In this article we demonstrate how to use simulation in R to estimate power and sample size for proposed logistic regression models that feature two binary predictors and their interaction.

Recall that logistic regression attempts to model the probability of an event conditional on the values of predictor variables. If we have a binary response, y, and two predictors, x and z, that interact, we specify the logistic regression model as follows:

R, logistic regression, power analysis, simulation, statistical methods, Clay Ford

Understanding Somers' D

When it comes to summarizing the association between two numeric variables, we can use Pearson or Spearman correlation. When accompanied with a scatterplot, they allow us to quantify association on a scale from -1 to 1. But what if we have two ordered categorical variables with just a few levels? How can we summarize their association? One approach is to calculate Somers’ Delta, or Somers’ D for short.

R, statistical methods, AUC, logistic regression, Clay Ford

Starting with Non-Metric Multidimensional Scaling (NMDS)

Real world problems and data are complex and there are often situations where we want to simultaneously look at relationships between more than one variable. We call these analyses multivariate statistics. Non-metric multidimensional scaling, or NMDS, is one multivariate technique that allows us to visualize these complex relationships in less dimensions. In other words, NMDS takes complex, multivariate data and represents the relationships in a way that is easier for interpretation.

R, NMDS, statistical methods, Non-Metric Multidimensional Scaling, Lauren Brideau

Graphical Linearity Assessment for One- and Two-Predictor Logistic Regressions

Logistic regression is a flexible tool for modeling binary outcomes. A logistic regression describes the probability, \(P\), of 1/“yes”/“success” (versus 0/“no”/“failure”) as a linear combination of predictors:

\[log(\frac{P}{1-P}) = B_0 + B_1X_1 + B_2X_2 + ... + B_kX_k\]

R, logistic regression, simulation, model assessment, statistical methods, Jacob Goldstein-Greenwood

Why Preallocate Memory in R Loops?

In R, “growing” an object—extending an atomic vector one element at a time; adding elements one by one to the end of a list; etc.—is an easy way to elicit a mild admonishment from someone reviewing or revising your code. Growing most frequently occurs in the context of for loops: A loop computes a value (or set of values) on each iteration, and it then appends the value(s) to an existing object.

R, simulation, preallocation, optimization, Jacob Goldstein-Greenwood

Regression to the Mean and Change Score Analysis

Regression to the mean refers to a phenomena where natural variation within an individual can mistakenly appear as meaningful change over time. To illustrate, imagine a patient who comes in for a regular check-up and is found to have high blood sugar levels. This may be cause for concern, and the doctor recommends several dietary adjustments and schedules a follow-up for the next week. During the follow-up visit, the patient’s blood sugar levels have seemingly returned to a normal range.

statistical methods, R, simulation, Laura Jamison

Simulating Multilevel Data

The Structure of Multilevel Data

The term “multilevel data” refers to data organized in a hierarchical structure, where units of analysis are grouped into clusters. For example, in a cross-sectional study, multilevel data could be made up of individual measurements of students from different schools, where students are nested within schools. In a longitudinal study, multilevel data could be made up of multiple time point measurements of individuals, where time points are nested within individuals.

mixed effect models, simulation, R, lme4, statistical methods, Laura Jamison

How to Use Docker for Study Reproducibility with R Markdown

Docker is a software product that allows for the efficient building, packaging, and deployment of applications. It uses containers, which are isolated environments that bundle software and its dependencies. These containers can run an application with all the same software, dependencies, settings, and more as were on the original machine on any other computer without affecting the host system. In this regard Docker is different from a virtual machine in that it does not require a guest operating system.

R, R Markdown, Docker, reproducibility, Laura Jamison