StatLab Articles

Getting Started with Binomial Generalized Linear Mixed Models

Binomial generalized linear mixed models, or binomial GLMMs, are useful for modeling binary outcomes for repeated or clustered measures. For example, let’s say we design a study that tracks what college students eat over the course of 2 weeks, and we’re interested in whether or not they eat vegetables each day. For each student, we’ll have 14 binary events: eat vegetables or not.

R, logistic regression, mixed effect models, simulation, statistical methods, binomial GLMM, Clay Ford

Getting Started with Web Scraping in Python

"Web scraping," or "data scraping," is simply the process of extracting data from a website. This can, of course, be done manually: You could go to a website, find the relevant data or information, and enter that information into some data file that you have stored locally. But imagine that you want to pull a very large dataset or data from hundreds or thousands of individual URLs. In this case, extracting the data manually sounds overwhelming and time-consuming.

Python, data wrangling, web scraping, Hannah Lewis

A Brief on Brier Scores

Not all predictions are created equal, even if, in categorical terms, the predictions suggest the same outcome: “X will (or won’t) happen.” Say that I estimate that there’s a 60% chance that 100 million COVID-19 vaccines will be administered in the US during the first 100 days of Biden’s presidency, but my friend estimates that there’s a 90% chance of that outcome.
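To make the comparison concrete, here is a minimal sketch of how a Brier score could separate the two forecasts above. It assumes the standard definition (the mean squared difference between the forecast probability and the 0/1 outcome); the function name `brier_score` and the variable names are illustrative and not taken from the article.

```python
def brier_score(forecast_probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecast_probs) != len(outcomes):
        raise ValueError("forecast_probs and outcomes must have the same length")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecast_probs, outcomes)]
    return sum(squared_errors) / len(squared_errors)


my_forecast = 0.60      # my estimated probability of the event
friend_forecast = 0.90  # my friend's estimated probability

# Score both forecasts under each possible outcome (1 = happened, 0 = did not).
for outcome in (1, 0):
    mine = brier_score([my_forecast], [outcome])
    friend = brier_score([friend_forecast], [outcome])
    print(f"outcome={outcome}: mine={mine:.2f}, friend={friend:.2f}")
```

Lower scores are better. If the event happens, the more confident 90% forecast is rewarded; if it does not, that same confidence is penalized more heavily than the 60% forecast.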

R, statistical methods, Brier scores, Jacob Goldstein-Greenwood

Getting Started with pandas in Python

The pandas package is an open-source software library written for data analysis in Python. Pandas allows users to import data from various file formats (comma-separated values, JSON, SQL, fits, etc.) and perform data manipulation operations, including cleaning and reshaping the data, summarizing observations, grouping data, and merging multiple datasets. In this article, we'll briefly explore some of the most commonly used functions and methods for understanding, formatting, and visualizing data with the pandas package.

Python, data wrangling, pandas, matplotlib, Hannah Lewis

Understanding Multiple Comparisons and Simultaneous Inference

When it comes to confidence intervals and hypothesis testing, there are two important limitations to keep in mind.

The significance level, \(\alpha\), or the confidence interval coverage, \(1 - \alpha\),

R, simulation, statistical methods, multiple comparisons, multcomp, Clay Ford

Understanding Robust Standard Errors

What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.

R, Stata, linear regression, simulation, statistical methods, Clay Ford

Getting Started with Multinomial Logit Models

Multinomial logit models allow us to model membership in a group based on known variables. For example, the operating system preferences of a university’s students could be classified as “Windows,” “Mac,” or “Linux.” Perhaps we would like to better understand why students choose one OS versus another. We might want to build a statistical model that allows us to predict the probability of selecting an OS based on information such as sex, major, financial aid, and so on. Multinomial logit modeling allows us to propose and fit such models.

R, effect plots, statistical methods, multinomial logistic regression, baseline logit models, Clay Ford

Understanding Empirical Cumulative Distribution Functions

What are empirical cumulative distribution functions and what can we do with them? To answer the first question, let’s first step back and make sure we understand "distributions", or more specifically, "probability distributions".

A Basic Probability Distribution

Imagine a simple event, say flipping a coin 3 times. Here are all the possible outcomes, where H = heads and T = tails:
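As a rough sketch, the outcomes can be enumerated in a few lines of Python. Two things here are assumptions rather than part of the excerpt: the coin is taken to be fair, and the number of heads is used as the quantity whose probability distribution we tabulate.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Every sequence of 3 flips: 2 * 2 * 2 = 8 outcomes.
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]
print(outcomes)

# Assuming a fair coin, each outcome is equally likely, so the probability
# of seeing k heads is (number of outcomes with k heads) / 8.
heads_counts = Counter(outcome.count("H") for outcome in outcomes)
distribution = {
    k: Fraction(n, len(outcomes)) for k, n in sorted(heads_counts.items())
}

for k, prob in distribution.items():
    print(f"P(heads = {k}) = {prob}")
```

The probabilities sum to 1, which is what makes this table a probability distribution.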

R, statistical methods, ECDF, qqplot, Clay Ford

Getting Started with Rate Models

Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time, perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for auto accidents?
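One way to see why equal counts need not mean equal propensity is to divide each count by some measure of exposure. The sketch below uses made-up traffic volumes for two hypothetical intersections; the names, the numbers, and the choice of "vehicles passing through" as the exposure are all assumptions for illustration only.

```python
# Hypothetical data: both intersections recorded 25 accidents,
# but very different numbers of vehicles passed through them.
intersections = {
    "A": {"accidents": 25, "vehicles": 50_000},
    "B": {"accidents": 25, "vehicles": 5_000},
}


def rate_per_1000(accidents, vehicles):
    """Accidents per 1,000 vehicles (the count scaled by exposure)."""
    return 1000 * accidents / vehicles


rates = {name: rate_per_1000(**data) for name, data in intersections.items()}

for name, rate in rates.items():
    print(f"Intersection {name}: {rate:.1f} accidents per 1,000 vehicles")
```

With these invented volumes, the same count of 25 corresponds to a rate ten times higher at intersection B than at intersection A, which is the kind of difference a rate model is meant to capture.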

R, effect plots, statistical methods, rate models, count regression, Clay Ford