# statistical methods

Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent latent continuous variables that have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean they’re both normally distributed with mean 0 and standard deviation 1, and that they have linear correlation.

In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.

Association rule mining is a machine learning technique designed to uncover associations between categorical variables in data. It produces association rules which reflect how frequently categories are found together. For instance, a common application of association rule mining is the “Frequently Bought Together” section of the checkout page from an online store. Through mining across transactions from that store, items that are frequently bought together have been identified (e.g., shaving cream and razors).

When fitting a linear model we make two assumptions about the distribution of residuals:

Bootstrapping—resampling data with replacement and recomputing quantities of interest—lets analysts approximate sampling distributions for complex estimators and frees them of the reliably unmet assumptions of traditional, parametric inferential statistics. It’s an elegant, intuitive approach in which an analyst exploits the (often) parallel resample-to-sample and sample-to-population relationships to understand uncertainty in an estimate.

In this article we demonstrate how to use simulation in R to estimate power and sample size for proposed logistic regression models that feature two binary predictors and their interaction.

Recall that logistic regression attempts to model the probability of an event conditional on the values of predictor variables. If we have a binary response, *y*, and two predictors, *x* and *z*, that interact, we specify the logistic regression model as follows:

When it comes to summarizing the association between two numeric variables, we can use Pearson or Spearman correlation. When accompanied with a scatterplot, they allow us to quantify association on a scale from -1 to 1. But what if we have two *ordered categorical* variables with just a few levels? How can we summarize their association? One approach is to calculate *Somers’ Delta*, or *Somers’ D* for short.

Real world problems and data are complex and there are often situations where we want to simultaneously look at relationships between more than one variable. We call these analyses multivariate statistics. Non-metric multidimensional scaling, or NMDS, is one multivariate technique that allows us to visualize these complex relationships in less dimensions. In other words, NMDS takes complex, multivariate data and represents the relationships in a way that is easier for interpretation.

Logistic regression is a flexible tool for modeling binary outcomes. A logistic regression describes the probability, \(P\), of 1/“yes”/“success” (versus 0/“no”/“failure”) as a linear combination of predictors:

\[log(\frac{P}{1-P}) = B_0 + B_1X_1 + B_2X_2 + ... + B_kX_k\]

Regression to the mean refers to a phenomena where natural variation within an individual can mistakenly appear as meaningful change over time. To illustrate, imagine a patient who comes in for a regular check-up and is found to have high blood sugar levels. This may be cause for concern, and the doctor recommends several dietary adjustments and schedules a follow-up for the next week. During the follow-up visit, the patient’s blood sugar levels have seemingly returned to a normal range.

The term “multilevel data” refers to data organized in a hierarchical structure, where units of analysis are grouped into clusters. For example, in a cross-sectional study, multilevel data could be made up of individual measurements of students from different schools, where students are nested within schools. In a longitudinal study, multilevel data could be made up of multiple time point measurements of individuals, where time points are nested within individuals.

Least squares is so frequently the method by which linear regressions are estimated that in many write-ups of analyses, explicit mention of the method is omitted. Authors save the ink or pixels otherwise consumed by “least squares” and let it simply be inferred. This is an understandable elision: You could make good money repeatedly betting that when someone says that they fit a linear regression, they did so via least squares. But alternative estimation methods are on offer—and are sometimes preferable.

The Analysis of Covariance, or ANCOVA, is a regression model that includes both categorical and numeric predictors, often just one of each. It is commonly used to analyze a follow-up numeric response after exposure to various treatments, controlling for a baseline measure of that same response. For example, given two subjects with the same baseline value of the study outcome, one in a treated group and the other in a control group, will the subjects have different follow-up outcomes on average?

Bootstrapping is a statistical procedure that utilizes resampling (with replacement) of a sample to infer properties of a wider population.

A Simple Slopes Analysis is a follow-up procedure to regression modeling that helps us investigate and interpret “significant” interactions. The analysis is often employed for interactions between two numeric predictors, but it can be applied to other types of interactions as well. To motivate why we might be interested in this type of analysis, consider the following research question:

Does the length of time in a managerial position (X) and a manager’s ability (Z) help explain or predict a manager’s self-assurance (Y)?

Analysts and researchers occasionally want to compare the magnitudes of different predictive or causal effects estimated via regression. But comparison is a tricky endeavor when predictor variables are measured on different scales: If *y* is predicted from *x* and *z*, with *x* measured in kilograms and *z* measured in years, what does the relative size of the variables’ regression coefficients communicate about which variable is “more strongly” associated with *y*?

In this article we demonstrate how to simulate data suitable for a multinomial logistic regression model using R. One reason to do this is to gain a better understanding of how multinomial logistic regression models work. Another is to simulate data for the purposes of estimating power and sample size for a planned experiment that will involve a multinomial logistic regression analysis.

When designing an experiment it’s good practice to estimate the number of subjects or observations we’ll need. If we recruit or collect too few, our analysis may be too uncertain or misleading. If we collect too many, we potentially waste time and expense on diminishing returns. The optimal sample size provides enough information to allow us to analyze our research questions with confidence. The traditional approach to sample size estimation is based on *hypothesis tests*.

R users who have run base R’s `prop.test()`

function to perform a null hypothesis test of a proportion—as when assessing whether a coin is weighted toward heads or whether more than half of the wines a vineyard sold in a given month were reds—may have noticed curious language in the output: The default test is reported as having been performed with a “continuity correction.”

I’ve heard something frightening from practicing statisticians who frequently use mixed effects models. Sometimes when I ask them whether they produced a [semi]variogram to check the correlation structure they reply “what’s that?” -Frank Harrell

From 2004 to 2008, a series of four brief, disagreeing papers in the journal *Medical Education* took up the question of whether and when it’s appropriate to analyze data from Likert scales (i.e., integers reflecting degrees of agreement with statements) with parametric or nonparametric statistical methods.

In this article, we plan to get you up and running with gamma regression. But before we dive into that, let’s review the familiar normal distribution. This will provide some scaffolding to help us transition to the gamma distribution.

If you have ever performed binary logistic regression in R using the `glm()`

function, you may have noticed a summary of “Deviance Residuals” at the top of the summary output. In this article, we talk about how these residuals are calculated and what we can use them for. We also talk about other types of residuals available for binary logistic regression.

## What is Logistic Regression?

Logistic regression is a predictive analysis that estimates/models the probability of event occurring based on a given dataset. This dataset contains both independent variables, or predictors, and their corresponding dependent variables, or responses.

Let’s say we fit a logistic regression model for the purpose of predicting the probability of low infant birth weight, which is an infant weighing less than 2.5 kg. Below we fit such a model using the `birthwt`

data set that comes with the MASS package in R. (This is an example model and not to be used as medical advice.)

We first subset the data to select four variables:

In regression modeling, influential points are observations that, individually, exert large effects on a model’s results—the parameter estimates (\(\hat{\beta_0}, \hat{\beta_1}, ..., \hat{\beta_j}\)) and, consequently, the model’s predictions (\(\hat{y_1}, \hat{y_2}, ..., \hat{y_i}\)).

Occasionally we are asked to help students or faculty implement a mixed-effect model in SPSS. Our training and expertise is primarily in R, so it can be challenging to transfer and apply our knowledge to SPSS. In this article we document for posterity how to fit some basic mixed-effect models in R using the lme4 and nlme packages, and how to replicate the results in SPSS.

In this article we work with R 4.2.0, lme4 version 1.1-29, nlme version 3.1-157, and SPSS version 28.0.1.1.

*This article assumes basic familiarity with the use and interpretation of logistic regression, odds and probabilities, and true/false positives/negatives. The examples are coded in R. ROC curves and AUC have important limitations, and I encourage reading through the section at the end of the article to get a sense of when and why the tools can be of limited use.*

There are many medical tests for detecting the presence of a disease or condition. Some examples include tests for lesions, cancer, pregnancy, or COVID-19. While these tests are usually accurate, they’re not perfect. In addition, some tests are designed to detect the same condition, but use a different method. A recent example are PCR and antigen tests for COVID-19. In these cases we might want to compare the two tests on the same subjects. This is known as a *paired study design*.

If you have ever used the R package lme4 to perform mixed-effect modeling you may have noticed the “Correlation of Fixed Effects” section at the bottom of the summary output. This article intends to shed some light on what this section means and how you might interpret it.

One of the most well-known statistical tests to analyze the differences between means of given groups is the ANOVA (analysis of variance) test. While ANOVA is a great tool, it assumes that the data in question follows a normal distribution. What if your data doesn’t follow a normal distribution or if your sample size is too small to determine a normal distribution? That’s where the Kruskal-Wallis test comes in.

What are *average marginal effects*? If we unpack the phrase, it looks like we have *effects* that are *marginal* to something, all of which we *average*. So let’s look at each piece of this phrase and see if we can help you get a better handle on this topic.

Say it with me: *An X% confidence interval captures the population parameter in X% of repeated samples.*

In the course of our statistical educations, many of us had that line (or some variant of it) crammed, wedged, stuffed, and shoved into our skulls until definitional precision was leaking out of noses and pooling on our upper lips like prop blood.

Or, at least, I felt that way.

The power of a test is the probability of correctly rejecting a null hypothesis. For example, let’s say we suspect a coin is not fair and lands heads 65% of the time.

It is well documented that post hoc power calculations are not useful (Althouse, 2020; Goodman & Berlin, 1994; Hoenig & Heisey, 2001). Also known as observed power or retrospective power, post hoc power purports to estimate the power of a test given an observed effect size. The idea is to show that a “non-significant” hypothesis test failed to achieve significance because it wasn’t powerful enough. This allows researchers to entertain the notion that their hypothesized effect may actually exist; they just needed to use a bigger sample size.

Consider the following data from the text *Design and Analysis of Experiments, 7th ed.* (Montgomery, 2009, Table 3.1). It has two variables: `power`

and `rate`

. `power`

is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. `rate`

is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate.

Generalized estimating equations, or GEE, is a method for modeling longitudinal or clustered data. It is usually used with non-normal data such as binary or count data. The name refers to a set of equations that are solved to obtain parameter estimates (i.e., model coefficients). If interested, see Agresti (2002) for the computational details. In this article we simply aim to get you started with implementing and interpreting GEE using the R statistical computing environment.

Binomial generalized linear mixed models, or binomial GLMMs, are useful for modeling binary outcomes for repeated or clustered measures. For example, let’s say we design a study that tracks what college students eat over the course of 2 weeks, and we’re interested in whether or not they eat vegetables each day. For each student, we’ll have 14 binary events: eat vegetables or not.

Not all predictions are created equal, even if, in categorical terms, the predictions suggest the same outcome: “X will (or won’t) happen.” Say that I estimate that there’s a 60% chance that 100 million COVID-19 vaccines will be administered in the US during the first 100 days of Biden’s presidency, but my friend estimates that there’s a 90% chance of that outcome.

When it comes to confidence intervals and hypothesis testing there are two important limitations to keep in mind.

The significance level,^{1} \(\alpha\), or the confidence interval coverage, \(1 - \alpha\),

What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.

Multinomial logit models allow us to model membership in a group based on known variables. For example, the operating system preferences of a university’s students could be classified as “Windows,” “Mac,” or “Linux.” Perhaps we would like to better understand why students choose one OS versus another. We might want to build a statistical model that allows us to predict the probability of selecting an OS based on information such as sex, major, financial aid, and so on. Multinomial logit modeling allows us to propose and fit such models.

What are empirical cumulative distribution functions and what can we do with them? To answer the first question, let’s first step back and make sure we understand "distributions", or more specifically, "probability distributions".

**A Basic Probability Distribution**

Imagine a simple event, say flipping a coin 3 times. Here are all the possible outcomes, where H = head and T = tails:

Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time, perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for auto accidents?

One of the basic assumptions of linear modeling is constant, or *homogeneous*, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.

Below we create a sorted vector of numbers ranging from 1 to 10 called `x`

, and then create a vector of numbers called `y`

that is a function of `x`

. When we plot `x`

vs `y`

, we get a straight line with an intercept of 1.2 and a slope of 2.1.

A count model is a linear model where the dependent variable is a count. For example, the number of times a car breaks down, the number of rats in a litter, the number of times a young student gets out of his seat, etc. Counts are either 0 or a positive whole number, which means we need to use special distributions to generate the data.

Logistic regression is a method for modeling binary data as a function of other variables. For example we might want to model the occurrence or non-occurrence of a disease given predictors such as age, race, weight, etc. The result is a model that returns a predicted probability of occurrence (or non-occurrence, depending on how we set up our data) given certain values of our predictors. We might also be able to interpret the coefficients in our model to summarize how a change in one predictor affects the odds of occurrence.

Whenever we are dealing with a dataset, we almost always run into a problem that may decrease our confidence in the results that we are getting - missing data! Examples of missing data can be found in surveys - where respondents intentionally refrained from answering a question, didn’t answer a question because it is not applicable to them, or simply forgot to give an answer. Or our dataset on trade in agricultural products for country-pairs over years could suffer from missing data as some countries fail to report their accounts for certain years.

The paper *Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors* by Andrew Gelman and John Carlin introduces the idea of performing design calculations to help prevent researchers from being misled by statistically significant results in studies with small samples and/or noisy measurements.

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the right due to Alaska, California, Texas and a few others.

*Note: This article demonstrates how to use propensity scores for matching data. However, propensity scores have come under fire in recent years. In their 2019 article, Why Propensity Scores Should Not Be Used for Matching, King and Nielsen argue that propensity scores increase imbalance, inefficiency, model dependence, and bias. Others argue that while propensity scores may be sub-optimal, they can be useful in certain situations.*

In a previous post we demonstrated how to perform a basic mediation analysis. In this post we look at performing a moderated mediation analysis. The basic idea is that a mediator may depend on another variable called a "moderator". For example, in our mediation analysis post we hypothesized that self-esteem was a mediator of student grades on the effect of student happiness. We illustrate this below with a path diagram.

Multivariate Multiple Regression is a method of modeling multiple responses, or dependent variables, with a single set of predictor variables. For example, we might want to model both math and reading SAT scores as a function of gender, race, parent income, and so forth. This allows us to evaluate the relationship of, say, gender with each score. You may be thinking, "why not just run separate regressions for each dependent variable?" That's actually a good idea! And in fact that's pretty much what multivariate multiple regression does.

Proportional-odds logistic regression is often used to model an ordered categorical response. By "ordered", we mean categories that have a natural ordering, such as "Disagree", "Neutral", "Agree", or "Everyday", "Some days", "Rarely", "Never". For a primer on proportional-odds logistic regression, see our post, Fitting and Interpreting a Proportional Odds Model.

The Wilcoxon Rank Sum Test is often described as the non-parametric version of the two-sample t-test. You sometimes see it in analysis flowcharts after a question such as "is your data normal?" A "no" branch off this question will recommend a Wilcoxon test if you're comparing two groups of continuous measures.

So what is this Wilcoxon test? What makes it non-parametric? What does that even mean? And how do we implement it and interpret it? Those are some of the questions we aim to address in this post.

Pairwise comparison means comparing all pairs of something. If I have three items, A, B and C, that means comparing A to B, A to C, and B to C. Given *n* items, I can determine the number of possible pairs using the binomial coefficient: $$ \frac{n!}{2!(n - 2)!} = \binom {n}{2}$$ Using the R statistical computing environment, we can use the `choose()`

function to quickly calculate this.

Take a look at the following correlation matrix for Olympic decathlon data calculated from 280 scores from 1960 through 2004 (Johnson & Wichern, 2007, p. 499):

Loglinear models model cell counts in contingency tables. They're a little different from other modeling methods in that they don't distinguish between response and explanatory variables. All variables in a loglinear model are essentially "responses."

To learn more about loglinear models, we'll explore the following data from Agresti (1996, Table 6.3). It summarizes responses from a survey that asked high school seniors in a particular city whether they had ever used alcohol, cigarettes, or marijuana.

Hurdle Models are a class of models for count data that help handle excess zeros and overdispersion. To motivate their use, let's look at some data in R. The following data come with the AER package. It is a sample of 4,406 individuals, aged 66 and over, who were covered by Medicare in 1988. One of the variables the data provide is number of physician office visits.

*Note: This post is not about hierarchical linear modeling (HLM; multilevel modeling). Hierarchical regression is model comparison of nested regression models.*

When it comes to modeling counts (i.e., whole numbers greater than or equal to 0), we often start with Poisson regression. This is a generalized linear model where a response is assumed to have a Poisson distribution conditional on a weighted sum of predictors. For example, we might model the number of documented concussions to NFL quarterbacks as a function of snaps played and the total years experience of his offensive line. However, one potential drawback of Poisson regression is that it may not accurately describe the variability of the counts.

*This post intends to introduce the basics of mediation analysis and does not explain statistical details. For details, please refer to the articles at the end of this post.*

## What is mediation?

Let’s say previous studies have suggested that higher grades predict higher happiness: X (grades) → Y (happiness). (This research example is made up for illustration purposes. Please don’t consider it a scientific statement.)

When doing linear modeling or ANOVA it’s useful to examine whether or not the effect of one variable depends on the level of one or more variables. If it does then we have what is called an “interaction”. This means variables combine or interact to affect the response. The simplest type of interaction is the interaction between two two-level categorical variables. Let’s say we have gender (male and female), treatment (yes or no), and a continuous response measure. If the response to treatment depends on gender, then we have an interaction.

The classic two-by-two table displays counts of what may be called “successes” and “failures” versus some two-level grouping variable, such as sex (male and female) or treatment (placebo and active drug). An example of one such table is given in the book *An Introduction to Categorical Data Analysis* (Agresti, 1996, p. 20). The table classifies myocardial infarction (Yes/No) with treatment group (Placebo/Aspirin).

Cronbach's alpha is a measure used to assess the reliability, or internal consistency, of a set of scale or test items. In other words, the reliability of any given measurement refers to the extent to which it is a consistent measure of a concept, and Cronbach’s alpha is one way of measuring the strength of that consistency.

On Thursday, October 15, 2015, a disbelieving student posted on Reddit: My stats professor just went on a rant about how R-squared values are essentially useless, is there any truth to this? It attracted a fair amount of attention, at least compared to other posts about statistics on Reddit.

Take a look at the following table. It is a cross tabulation of data taken from the 1991 General Social Survey that relates political party affiliation to political ideology (Agresti, *An Introduction to Categorical Data Analysis*, 1996).

You ran a linear regression analysis and the stats software spit out a bunch of numbers. The results were significant (or not). You might think that you’re done with analysis. No, not yet. After running a regression analysis, you should check if the model works well for the data.

When we think of regression, we usually think of linear regression, the tried and true method for estimating a mean of some variable conditional on the levels or values of independent variables. In other words, we're pretty sure the mean of our variable of interest differs depending on other variables. For example, the mean weight of 1st-year UVA males is some unknown value. But we could in theory take a random sample and discover there is a relationship between weight and height.

When I first learned data analysis, I always checked normality for each variable and made sure they were normally distributed before running any analyses, such as *t*-test, ANOVA, or linear regression. I thought normal distribution of variables was the important assumption to proceed to analyses. That’s why stats textbooks show you how to draw histograms and QQ-plots in the beginning of data analysis in the early chapters and see if variables are normally distributed, isn’t it?

First off, what is endogeneity, and why would we want to simulate it?

Endogeneity occurs when a statistical model has an independent variable that is correlated with the error term. The reason we would want to simulate it is to understand what exactly that definition means!

Let’s first simulate ideal data for simple linear regression using R.

The QQ plot, or quantile-quantile plot, is a graphical tool to help us assess if a set of data plausibly came from some theoretical distribution such as a normal or exponential. For example, if we run a statistical analysis that assumes our residuals are normally distributed, we can use a normal QQ plot to check that assumption. It's just a visual check, not an air-tight proof, so it is somewhat subjective.

One of the assumptions of the Analysis of Variance (ANOVA) is *constant variance*. That is, the spread of residuals is roughly equal per treatment level. A common way to assess this assumption is plotting residuals versus fitted values. Recall that residuals are the observed values of your response of interest minus the predicted values of your response. In a one-way ANOVA, this is simply the observed values minus the treatment group mean. For example, below we have a plot of residuals versus fitted values for a one-way ANOVA.