# Jacob Goldstein-Greenwood

## Why Preallocate Memory in R Loops?

In R, “growing” an object—extending an atomic vector one element at a time; adding elements one by one to the end of a list; etc.—is an easy way to elicit a mild admonishment from someone reviewing or revising your code. Growing most frequently occurs in the context of for loops: A loop computes a value (or set of values) on each iteration, and it then appends the value(s) to an existing object.
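As a minimal sketch of the two patterns described above (the vector length `n` and the squared-index computation are arbitrary choices for illustration, not taken from the article), growing and preallocating might look like this:

```r
n <- 1e4  # illustrative length; any positive integer works

# Growing: the vector starts empty and is extended by one element per iteration
grown <- c()
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the vector is created at its final length, then filled in place
prealloc <- numeric(n)
for (i in seq_len(n)) {
  prealloc[i] <- i^2
}

# Both approaches produce the same values; they differ in how memory is used
identical(grown, prealloc)
```

Wrapping each loop in `system.time()` is one simple way to compare their run times on a given machine.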

## Theil-Sen Regression: Programming and Understanding an Outlier-Resistant Alternative to Least Squares

Least squares is so frequently the method by which linear regressions are estimated that in many write-ups of analyses, explicit mention of the method is omitted. Authors save the ink or pixels otherwise consumed by “least squares” and let it simply be inferred. This is an understandable elision: You could make good money repeatedly betting that when someone says that they fit a linear regression, they did so via least squares. But alternative estimation methods are on offer—and are sometimes preferable.

## The Shortcomings of Standardized Regression Coefficients

Analysts and researchers occasionally want to compare the magnitudes of different predictive or causal effects estimated via regression. But comparison is a tricky endeavor when predictor variables are measured on different scales: If y is predicted from x and z, with x measured in kilograms and z measured in years, what does the relative size of the variables’ regression coefficients communicate about which variable is “more strongly” associated with y?

## Continuity Corrections: Imperfect Responses to Slight Problems

R users who have run base R’s prop.test() function to perform a null hypothesis test of a proportion—as when assessing whether a coin is weighted toward heads or whether more than half of the wines a vineyard sold in a given month were reds—may have noticed curious language in the output: The default test is reported as having been performed with a “continuity correction.”
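To see that language for yourself, here is a small sketch using made-up data (60 heads in 100 flips is a hypothetical example, not a figure from the article). It runs `prop.test()` with its default settings and again with the `correct` argument set to `FALSE`:

```r
# Hypothetical data: 60 heads observed in 100 coin flips
heads <- 60
flips <- 100

# Default behavior: correct = TRUE, so the continuity correction is applied
with_cc <- prop.test(x = heads, n = flips, p = 0.5)

# Same test with the continuity correction switched off
without_cc <- prop.test(x = heads, n = flips, p = 0.5, correct = FALSE)

# The method label in the output states whether the correction was used
with_cc$method
without_cc$method

# The correction shrinks the test statistic, which raises the p-value
c(with_cc = unname(with_cc$statistic), without_cc = unname(without_cc$statistic))
c(with_cc = with_cc$p.value, without_cc = without_cc$p.value)
```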

## Nonparametric and Parametric Power: Comparing the Wilcoxon Test and the t-test

From 2004 to 2008, a series of four brief, disagreeing papers in the journal Medical Education took up the question of whether and when it’s appropriate to analyze data from Likert scales (i.e., integers reflecting degrees of agreement with statements) with parametric or nonparametric statistical methods.

## Detecting Influential Points in Regression with DFBETA(S)

In regression modeling, influential points are observations that, individually, exert large effects on a model’s results: the parameter estimates ($$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_j$$) and, consequently, the model’s predictions ($$\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_i$$).
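A rough sketch of the idea, assuming a simple model fit to R’s built-in `mtcars` data (the model and the choice of observation are illustrative, not drawn from the article): base R’s `dfbeta()` and `dfbetas()` report how much each coefficient changes when a single observation is dropped, and the same quantity can be approximated by hand by refitting the model without that observation.

```r
# Illustrative model on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)

# One row per observation, one column per coefficient
raw_change <- dfbeta(fit)    # change in each coefficient, in coefficient units
std_change <- dfbetas(fit)   # the same changes, scaled by a standard error

# Manual check for a single observation (i = 1 is an arbitrary choice):
# refit without it and difference the coefficients
i <- 1
fit_drop <- lm(mpg ~ wt, data = mtcars[-i, ])
manual <- coef(fit) - coef(fit_drop)

# The magnitudes should agree with the corresponding row of dfbeta()
rbind(dfbeta = raw_change[i, ], manual = manual)

# Observations with the largest standardized change in the slope
head(sort(abs(std_change[, "wt"]), decreasing = TRUE), 3)
```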

## ROC Curves and AUC for Models Used for Binary Classification

This article assumes basic familiarity with the use and interpretation of logistic regression, odds and probabilities, and true/false positives/negatives. The examples are coded in R. ROC curves and AUC have important limitations, and I encourage reading through the section at the end of the article to get a sense of when and why the tools can be of limited use.

## The Intuition Behind Confidence Intervals

Say it with me: An X% confidence interval captures the population parameter in X% of repeated samples.

In the course of our statistical educations, many of us had that line (or some variant of it) crammed, wedged, stuffed, and shoved into our skulls until definitional precision was leaking out of our noses and pooling on our upper lips like prop blood.

Or, at least, I felt that way.
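The repeated-sampling definition above can be checked with a short simulation. This is only a sketch under assumed settings: the normal population, its mean and standard deviation, the sample size, the number of repetitions, and the seed are all arbitrary choices made here for illustration.

```r
set.seed(1)      # arbitrary seed, for reproducibility

mu <- 10         # assumed population mean (the parameter to capture)
sigma <- 2       # assumed population standard deviation
n <- 30          # size of each sample
reps <- 5000     # number of repeated samples

# For each repeated sample, build a 95% t-interval for the mean
# and record whether it contains the true mean
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

# Proportion of intervals that captured mu; expected to be close to 0.95
mean(covered)
```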