StatLab Articles

Look People Are Going to Think... (Debate Rhetoric Redux)

I'm still looking at the rhetoric from the presidential debates, this time focusing on the first general election debate between Hillary Clinton and Donald Trump.

quanteda, R, text analysis, text mining, visualization, Michele Claibourn

Using a Census Data API with R

Data sets provided by the US Census Bureau, such as the Decennial Census and American Community Survey (ACS), are widely used by researchers, among others. You can certainly find and download census data from the Census Bureau website, from the licensed data source Social Explorer, or from other free sources such as IPUMS-USA and then load the data into a statistical package or other software to analyze or present the data.

R, data wrangling, visualization, Yun Tai

Debate Prep!

I'm teaching a Text as Data short course (using R) right now, and as a card-carrying political scientist, I couldn't resist using the ongoing campaign as an example (this was, in part, a way of handling my own anxiety about last Monday's debate---this is what I was doing while watching). So here goes...

quanteda, R, text analysis, text mining, Michele Claibourn

Getting Started with Exploratory Factor Analysis

Take a look at the following correlation matrix for Olympic decathlon data calculated from 280 scores from 1960 through 2004 (Johnson & Wichern, 2007, p. 499):

R, statistical methods, factor analysis, Clay Ford

An Introduction to Loglinear Models

Loglinear models model cell counts in contingency tables. They're a little different from other modeling methods in that they don't distinguish between response and explanatory variables. All variables in a loglinear model are essentially "responses."

To learn more about loglinear models, we'll explore the following data from Agresti (1996, Table 6.3). It summarizes responses from a survey that asked high school seniors in a particular city whether they had ever used alcohol, cigarettes, or marijuana.

R, statistical methods, loglinear models, Clay Ford

Setting up Color Palettes in R

Plotting with color in R is kind of like painting a room in your house: You have to pick some colors. R has some default colors ready to go, but it's only natural to want to play around and try some different combinations. In this article, we'll look at some ways you can define new color palettes for plotting in R.

To begin, let's use the palette() function to see what colors are currently available:

R, visualization, Clay Ford

Getting Started with Hurdle Models

Hurdle Models are a class of models for count data that help handle excess zeros and overdispersion. To motivate their use, let's look at some data in R. The following data come with the AER package. It is a sample of 4,406 individuals, aged 66 and over, who were covered by Medicare in 1988. One of the variables the data provide is number of physician office visits.

statistical methods, visualization, hurdle models, negative binomial regression, poisson regression, rootograms, R, Clay Ford

Hierarchical Linear Regression

Note: This post is not about hierarchical linear modeling (HLM; multilevel modeling). Hierarchical regression is model comparison of nested regression models.

R, linear regression, statistical methods, hierarchical regression, model comparison, Bommae Kim

Getting Started with Negative Binomial Regression Modeling

When it comes to modeling counts (i.e., whole numbers greater than or equal to 0), we often start with Poisson regression. This is a generalized linear model where a response is assumed to have a Poisson distribution conditional on a weighted sum of predictors. For example, we might model the number of documented concussions to NFL quarterbacks as a function of snaps played and the total years experience of his offensive line. However, one potential drawback of Poisson regression is that it may not accurately describe the variability of the counts.

R, statistical methods, visualization, count regression, negative binomial regression, poisson regression, rootograms, Clay Ford

Visualizing the Effects of Logistic Regression

Logistic regression is a popular and effective way of modeling a binary response. For example, we might wonder what influences a person to volunteer, or not volunteer, for psychological research. Some do, some don’t. Are there independent variables that would help explain or distinguish between those who volunteer and those who don’t? Logistic regression gives us a mathematical model that we can we use to estimate the probability of someone volunteering given certain independent variables.

R, effect plots, logistic regression, visualization, interactions, Clay Ford