StatLab Articles

Getting Started with pandas in Python

The pandas package is an open-source software library written for data analysis in Python. Pandas allows users to import data from various file formats (comma-separated values, JSON, SQL, fits, etc.) and perform data manipulation operations, including cleaning and reshaping the data, summarizing observations, grouping data, and merging multiple datasets. In this article, we’ll explore briefly some of the most commonly used functions and methods for understanding, formatting, and visualizing data with the pandas package.

Python, data wrangling, pandas, matplotlib, Hannah Lewis

Understanding Multiple Comparisons and Simultaneous Inference

When it comes to confidence intervals and hypothesis testing there are two important limitations to keep in mind.

The significance level,¹ \(\alpha\), or the confidence interval coverage, \(1 - \alpha\),

only apply to one test or estimate, not to a series of tests or estimates.
are only appropriate if the estimate or test was not suggested by the data.

Let’s illustrate both of these limitations via simulation using R.

R, simulation, statistical methods, multiple comparisons, multcomp, Clay Ford

Data Scientist as Cartographer: An Introduction to Making Interactive Maps in R with Leaflet

Note: This version of the article contains static images of maps generated with Leaflet. You can view a version with interactive maps here.

R, visualization, GIS, leaflet, Jacob Goldstein-Greenwood

Understanding Robust Standard Errors

What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.

R, Stata, linear regression, simulation, statistical methods, Clay Ford

Getting Started with Multinomial Logit Models

Multinomial logit models allow us to model membership in a group based on known variables. For example, the operating system preferences of a university’s students could be classified as “Windows,” “Mac,” or “Linux.” Perhaps we would like to better understand why students choose one OS versus another. We might want to build a statistical model that allows us to predict the probability of selecting an OS based on information such as sex, major, financial aid, and so on. Multinomial logit modeling allows us to propose and fit such models.

R, effect plots, statistical methods, multinomial logistic regression, baseline logit models, Clay Ford

Understanding Empirical Cumulative Distribution Functions

What are empirical cumulative distribution functions and what can we do with them? To answer the first question, let’s first step back and make sure we understand "distributions", or more specifically, "probability distributions".

A Basic Probability Distribution

Imagine a simple event, say flipping a coin 3 times. Here are all the possible outcomes, where H = head and T = tails:

R, statistical methods, ECDF, qqplot, Clay Ford

Getting Started with Rate Models

Let’s say we’re interested in modeling the number of auto accidents that occur at various intersections within a city. Upon collecting data after a certain period of time, perhaps we notice two intersections have the same number of accidents, say 25. Is it correct to conclude these two intersections are similar in their propensity for auto accidents?

R, effect plots, statistical methods, rate models, count regression, Clay Ford

Getting Started with Regular Expressions

Regular expressions (or regex) are tools for matching patterns in character strings. These can be useful for finding words or letter patterns in text, parsing filenames for specific information, and interpreting input formatted in a variety of ways (e.g., phone numbers). The syntax of regular expressions is generally recognized across operating systems and programming languages. Although this tutorial will use R for examples, most of the same regex could be used in unix commands or in Python.

R, data wrangling, regular expressions, Margot Bjoring

Getting Started with Shiny

Shiny is an R package that facilitates the creation of interactive web apps using R code, which can be hosted locally, on the shinyapps server, or on your own server. Shiny apps can range from extremely simple to incredibly sophisticated. They can be written purely with R code or supplemented with HTML, CSS, or JavaScript. Visit R studio’s shiny gallery to view some examples.

R, visualization, Shiny, Laura White

Databases for Data Scientists

As data scientists, we’re often most excited about the final layer of analysis. Once all the data are cleaned and stored in a format readable by our favorite programming language (Python, R, STATA, etc.), the most fun part of our work is when we’re finding counter-intuitive causations with statistical methods. If you can prove that the mutual presence of McDonalds really does prevent wars between countries or that an increase in diversity really does boost business profitability, that is a good day.

data wrangling, SQL, Srikar Gullapalli

Research Data Services

Want updates in your inbox? Subscribe to our monthly Research Data Services Newsletter!