Data wrangling


"Web scraping," or "data scraping," is simply the process of extracting data from a website. This can, of course, be done manually: You could go to a website, find the relevant data or information, and enter that information into some data file that you have stored locally. But imagine that you want to pull a very large dataset or data from hundreds or thousands of individual URLs. In this case, extracting the data manually sounds overwhelming and time-consuming.

The pandas package is an open-source software library written for data analysis in Python. Pandas allows users to import data from various file formats (comma-separated values, JSON, SQL, and more) and perform data manipulation operations, including cleaning and reshaping the data, summarizing observations, grouping data, and merging multiple datasets. In this article, we'll briefly explore some of the most commonly used functions and methods for understanding, formatting, and visualizing data with the pandas package.

Regular expressions (or regex) are tools for matching patterns in character strings. These can be useful for finding words or letter patterns in text, parsing filenames for specific information, and interpreting input formatted in a variety of ways (e.g., phone numbers). The syntax of regular expressions is generally recognized across operating systems and programming languages. Although this tutorial will use R for examples, most of the same regular expressions could be used in Unix commands or in Python.
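The same ideas carry over to Stata as well. As a minimal sketch (the variable phone and the NNN-NNN-NNNN target format are assumptions; the ustrregexm() and ustrregexs() functions require Stata 14 or later):

* Flag observations whose (hypothetical) phone variable matches NNN-NNN-NNNN
generate byte valid_phone = ustrregexm(phone, "^[0-9]{3}-[0-9]{3}-[0-9]{4}$")

* Extract the area code: the first parenthesized group of the last match
generate area_code = ustrregexs(1) if ustrregexm(phone, "^([0-9]{3})-")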

As data scientists, we’re often most excited about the final layer of analysis. Once all the data are cleaned and stored in a format readable by our favorite programming language (Python, R, Stata, etc.), the most fun part of our work is when we’re uncovering counterintuitive causal relationships with statistical methods. If you can show that the mutual presence of McDonald’s really does prevent wars between countries or that an increase in diversity really does boost business profitability, that is a good day.

When you import or load data into R, the data are stored in random-access memory (RAM). This is the memory that is cleared when you close R or shut down your computer. It’s very fast but temporary. If you save your data, it is written to your hard drive; but when you open R again and load the data, once again it is loaded into RAM. While many newer computers come with lots of RAM (say, 16 GB), it’s not an infinite amount. When you open RStudio, you’re using RAM even if no data is loaded. Open a web browser or any other program, and it too is loaded into RAM.

This post is something I’ve been thinking about writing for a while. I was inspired to write it by my own trials and tribulations, which are still ongoing, while working with the QGIS API, trying to programmatically do stuff in QGIS instead of relying on available widgets and plugins. I have spent, and will probably continue to spend, many hours scouring the internet, and especially Stack Overflow, looking for answers on how to use various classes, methods, attributes, etc.

Recently, I have taken the dive into Python scripting in QGIS. QGIS is a really nice open-source (and free!) alternative to ESRI's ArcGIS. While QGIS is a little quirky and generally not quite as user-friendly as ArcGIS, it still provides nearly the same functionality. Personally, I've become a fan of it and now have even taught a short, 1-credit course in the University of Virginia's Batten School of Public Policy titled GIS for Public Policy.

If you're wondering what exactly the purrr package does, then this blog post is for you.

Sometimes we have data with dates and/or times that we want to manipulate or summarize. A common example in the health sciences is time-in-study. A subject may enter a study on February 12, 2008, and exit on November 4, 2009. How many days was the person in the study? (Don’t forget 2008 was a leap year; February had 29 days.) What was the median time-in-study for all subjects?
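Stata, for example, stores daily dates as integer day counts, so this kind of arithmetic reduces to subtraction. A minimal sketch (entry_date and exit_date are hypothetical variables holding Stata daily dates):

* Days between two literal dates; td() yields Stata daily dates
display td(4nov2009) - td(12feb2008)    // 631 days, counting 29feb2008

* Per-subject time-in-study from the (hypothetical) entry and exit variables
generate days_in_study = exit_date - entry_date
summarize days_in_study, detail    // the median appears as the 50th percentile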

There are times we need to do some repetitive tasks in the process of data preparation, analysis, or presentation. For instance, we may need to compute a set of variables in the same manner, rename or create a series of variables, or repetitively recode values of a number of variables. In this post, we show a few simple example "loops" using the Stata commands foreach, local, and forvalues to handle some common repetitive tasks.
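As a minimal sketch of the foreach and forvalues flavors (the income variables inc1990, inc1995, and inc2000 are hypothetical):

* foreach walks a varlist; `v' is a local macro holding each name in turn
foreach v of varlist inc1990 inc1995 inc2000 {
    replace `v' = . if `v' == -99    // recode a sentinel value to missing
}

* forvalues walks a numeric range; here, 1990 to 2000 in steps of 5
forvalues yr = 1990(5)2000 {
    summarize inc`yr'
}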

In this post, we demonstrate how to convert datasets between wide form and long form, also known as "reshaping" data. Reshaping is often needed when you work with datasets that contain variables with some kind of sequence, say, time-series data. It is fairly easy to transform data between wide and long forms in Stata using the reshape command; however, you'll want to be careful when you do so to avoid mistakes in the transformation. First, let's see how the wide and long forms look.
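As a minimal sketch, suppose a (hypothetical) wide dataset has one row per country, with income held in the variables inc1990, inc1995, and inc2000:

* Wide -> long: one row per country-year; j() names the new sequence variable
reshape long inc, i(country) j(year)

* Long -> wide: back to one row per country
reshape wide inc, i(country) j(year)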

When we first start working with data, usually in a statistics class, we mostly use clean and complete datasets as examples. Later on, when doing research or data analysis for other purposes, we realize data is not always clean or complete. In reality, we often need to put two or more datasets together before beginning whatever statistical analysis we would like to perform. In this post, we demonstrate how to combine datasets using append and merge, which combine datasets row-wise and column-wise, respectively.
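A minimal sketch, with hypothetical filenames and an assumed shared identifier id:

* append: stack the rows of a second file under the data in memory
use visits2020, clear
append using visits2021

* merge: add columns from another file, matching rows on id
merge 1:1 id using demographics

After a merge, Stata records how each observation matched in a _merge variable, which is worth inspecting before moving on.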

Sometimes only parts of a dataset mean something to you. In this post, we show you how to subset a dataset in Stata by variables or by observations. We use the census.dta dataset installed with Stata as the sample data.
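For instance, a minimal sketch (the variables kept and the focus on the South are just for illustration):

sysuse census, clear

* Subset by variables: keep only the columns of interest
keep state region pop medage

* Subset by observations: keep only southern states (in census.dta, region 3 is the South)
keep if region == 3

The drop command works the same way in reverse, removing the named variables or the observations that satisfy the condition.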

In this article we demonstrate how to create new variables, recode existing variables, and label variables and values of variables. We work with the census.dta data that is included with Stata to provide examples.

generate: create variables

Here we use the generate command to create a new variable representing the population younger than 18 years old. We do so by summing two existing variables: poplt5 (population < 5 years old) and pop5_17 (population 5 to 17 years old).
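A sketch of that command (the new variable's name, poplt18, is our own choice):

sysuse census, clear

* Population under 18 = (population < 5) + (population 5 to 17)
generate poplt18 = poplt5 + pop5_17
label variable poplt18 "Population < 18 years old"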

In Stata, the first step of analyzing a dataset is opening the data so that Stata knows which file you are working with. Yes, you can simply double-click on a Stata data file ending in .dta to open it, but we prefer to write syntax so we can easily reproduce the same work or reuse the scripts for similar tasks. In this post, we introduce methods of reading in, using, and saving Stata and other formats of data files.
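A minimal sketch (the file paths are hypothetical):

* Read a Stata dataset, replacing whatever is currently in memory
use "C:\data\survey.dta", clear

* Read a comma-separated file instead
import delimited using "C:\data\survey.csv", clear

* Save the data in memory, overwriting an existing file if present
save "C:\data\survey_clean.dta", replace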

Data.gov catalogs US government data and makes them available on the web; you can find data on a variety of topics, such as agriculture, business, climate, education, energy, finance, public safety, and many more. It is a good starting point for finding data if you don’t already know which particular data source to begin your search with; however, it can still be time-consuming to actually download the raw data you need. Fortunately, Data.gov also includes APIs from across the government, which can help with obtaining raw datasets.

Datasets provided by the US Census Bureau, such as the Decennial Census and the American Community Survey (ACS), are widely used by researchers, among others. You can certainly find and download census data from the Census Bureau website, from the licensed data source Social Explorer, or from other free sources such as IPUMS-USA, and then load the data into a statistical package or other software to analyze or present it.

When performing data analysis, we often need to "reshape" our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example: