Stata
What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.
There are times we need to do some repetitive tasks in the process of data preparation, analysis, or presentation. For instance, we may need to compute a set of variables in the same manner, rename or create a series of variables, or repetitively recode values of a number of variables. In this post, we show a few simple example "loops" using the Stata commands foreach, local, and forvalues to handle some common repetitive tasks.
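As a minimal sketch of what such loops can look like, the lines below use the census data that ships with Stata. The choice of variables and the new names ending in _share are our own illustration, not necessarily what the full post uses.

```stata
sysuse census, clear

* local: store a list of variable names in a macro
local agevars poplt5 pop5_17 pop18p

* foreach: apply the same computation to each variable in the list
foreach v of local agevars {
    generate `v'_share = `v' / pop
}

* forvalues: loop over a range of consecutive integers
forvalues i = 1/3 {
    display "pass number `i'"
}
```

Keeping the variable list in a local macro means it only has to be changed in one place if the set of variables grows.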
In this post, we demonstrate how to convert datasets between wide form and long form. This is also known as "reshaping data". Reshaping is often needed when you work with datasets that contain variables with some kind of sequence, say, time-series data. It is fairly easy to transform data between wide and long forms in Stata using the reshape command; however, you'll want to be careful when you do so to avoid mistakes in the transformation. First, let's see how the wide and long forms look.
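The two layouts can be sketched with a small made-up dataset. The variable names (id, inc2019 to inc2021) and the values are invented purely for illustration.

```stata
clear
* wide form: one row per id, one column per year
input id inc2019 inc2020 inc2021
1 50 52 55
2 40 41 43
end
list

* wide -> long: one row per id-year combination
reshape long inc, i(id) j(year)
list

* long -> wide: back to one row per id
reshape wide inc, i(id) j(year)
list
```

Here i() names the variable that identifies each unit and j() names the variable that holds the suffix (the year) in long form.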
When we first start working with data, usually in a statistics class, we mostly use clean and complete datasets as examples. Later on, we realize data is not always clean or complete when doing research or data analysis for other purposes. In reality, we often need to put two or more datasets together before we can begin whatever statistical analysis we would like to perform. In this post, we demonstrate how to combine datasets using append and merge, which combine datasets row-wise and column-wise, respectively.
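A minimal sketch of both operations, using small made-up datasets held in temporary files. The variables id, score, and age and all values are assumptions for illustration only.

```stata
* a dataset with extra columns, to be merged in later
clear
input id age
1 30
2 41
3 25
4 37
end
tempfile ages
save `ages'

* a dataset with extra rows, to be appended
clear
input id score
3 70
4 85
end
tempfile second
save `second'

* the starting dataset
clear
input id score
1 90
2 60
end

* append: stack the rows of the second dataset under the first
append using `second'

* merge: add the columns of the ages dataset, matching on id
merge 1:1 id using `ages'
list
```

After the merge, Stata adds a variable named _merge that records whether each observation was found in the master data, the using data, or both, which is worth checking before continuing.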
Sometimes only parts of a dataset mean something to you. In this post, we show you how to subset a dataset in Stata by variables or by observations. We use the census.dta dataset installed with Stata as the sample data.
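As a rough sketch, subsetting usually comes down to keep and drop, applied either to variables or to observations. The particular variables and cutoffs below are arbitrary choices for illustration.

```stata
sysuse census, clear

* subset by variables: keep only the listed columns
keep state region pop medage

* drop a variable that is no longer needed
drop region

* subset by observations: keep rows that satisfy a condition
keep if medage > 30

* or drop rows that satisfy a condition
drop if pop < 1000000

count
```

Both commands change the data in memory only; the file on disk is untouched unless the result is saved.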
In this article we demonstrate how to create new variables, recode existing variables, and label variables and values of variables. We work with the census.dta data that is included with Stata to provide examples.
generate: create variables
Here we use the generate command to create a new variable representing the population younger than 18 years old. We do so by summing up the two existing variables: poplt5 (population < 5 years old) and pop5_17 (population of 5 to 17 years old).
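A minimal sketch of this step follows. The new variable name pop0_17 and its label are our own choices; the excerpt does not say what the variable is called.

```stata
sysuse census, clear

* population younger than 18 = under 5 plus 5 to 17
generate pop0_17 = poplt5 + pop5_17
label variable pop0_17 "Population, < 18 years old"

summarize poplt5 pop5_17 pop0_17
```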
In Stata, the first step of analyzing a dataset is opening the data in Stata so that it knows which file you are working with. Yes, you can simply double click on a Stata data file that ends in .dta to open it, but we prefer to write syntax so we can easily reproduce the same work or use the scripts again when working on similar tasks. In this post, we introduce methods of reading in, using, and saving Stata and other formats of data files.
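The basic commands can be sketched as follows. The file names census_copy.dta and census_copy.csv are hypothetical and are written to the current working directory.

```stata
* load an example dataset that ships with Stata
sysuse census, clear

* save the data in memory as a Stata file (hypothetical file name)
save "census_copy.dta", replace

* read a Stata data file back in
use "census_copy.dta", clear

* write and read a comma-separated text file
export delimited using "census_copy.csv", replace
import delimited "census_copy.csv", clear
```

The clear option discards whatever data is currently in memory, and replace allows an existing file to be overwritten, so both are worth using deliberately.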
Cronbach's alpha is a measure used to assess the reliability, or internal consistency, of a set of scale or test items. In other words, the reliability of any given measurement refers to the extent to which it is a consistent measure of a concept, and Cronbach’s alpha is one way of measuring the strength of that consistency.
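For reference, the usual textbook form of the statistic is given below. The symbols are our notation, not taken from the post: k is the number of items, \sigma_i^2 is the variance of item i, and \sigma_T^2 is the variance of the total score obtained by summing the items.

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_T^{2}}\right)
```

When the items move together, the variance of the total is large relative to the sum of the item variances, and alpha moves toward 1.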
When we think of regression, we usually think of linear regression, the tried and true method for estimating a mean of some variable conditional on the levels or values of independent variables. In other words, we're pretty sure the mean of our variable of interest differs depending on other variables. For example, the mean weight of 1st-year UVA males is some unknown value. But we could in theory take a random sample and discover there is a relationship between weight and height.
An important component of data analysis is graphing. Stata provides excellent graphics facilities for quickly exploring and visualizing your data. For example, let's load the auto data set that comes with Stata (1978 Automobile Data) and make two scatterplots and then two boxplots:
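A minimal sketch of such plots is shown here. The choice of variables (mpg, weight, price, foreign) and the graph names are ours and may differ from those in the full post.

```stata
sysuse auto, clear

* two scatterplots
scatter mpg weight, name(scatter1, replace)
scatter price mpg, name(scatter2, replace)

* two boxplots, split by domestic versus foreign cars
graph box mpg, over(foreign) name(box1, replace)
graph box price, over(foreign) name(box2, replace)
```

The name() option keeps each graph in memory under its own name, so a later graph does not replace an earlier one.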
When performing data analysis, we often need to "reshape" our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example: