When performing data analysis, we often need to “reshape” our data from wide format to long format. A common example of wide data is a data structure with one record per subject and multiple columns for repeated measures. For example:
id gender race trt day1 day2 day3
1 F 0 0 19.81310 18.05777 14.84996
2 M 0 0 17.91846 18.75825 15.30547
3 M 0 0 17.22526 19.79218 15.10622
Notice that columns day1, day2 and day3 represent repeated measures for each person.
An alternative way to layout these data is listing one record per subject per day. In this case we have one column indicating day of measure and one column for the measure itself, like this:
id gender race trt day amt
1 F 0 0 day1 19.81310
1 F 0 0 day2 18.05777
1 F 0 0 day3 14.84996
2 M 0 0 day1 17.91846
2 M 0 0 day2 18.75825
2 M 0 0 day3 15.30547
3 M 0 0 day1 17.22526
3 M 0 0 day2 19.79218
3 M 0 0 day3 15.10622
An advantage to having data in long format is that graphing and statistical modeling are frequently made easier. For example, the long format allows us to include a variable for day in our model. In the wide format, there is no explicit variable for day. It’s actually embedded in the column headers of day1, day2 and day3. Storing data in long format also allows us to quickly subset our data without dropping columns. We could subset the long data above to easily see all amounts on day 2 for males with race = 0 and trt = 1. To do the same with the wide data would mean dropping the day1 and day2 columns after subsetting.
How do you reshape wide data to long? Below we give basic demonstrations using R, SAS, SPSS and Stata to perform the reshaping demonstrated above. In each case we assume you’re starting with a csv file called dat.csv. Feel free to download the file and try the code below. It’s identical to the wide format data displayed above.
R
While base R has the reshape()
function, we think it’s easier to use the pivot_longer()
function from the {tidyr} package. The first argument takes the data frame we want to reshape. The next argument, cols, lists the columns we wish to reshape. Notice we can indicate a span of columns using the colon (:) operator. The names_to argument says that column names (day1, day2, day3) will be placed in a column called “day”. The values_to argument says the values in the day1, day2, and day3 columns will be placed in a single column called “amt”.
dat <- read.csv("dat.csv")
install.packages("tidyr") # only need to do once
library(tidyr)
datL <- pivot_longer(dat, cols = day1:day3,
names_to = "day", values_to = "amt")
SAS
One way to reshape data in SAS is using PROC TRANSPOSE. List the columns that need to be reshaped in the var statement. Next in the by statement list the columns in the wide data that should remain in the long data. The out=
in the PROC TRANSPOSE statement creates a new data set called datLong. The (rename=(col1=Measurement _name_=day))
option renames the new column headers in the long data set. Otherwise SAS provides the default columns names “col1” and “name”.
proc import datafile="C:\Users\mst3k\Documents\dat.csv" out=dat replace;
getnames=yes;
run;
proc transpose data=dat
out=datLong (rename=(col1=Measurement _name_=day));
var day1-day3;
by id gender race trt;
run;
SPSS
In SPSS it’s possible to reshape your data using the Restructure Data Wizard (Data…Restructure…). However we find it easier to simply use syntax. Once you read in the CSV file, open up a new syntax window (File…New…Syntax), type the code below, and then highlight the code and click Ctrl + R to run it. The VARSTOCASES example below (read as “vars to cases”) uses three subcommands: /MAKE, /INDEX and /KEEP. The /MAKE line says to create a column called “amt” from the columns day1 - day3. The /INDEX line says name the new column “day”, which will contain the column headers from the wide data set. The /KEEP line lists which variables from the wide data set are to remain as-is in the long data set.
VARSTOCASES
/MAKE amt FROM day1 day2 day3
/INDEX=day(amt)
/KEEP=id gender race trt.
Stata
Stata offers the reshape command for restructuring data. To reshape a wide data set long, you have to specify reshape long
. After that you specify the word kernel that the multiple columns we want to reshape have in common. In our case that’s “day” (day1, day2, day3). The Stata reshape command apparently relies on this naming convention. In our example this works fine. However Stata uses that common word kernel as the name of the new column containing the gathered values. So we can’t just name our column containing the day number “day”. Stata throws an error. Therefore below we assign the day values to a column called “time”. Use the i()
option to supply the variable that uniquely identifies records in the wide data set and use the j()
option to supply the name of the variable that will store the numbers from the day1, day2, and day3 column headers. Afterward we rename our two new columns to match the output from the other examples.
import delimited dat.csv, clear
reshape long day, i(id) j(time)
rename day amt
rename time day
Clay Ford
Statistical Research Consultant
University of Virginia Library
July 21, 2014
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.