In this article we demonstrate how to create new variables, recode existing variables, and label variables and their values. The examples use census.dta, a dataset that is included with Stata.
generate: create variables
Here we use the generate command to create a new variable representing the population younger than 18 years old. We do so by summing two existing variables: poplt5 (population under 5 years old) and pop5_17 (population 5 to 17 years old).
* Load data census.dta
sysuse census.dta
* See the information of census.dta
describe
Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
---------------------------------------------------------------------------
Sorted by:
* Create a new variable pop0_17 representing youth population
generate pop0_17 = poplt5 + pop5_17
* Summary statistics for the three variables
summarize poplt5 pop5_17 pop0_17
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      poplt5 |         50    326277.8     331585.1     35998    1708400
     pop5_17 |         50    945951.6     959372.8     91796    4680558
     pop0_17 |         50     1272229      1289731    130745    6388958
* order: reorder variables
order state state2 region pop poplt5 pop0_17
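One detail the example above leaves implicit is the storage type. By default, generate stores new numeric variables as float (the describe output later in this article lists pop0_17 as float), and a float represents integers exactly only up to 16,777,216. The state totals here are below that limit, but for larger counts it can be safer to request long explicitly. The following is a minimal sketch, not part of the original example: the variable name youth_l is hypothetical, and the block works on a fresh copy of the data inside preserve/restore so that variables already in memory are left alone.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear

* Request long (integer) storage instead of the default float
generate long youth_l = poplt5 + pop5_17

* assert stops with an error if the condition fails for any observation
assert youth_l == poplt5 + pop5_17
assert !missing(youth_l)

* Confirm the storage type
describe youth_l
restore
```

If either assert fails, Stata stops and reports how many observations contradicted the condition, which makes it a convenient check to leave in a do-file.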
replace: replace contents of existing variables
Here we rescale the youth population variable to thousands. Rather than creating another variable, we use replace to overwrite the values of the one we just created.
replace pop0_17 = pop0_17/1000
(50 real changes made)
summarize pop0_17
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     pop0_17 |         50    1272.229     1289.731   130.745   6388.958
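Because replace overwrites values in place, the original counts are no longer available once the command has run. When both scales are useful, one alternative is to keep the counts and generate a rescaled copy. This is a minimal sketch under that assumption; the names youth and youth_k are hypothetical, and the block runs on a fresh copy of the data inside preserve/restore.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear

generate long youth = poplt5 + pop5_17

* Keep the counts and store the rescaled values in a new variable
generate double youth_k = youth / 1000
label variable youth_k "Pop, < 18 years (thousands)"

* Both scales are now available side by side
summarize youth youth_k
restore
```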
Recode variables
Say we want to break pop (total population) into three categories. First we use the tabulate command to see the frequencies of this variable.
tabulate pop
Population | Freq. Percent Cum.
------------+-----------------------------------
401,851 | 1 2.00 2.00
469,557 | 1 2.00 4.00
511,456 | 1 2.00 6.00
594,338 | 1 2.00 8.00
652,717 | 1 2.00 10.00
690,768 | 1 2.00 12.00
786,690 | 1 2.00 14.00
800,493 | 1 2.00 16.00
920,610 | 1 2.00 18.00
943,935 | 1 2.00 20.00
947,154 | 1 2.00 22.00
964,691 | 1 2.00 24.00
1124660 | 1 2.00 26.00
1302894 | 1 2.00 28.00
1461037 | 1 2.00 30.00
1569825 | 1 2.00 32.00
1949644 | 1 2.00 34.00
2286435 | 1 2.00 36.00
2363679 | 1 2.00 38.00
2520638 | 1 2.00 40.00
2633105 | 1 2.00 42.00
2718215 | 1 2.00 44.00
2889964 | 1 2.00 46.00
2913808 | 1 2.00 48.00
3025290 | 1 2.00 50.00
3107576 | 1 2.00 52.00
3121820 | 1 2.00 54.00
3660777 | 1 2.00 56.00
3893888 | 1 2.00 58.00
4075970 | 1 2.00 60.00
4132156 | 1 2.00 62.00
4205900 | 1 2.00 64.00
4216975 | 1 2.00 66.00
4591120 | 1 2.00 68.00
4705767 | 1 2.00 70.00
4916686 | 1 2.00 72.00
5346818 | 1 2.00 74.00
5463105 | 1 2.00 76.00
5490224 | 1 2.00 78.00
5737037 | 1 2.00 80.00
5881766 | 1 2.00 82.00
7364823 | 1 2.00 84.00
9262078 | 1 2.00 86.00
9746324 | 1 2.00 88.00
1.08e+07 | 1 2.00 90.00
1.14e+07 | 1 2.00 92.00
1.19e+07 | 1 2.00 94.00
1.42e+07 | 1 2.00 96.00
1.76e+07 | 1 2.00 98.00
2.37e+07 | 1 2.00 100.00
------------+-----------------------------------
Total | 50 100.00
Then we create a new variable called pop_c that groups the values of pop into three categories: up to 2,000,000; 2,000,001 to 4,800,000; and above 4,800,000. The original variable pop is left unchanged.
generate pop_c = .
(50 missing values generated)
replace pop_c = 1 if (pop <= 2000000)
(17 real changes made)
replace pop_c = 2 if (pop >= 2000001) & (pop <= 4800000)
(18 real changes made)
replace pop_c = 3 if (pop >= 4800001)
(15 real changes made)
* See if our recoding worked correctly
tabulate pop pop_c
| pop_c
Population | 1 2 3 | Total
-----------+---------------------------------+----------
401,851 | 1 0 0 | 1
469,557 | 1 0 0 | 1
511,456 | 1 0 0 | 1
594,338 | 1 0 0 | 1
652,717 | 1 0 0 | 1
690,768 | 1 0 0 | 1
786,690 | 1 0 0 | 1
800,493 | 1 0 0 | 1
920,610 | 1 0 0 | 1
943,935 | 1 0 0 | 1
947,154 | 1 0 0 | 1
964,691 | 1 0 0 | 1
1124660 | 1 0 0 | 1
1302894 | 1 0 0 | 1
1461037 | 1 0 0 | 1
1569825 | 1 0 0 | 1
1949644 | 1 0 0 | 1
2286435 | 0 1 0 | 1
2363679 | 0 1 0 | 1
2520638 | 0 1 0 | 1
2633105 | 0 1 0 | 1
2718215 | 0 1 0 | 1
2889964 | 0 1 0 | 1
2913808 | 0 1 0 | 1
3025290 | 0 1 0 | 1
3107576 | 0 1 0 | 1
3121820 | 0 1 0 | 1
3660777 | 0 1 0 | 1
3893888 | 0 1 0 | 1
4075970 | 0 1 0 | 1
4132156 | 0 1 0 | 1
4205900 | 0 1 0 | 1
4216975 | 0 1 0 | 1
4591120 | 0 1 0 | 1
4705767 | 0 1 0 | 1
4916686 | 0 0 1 | 1
5346818 | 0 0 1 | 1
5463105 | 0 0 1 | 1
5490224 | 0 0 1 | 1
5737037 | 0 0 1 | 1
5881766 | 0 0 1 | 1
7364823 | 0 0 1 | 1
9262078 | 0 0 1 | 1
9746324 | 0 0 1 | 1
1.08e+07 | 0 0 1 | 1
1.14e+07 | 0 0 1 | 1
1.19e+07 | 0 0 1 | 1
1.42e+07 | 0 0 1 | 1
1.76e+07 | 0 0 1 | 1
2.37e+07 | 0 0 1 | 1
-----------+---------------------------------+----------
Total | 17 18 15 | 50
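The cross-tabulation confirms the recoding, but two details are worth noting. First, Stata treats numeric missing values as larger than any number, so a condition such as pop >= 4800001 would also be true for a missing pop. census.dta has no missing populations, so the result above is unaffected. Second, a per-category summary is often a more compact check than a 50-row table. The sketch below is one way to handle both; the variable name pop_cat is hypothetical, and the block runs on a fresh copy of the data inside preserve/restore.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear

generate byte pop_cat = .
replace pop_cat = 1 if pop <= 2000000
replace pop_cat = 2 if inrange(pop, 2000001, 4800000)
* Exclude missing values explicitly: missing compares as larger than any number
replace pop_cat = 3 if pop >= 4800001 & !missing(pop)

* Compact check: count, minimum, and maximum of pop within each category
tabstat pop, by(pop_cat) statistics(n min max)

* Every observation should have received a category
assert !missing(pop_cat)
restore
```

The minimum and maximum reported for each category should fall inside the intended cut points, which verifies the recoding without listing every state.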
We can use the recode command to recode variables as well. Here we create another new variable, pop_c2, as a copy of pop and then recode it using the same cut points we used for pop_c.
generate pop_c2 = pop
recode pop_c2 (min/2000000=1) (2000001/4800000=2) (4800001/max=3)
(pop_c2: 50 changes made)
* Summary statistics for the two recoded variables
summarize pop_c pop_c2
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       pop_c |         50        1.96     .8071113         1          3
      pop_c2 |         50        1.96     .8071113         1          3
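recode can also leave the source variable untouched and attach value labels in the same step through its generate() option, which avoids copying the variable first. The sketch below assumes the same cut points as above; the variable name pop_c3 is hypothetical and not part of the original example, and the block runs on a fresh copy of the data inside preserve/restore.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear

* generate() writes the result to a new variable;
* the text after each new value becomes its value label
recode pop (min/2000000 = 1 "low") (2000001/4800000 = 2 "medium") (4800001/max = 3 "high"), generate(pop_c3)

* pop is unchanged; pop_c3 holds the labeled categories
tabulate pop_c3
restore
```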
If you are not happy with the name of the total population variable, you can change it with the rename command. Here we rename pop to pop_t.
rename pop pop_t
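In recent versions of Stata, rename can also handle several variables at once by pairing a list of old names with a list of new names. This is a small sketch on a fresh copy of the data; the name med_age is hypothetical and chosen only for illustration.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear

* Rename two variables in one command: pop -> pop_t, medage -> med_age
rename (pop medage) (pop_t med_age)

* confirm variable stops with an error if a name does not exist
confirm variable pop_t med_age
restore
```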
Label variables and values
Now that we have some new variables created or recoded from original variables, we should label them so we know what the new levels represent. This is good practice even if you are the only person using the dataset. The labels can serve as basic "documentation" of the dataset.
* See which variables need to be labeled
describe
Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            16                          6 Apr 2014 15:43
 size:         3,500
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g
pop_c2          float   %9.0g
---------------------------------------------------------------------------
Sorted by:
* Label variable
label variable pop0_17 "Pop, < 18 years"
label variable pop_c "Categorized population"
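To confirm that a label was stored, describe can be restricted to specific variables, and the label text can be read back with an extended macro function. This is a minimal sketch that rebuilds the youth variable on a fresh copy of the data under the hypothetical name youth; it runs inside preserve/restore so the variables labeled above are left alone.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear
generate long youth = poplt5 + pop5_17
label variable youth "Pop, < 18 years"

* Show only this variable's entry
describe youth

* Read the label back into a local macro and print it
local lbl : variable label youth
display "`lbl'"
restore
```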
* Remember we categorized pop_c into three categories: 1, 2, and 3
table pop_c
----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
        1 |         17
        2 |         18
        3 |         15
----------------------
Let's label them as low, medium and high.
* Label values
* First we define those labels
label define popcl 1 "low" 2 "medium" 3 "high"
* Then we attach the value label popcl to the variable pop_c
label values pop_c popcl
* Now the three categories are presented as low, medium and high
table pop_c
----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
      low |         17
   medium |         18
     high |         15
----------------------
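Value labels change only how the categories are displayed; the stored values are still 1, 2, and 3. The sketch below shows a few ways to inspect that. It rebuilds the categories with recode on a fresh copy of the data inside preserve/restore; the variable name pop_cat is hypothetical, while popcl matches the label name used above.

```stata
* Work on a fresh copy; restore brings back the data currently in memory
preserve
sysuse census, clear
recode pop (min/2000000 = 1) (2000001/4800000 = 2) (4800001/max = 3), generate(pop_cat)
label define popcl 1 "low" 2 "medium" 3 "high"
label values pop_cat popcl

* List the value-to-text mapping stored in popcl
label list popcl

* Show the underlying numeric codes instead of the labels
tabulate pop_cat, nolabel

* Conditions still use the numeric codes, not the label text
count if pop_cat == 3
restore
```

Keeping this distinction in mind matters when writing if conditions: pop_cat == "high" is a type mismatch error, whereas pop_cat == 3 selects the high category.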
* Remove the duplicated variable pop_c2
drop pop_c2
* You can also label the dataset
label data "1980 Census data by state: v2"
* See the information of the dataset
describe
Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state: v2
 vars:            15                          6 Apr 2014 15:43
 size:         3,300
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g                 Pop, < 18 years
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g      popcl      Categorized population
---------------------------------------------------------------------------
Sorted by:
Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library
October 14, 2016
Updated May 23, 2023
For questions or clarifications regarding this article, contact statlab@virginia.edu.