Stata Basics: Create, Recode and Label Variables

In this article we demonstrate how to create new variables, recode existing variables, and label variables and values of variables. We work with the census.dta data that is included with Stata to provide examples.

generate: create variables

Here we use the generate command to create a new variable representing the population younger than 18 years old. We do so by summing two existing variables: poplt5 (population under 5 years old) and pop5_17 (population 5 to 17 years old).


* Load data census.dta 
sysuse census.dta

* Describe the contents of census.dta
describe


Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            13                          6 Apr 2014 15:43
 size:         2,900                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop             long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
-----------------------------------------------------------------------------
Sorted by: 


* Create a new variable pop0_17 representing youth population 
generate pop0_17 = poplt5 + pop5_17

* Summary statistics for the three variables
summarize poplt5 pop5_17 pop0_17


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      poplt5 |         50    326277.8    331585.1      35998    1708400
     pop5_17 |         50    945951.6    959372.8      91796    4680558
     pop0_17 |         50     1272229     1289731     130745    6388958


* order: reorder variables
order state state2 region pop poplt5 pop0_17 
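
As a further illustration of generate, the sketch below names a storage type explicitly and builds a derived ratio. By default generate stores new variables as float, which holds integers exactly only up to 16,777,216. The youth totals here stay below that limit, so the default works, but asking for long (or double for ratios) is a safer habit with large counts. The variable name youth_share is hypothetical and not part of the original article.

```stata
* Sketch only: youth_share is a hypothetical variable name
sysuse census, clear

* Store the sum as a long integer instead of the default float
generate long pop0_17 = poplt5 + pop5_17

* Share of the total population that is younger than 18
generate double youth_share = pop0_17 / pop

summarize youth_share
```

Because generate refuses to overwrite an existing variable, running this twice in the same session without reloading the data would stop with an error.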

replace: replace contents of existing variables

Here we use the replace command to rescale the youth population variable to thousands, overwriting the values we just created.


replace pop0_17 = pop0_17/1000


(50 real changes made)

summarize pop0_17 

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     pop0_17 |         50    1272.229    1289.731    130.745   6388.958
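
Since replace overwrites values in place, one alternative is to keep the original counts and store the rescaled values in a separate variable. The following is a minimal sketch of that approach. The name pop0_17k and its label are assumptions made for illustration.

```stata
* Sketch only: pop0_17k is a hypothetical variable name
sysuse census, clear
generate pop0_17 = poplt5 + pop5_17

* Keep the original counts and add a copy expressed in thousands
generate pop0_17k = pop0_17 / 1000
label variable pop0_17k "Pop, < 18 years (thousands)"

* Compare the two versions for the first five states
list state pop0_17 pop0_17k in 1/5
```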


Recode variables

Say we want to break pop (total population) into three categories. First we use the tabulate command to see the frequencies of this variable.


tabulate pop


 Population |      Freq.     Percent        Cum.
------------+-----------------------------------
    401,851 |          1        2.00        2.00
    469,557 |          1        2.00        4.00
    511,456 |          1        2.00        6.00
    594,338 |          1        2.00        8.00
    652,717 |          1        2.00       10.00
    690,768 |          1        2.00       12.00
    786,690 |          1        2.00       14.00
    800,493 |          1        2.00       16.00
    920,610 |          1        2.00       18.00
    943,935 |          1        2.00       20.00
    947,154 |          1        2.00       22.00
    964,691 |          1        2.00       24.00
    1124660 |          1        2.00       26.00
    1302894 |          1        2.00       28.00
    1461037 |          1        2.00       30.00
    1569825 |          1        2.00       32.00
    1949644 |          1        2.00       34.00
    2286435 |          1        2.00       36.00
    2363679 |          1        2.00       38.00
    2520638 |          1        2.00       40.00
    2633105 |          1        2.00       42.00
    2718215 |          1        2.00       44.00
    2889964 |          1        2.00       46.00
    2913808 |          1        2.00       48.00
    3025290 |          1        2.00       50.00
    3107576 |          1        2.00       52.00
    3121820 |          1        2.00       54.00
    3660777 |          1        2.00       56.00
    3893888 |          1        2.00       58.00
    4075970 |          1        2.00       60.00
    4132156 |          1        2.00       62.00
    4205900 |          1        2.00       64.00
    4216975 |          1        2.00       66.00
    4591120 |          1        2.00       68.00
    4705767 |          1        2.00       70.00
    4916686 |          1        2.00       72.00
    5346818 |          1        2.00       74.00
    5463105 |          1        2.00       76.00
    5490224 |          1        2.00       78.00
    5737037 |          1        2.00       80.00
    5881766 |          1        2.00       82.00
    7364823 |          1        2.00       84.00
    9262078 |          1        2.00       86.00
    9746324 |          1        2.00       88.00
   1.08e+07 |          1        2.00       90.00
   1.14e+07 |          1        2.00       92.00
   1.19e+07 |          1        2.00       94.00
   1.42e+07 |          1        2.00       96.00
   1.76e+07 |          1        2.00       98.00
   2.37e+07 |          1        2.00      100.00
------------+-----------------------------------
      Total |         50      100.00

Then we create a new variable called pop_c and assign each state to one of three categories based on pop: 2,000,000 or fewer, 2,000,001 to 4,800,000, and more than 4,800,000.


generate pop_c = .


(50 missing values generated)


replace pop_c = 1 if (pop <= 2000000) 


(17 real changes made) 


replace pop_c = 2 if (pop >= 2000001) & (pop <= 4800000) 


(18 real changes made)


replace pop_c = 3 if (pop >= 4800001)


(15 real changes made)


* See if our recoding worked correctly
tabulate pop pop_c


           |              pop_c
Population |         1          2          3 |     Total
-----------+---------------------------------+----------
   401,851 |         1          0          0 |         1 
   469,557 |         1          0          0 |         1 
   511,456 |         1          0          0 |         1 
   594,338 |         1          0          0 |         1 
   652,717 |         1          0          0 |         1 
   690,768 |         1          0          0 |         1 
   786,690 |         1          0          0 |         1 
   800,493 |         1          0          0 |         1 
   920,610 |         1          0          0 |         1 
   943,935 |         1          0          0 |         1 
   947,154 |         1          0          0 |         1 
   964,691 |         1          0          0 |         1 
   1124660 |         1          0          0 |         1 
   1302894 |         1          0          0 |         1 
   1461037 |         1          0          0 |         1 
   1569825 |         1          0          0 |         1 
   1949644 |         1          0          0 |         1 
   2286435 |         0          1          0 |         1 
   2363679 |         0          1          0 |         1 
   2520638 |         0          1          0 |         1 
   2633105 |         0          1          0 |         1 
   2718215 |         0          1          0 |         1 
   2889964 |         0          1          0 |         1 
   2913808 |         0          1          0 |         1 
   3025290 |         0          1          0 |         1 
   3107576 |         0          1          0 |         1 
   3121820 |         0          1          0 |         1 
   3660777 |         0          1          0 |         1 
   3893888 |         0          1          0 |         1 
   4075970 |         0          1          0 |         1 
   4132156 |         0          1          0 |         1 
   4205900 |         0          1          0 |         1 
   4216975 |         0          1          0 |         1 
   4591120 |         0          1          0 |         1 
   4705767 |         0          1          0 |         1 
   4916686 |         0          0          1 |         1 
   5346818 |         0          0          1 |         1 
   5463105 |         0          0          1 |         1 
   5490224 |         0          0          1 |         1 
   5737037 |         0          0          1 |         1 
   5881766 |         0          0          1 |         1 
   7364823 |         0          0          1 |         1 
   9262078 |         0          0          1 |         1 
   9746324 |         0          0          1 |         1 
  1.08e+07 |         0          0          1 |         1 
  1.14e+07 |         0          0          1 |         1 
  1.19e+07 |         0          0          1 |         1 
  1.42e+07 |         0          0          1 |         1 
  1.76e+07 |         0          0          1 |         1 
  2.37e+07 |         0          0          1 |         1 
-----------+---------------------------------+----------
     Total |        17         18         15 |        50 
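
One caveat when recoding with replace and if: Stata treats numeric missing values as larger than any number, so a condition such as pop >= 4800001 is also true for observations where pop is missing. The pop variable in census.dta has no missing values, so the results above are unaffected, but the sketch below shows one way to guard the top category in other datasets.

```stata
* Sketch only: same cutoffs as above, with a guard against missing values
sysuse census, clear

generate pop_c = .
replace pop_c = 1 if pop <= 2000000
replace pop_c = 2 if pop > 2000000 & pop <= 4800000

* Without !missing(pop), a missing pop would be coded as category 3
replace pop_c = 3 if pop > 4800000 & !missing(pop)

* Every state with a known population should now have a category
assert !missing(pop_c) if !missing(pop)
tabulate pop_c
```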

We can use the recode command to recode variables as well. Here we create another new variable, pop_c2, as a copy of pop and then recode it using the same cutoffs we used for pop_c.


generate pop_c2 = pop

recode pop_c2 (min/2000000=1) (2000001/4800000=2) (4800001/max=3)


(pop_c2: 50 changes made)


* Summary statistics for the two recoded variables
summarize pop_c pop_c2


    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       pop_c |         50        1.96    .8071113          1          3
      pop_c2 |         50        1.96    .8071113          1          3
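
The recode command can also create the new variable and its value labels in a single step through its generate() and label() options, which avoids copying pop first. This is a sketch under the same cutoffs. The names pop_c3 and popcl3 are hypothetical.

```stata
* Sketch only: pop_c3 and popcl3 are hypothetical names
sysuse census, clear

* generate() leaves pop untouched and writes the categories to pop_c3;
* the quoted text after each new value becomes its value label
recode pop (min/2000000 = 1 "low")        ///
           (2000001/4800000 = 2 "medium") ///
           (4800001/max = 3 "high"),      ///
           generate(pop_c3) label(popcl3)

tabulate pop_c3
```

In recode rules, min and max refer to the smallest and largest nonmissing values, so missing observations are not swept into the top category.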

If you would prefer a different name for the total population variable, you can change it with the rename command. Here we rename pop as pop_t.


rename pop pop_t 

Label variables and values

Now that we have some new variables created or recoded from original variables, we should label them so we know what the new levels represent. This is good practice even if you are the only person using the dataset. The labels can serve as basic "documentation" of the dataset.


* See which variables need to be labeled
describe


Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state
 vars:            16                          6 Apr 2014 15:43
 size:         3,500                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g                 
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g                 
pop_c2          float   %9.0g                 
----------------------------------------------------------------------------
Sorted by: 


* Label variable 
label variable pop0_17 "Pop, < 18 years" 
label variable pop_c "Categorized population"

* Remember we categorized pop_c into three categories: 1, 2 and 3
table pop_c


----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
        1 |         17
        2 |         18
        3 |         15
----------------------

Let's label them as low, medium and high.


* Label values
* First we define those labels
label define popcl 1 "low" 2 "medium" 3 "high"
 
* Then we attach the value label popcl to the variable pop_c
label values pop_c popcl 
 
* Now the three categories are presented as low, medium and high 
table pop_c 


----------------------
Categoriz |
ed        |
populatio |
n         |      Freq.
----------+-----------
      low |         17
   medium |         18
     high |         15
----------------------
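
Value labels change only how the categories are displayed. The underlying values remain 1, 2 and 3. The sketch below shows a few ways to inspect that mapping. For brevity it rebuilds pop_c with recode instead of the three replace statements. The name pop_c_str is hypothetical.

```stata
* Sketch only: pop_c_str is a hypothetical variable name
sysuse census, clear
recode pop (min/2000000 = 1) (2000001/4800000 = 2) (4800001/max = 3), generate(pop_c)
label define popcl 1 "low" 2 "medium" 3 "high"
label values pop_c popcl

* Show the value-to-text mapping stored in popcl
label list popcl

* Tabulate the underlying numeric codes instead of the labels
tabulate pop_c, nolabel

* Create a string variable that holds the label text
decode pop_c, generate(pop_c_str)
```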


* Remove the duplicate variable pop_c2 
drop pop_c2
 
* You can also label the dataset
label data "1980 Census data by state: v2"

* Describe the dataset again to confirm the new labels
describe 


Contains data from /Applications/Stata/ado/base/c/census.dta
  obs:            50                          1980 Census data by state: v2
 vars:            15                          6 Apr 2014 15:43
 size:         3,300                          
---------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------
state           str14   %-14s                 State
state2          str2    %-2s                  Two-letter state abbreviation
region          int     %-8.0g     cenreg     Census region
pop_t           long    %12.0gc               Population
poplt5          long    %12.0gc               Pop, < 5 year
pop0_17         float   %9.0g                 Pop, < 18 years
pop5_17         long    %12.0gc               Pop, 5 to 17 years
pop18p          long    %12.0gc               Pop, 18 and older
pop65p          long    %12.0gc               Pop, 65 and older
popurban        long    %12.0gc               Urban population
medage          float   %9.2f                 Median age
death           long    %12.0gc               Number of deaths
marriage        long    %12.0gc               Number of marriages
divorce         long    %12.0gc               Number of divorces
pop_c           float   %9.0g      popcl      Categorized population
----------------------------------------------------------------------------
Sorted by: 
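
All of the changes above exist only in memory until the dataset is saved. A minimal sketch of writing the labeled data to a new file follows. The filename census_v2 is an assumption chosen for illustration, and the file is written to the current working directory.

```stata
* Sketch only: census_v2 is a hypothetical filename
sysuse census, clear
label data "1980 Census data by state: v2"

* Save under a new name so the copy shipped with Stata is left untouched
save census_v2, replace

* Confirm the saved file carries the new dataset label
describe using census_v2, short
```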


Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library
October 14, 2016
Updated May 23, 2023


For questions or clarifications regarding this article, contact statlab@virginia.edu.
