Datasets provided by the US Census Bureau, such as the Decennial Census and the American Community Survey (ACS), are widely used by researchers, among others. You can find and download census data from the Census Bureau website, from the licensed data source Social Explorer, or from free sources such as IPUMS-USA, and then load the data into a statistical package or other software to analyze or present it. Alternatively, you can do all of the above, from downloading to presenting, on one platform (in this case, R) by using the APIs provided by the Census Bureau. This can involve a bit of a learning curve if you have little or no experience with APIs and R. In this post, I share a few examples of using Census Bureau APIs with R to obtain census datasets. Many Census Bureau datasets are available via API; the following examples use the 2010 Decennial Census API.
Before running this script, you’ll need to install the RJSONIO package if you haven’t done so before. Make sure your machine is connected to the internet, and run install.packages("RJSONIO"). You only need to do this once.
API key
Get your API key at: https://www.census.gov/data/developers.html. Make sure to plug in your own API key in the following R code.
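One way to avoid pasting the key into every script is to keep it in an environment variable and read it at run time. The sketch below uses only base R and assumes a variable named CENSUS_API_KEY; that name is chosen here for illustration and is not required by the Census Bureau or by RJSONIO.

```r
# Read the API key from an environment variable instead of hard-coding it.
# CENSUS_API_KEY is a hypothetical name chosen for this sketch; set it with
# Sys.setenv() for one session, or in your ~/.Renviron file to keep it.
Sys.setenv(CENSUS_API_KEY = "yourAPIkey")  # placeholder value

APIkey <- Sys.getenv("CENSUS_API_KEY")
if (!nzchar(APIkey)) {
  stop("CENSUS_API_KEY is not set; request a key from the Census Bureau first.")
}

# Insert the key into a request URL for the 2010 Decennial Census API
resURL <- paste0("https://api.census.gov/data/2010/dec/sf1?key=", APIkey,
                 "&get=P003001,P003002,P003003&for=state:01")
resURL
```

With this pattern, the key stays out of any script you share, and the rest of the code can refer to APIkey as usual.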
Working directory and R package
Set the working directory on your computer (the path to where you want R to read/store files), and load the RJSONIO package.
# Set working directory
setwd('~/DataApiR') # plug in the working directory on your machine
# Load package
library(RJSONIO)
Extract state level data
Here we extract the total population, white population and black population of Alabama. To look up other variables, see the list of Census 2010 variables: https://api.census.gov/data/2010/dec/sf1/variables.html.
# call for total population, white population and black population of Alabama
# total population = P003001, white population = P003002,
# black population = P003003;
# FIPS code of Alabama = 01
resURL <- "https://api.census.gov/data/2010/dec/sf1?key=[YOUR KEY]&get=P003001,P003002,P003003&for=state:01"
# convert JSON content to R objects
ljson <- fromJSON(resURL)
# see the extracted data
ljson
[[1]]
[1] "P003001" "P003002" "P003003" "state"
[[2]]
[1] "4779736" "3275394" "1251311" "01"
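The result is a list whose first element holds the variable names and whose second element holds the values, all as text. As a minimal sketch, assuming the response has exactly the shape printed above, it can be turned into a one-row data frame with numeric counts. The list is typed in by hand here so the example runs without an API call.

```r
# The parsed response as printed above: a header row, then one data row
ljson <- list(
  c("P003001", "P003002", "P003003", "state"),
  c("4779736", "3275394", "1251311", "01")
)

# Use the first element as column names and the second as the values
al <- as.data.frame(as.list(setNames(ljson[[2]], ljson[[1]])),
                    stringsAsFactors = FALSE)

# The API returns every value as text, so convert the counts to numbers;
# the state code stays as text to keep its leading zero
countCols <- c("P003001", "P003002", "P003003")
al[countCols] <- lapply(al[countCols], as.numeric)

str(al)
```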
Extract county level data
Now let’s try to retrieve county level data in the state of Virginia for the same variables.
# call for total population, white population and black population of each county in Virginia
resURL <- "https://api.census.gov/data/2010/dec/sf1?key=[YOUR KEY]&get=P003001,P003002,P003003&for=county:*&in=state:51"
# convert and see first few rows of the data
ljson <- fromJSON(resURL)
head(ljson,3)
[[1]]
[1] "P003001" "P003002" "P003003" "state" "county"
[[2]]
[1] "33164" "21662" "9303" "51" "001"
[[3]]
[1] "98970" "79738" "9600" "51" "003"
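The state and county requests differ only in the for and in parts of the URL, so the URL can be assembled from its pieces. The helper below is a hypothetical sketch (buildCountyURL is not part of any package) that also zero-pads numeric state codes, since FIPS codes such as Alabama's 01 lose their leading zero when stored as numbers.

```r
# Hypothetical helper: build a request URL for every county in one state
# from the 2010 Decennial Census SF1 endpoint used in this post.
buildCountyURL <- function(APIkey, state, vars) {
  # Zero-pad numeric state codes so that 1 becomes "01" (Alabama)
  if (is.numeric(state)) state <- sprintf("%02d", as.integer(state))
  paste0("https://api.census.gov/data/2010/dec/sf1",
         "?key=", APIkey,
         "&get=", paste(vars, collapse = ","),
         "&for=county:*&in=state:", state)
}

vars <- c("P003001", "P003002", "P003003")
vaURL <- buildCountyURL("yourAPIkey", 51, vars)  # Virginia
alURL <- buildCountyURL("yourAPIkey", 1, vars)   # Alabama
vaURL
alURL
```

The resulting string can be passed to fromJSON() in the same way as the hand-written URLs above.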
Function to extract data
We can write a function to retrieve census data and convert them to a data frame. Again, we will extract county level data in the state of Virginia for the same variables.
# function to retrieve and convert data
getData <- function(APIkey, state, varname) {
  resURL <- paste("https://api.census.gov/data/2010/dec/sf1?get=", varname,
                  "&for=county:*&in=state:", state, "&key=", APIkey, sep="")
  lJSON <- fromJSON(resURL) # convert JSON content to R objects
  lJSON <- lJSON[2:length(lJSON)] # keep everything but the 1st element (var names) in lJSON
  lJSON.cou <- sapply(lJSON, function(x) x[5]) # extract county
  lJSON.tot <- sapply(lJSON, function(x) x[1]) # extract values of the variable for each county
  lJSON.whi <- sapply(lJSON, function(x) x[2])
  lJSON.bla <- sapply(lJSON, function(x) x[3])
  df <- data.frame(lJSON.cou, as.numeric(lJSON.tot), as.numeric(lJSON.whi),
                   as.numeric(lJSON.bla)) # create data frame with counties and values
  names(df) <- c("county", "tpop", "wpop", "bpop") # name the fields/vars in the data frame
  return(df)
}
# API key for census data
APIkey <- "yourAPIkey"
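# state FIPS code (Virginia); stored as text so that codes with a
# leading zero, such as "01" for Alabama, are passed to the API intact
state <- "51"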
# variables
varname <- paste("P003001","P003002","P003003",sep=",")
# call the function
vapop <- getData(APIkey,state,varname)
# see the first few rows
head(vapop)
county tpop wpop bpop
1 001 33164 21662 9303
2 003 98970 79738 9600
3 005 16250 15145 761
4 007 12690 9332 2932
5 009 32353 24829 6148
6 011 14973 11597 3007
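getData() relies on fixed positions (x[1], x[2], x[3], x[5]), so it only works for exactly three variables. A more general conversion can take the column names from the header row instead. The helper below is a hypothetical sketch, not part of RJSONIO; it assumes each element of the parsed response is a character vector, as in the output printed earlier, and it is run on two hand-typed county rows so that no API call is needed.

```r
# Hypothetical helper: turn a parsed API response (header row followed by
# data rows) into a data frame, whatever variables were requested.
responseToDF <- function(lJSON, geoCols = c("state", "county")) {
  header <- lJSON[[1]]
  rows <- do.call(rbind, lJSON[-1])   # character matrix, one row per county
  df <- as.data.frame(rows, stringsAsFactors = FALSE)
  names(df) <- header
  # Convert everything except the geography codes to numeric
  valueCols <- setdiff(header, geoCols)
  df[valueCols] <- lapply(df[valueCols], as.numeric)
  df
}

# Sample input: the first two Virginia counties from the response shown earlier
vaSample <- list(
  c("P003001", "P003002", "P003003", "state", "county"),
  c("33164", "21662", "9303", "51", "001"),
  c("98970", "79738", "9600", "51", "003")
)
vaDF <- responseToDF(vaSample)
vaDF
```

Keeping the state and county columns as text preserves leading zeros such as "001", which the join to the shapefile later depends on.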
That’s probably all you need for the purpose of getting census data, but let’s do a bit more to try some simple mapping of census data. To run the rest of the lines, you will need to install the rgdal, dplyr, and tmap packages.
Mapping census data
First, we’ll need to obtain shapefiles of Virginia counties so that we can plot the numeric data on a map. The shapefiles can be downloaded from the Census Bureau website: https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2010&layergroup=Counties+%28and+equivalent%29 (select Virginia from the 2010 County and Equivalent drop-down, and then click Download). Unzip the download and save the resulting folder in your working directory.
# load package: rgdal
library(rgdal)
# Use readOGR() to read in spatial data:
# dsn (data source name): specifies the directory in which the file is stored
# layer: specifies the file name
vacounty <- readOGR(dsn="tl_2010_51_county10", layer="tl_2010_51_county10")
OGR data source with driver: ESRI Shapefile
Source: "tl_2010_51_county10", layer: "tl_2010_51_county10"
with 134 features
It has 17 fields
class(vacounty)
[1] "SpatialPolygonsDataFrame"
attr(,"package")
[1] "sp"
# features: rows/observations
# fields: columns/variables
# plot vacounty
plot(vacounty)
names(vacounty) # list field names
[1] "STATEFP10" "COUNTYFP10" "COUNTYNS10" "GEOID10" "NAME10"
[6] "NAMELSAD10" "LSAD10" "CLASSFP10" "MTFCC10" "CSAFP10"
[11] "CBSAFP10" "METDIVFP10" "FUNCSTAT10" "ALAND10" "AWATER10"
[16] "INTPTLAT10" "INTPTLON10"
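The rgdal package was retired and removed from CRAN in October 2023, so installing it may fail on newer setups. One possible substitute, which is not the approach used in this article, is st_read() from the sf package. The sketch below assumes the unzipped shapefile folder sits in the working directory, and it only reads the file when sf is installed and the shapefile is present.

```r
# Assumes the unzipped shapefile folder is in the working directory,
# as in the readOGR() call above
shpDir <- "tl_2010_51_county10"
shpFile <- file.path(shpDir, paste0(shpDir, ".shp"))

# Only run when the sf package is installed and the shapefile is present
if (requireNamespace("sf", quietly = TRUE) && file.exists(shpFile)) {
  vacountySF <- sf::st_read(dsn = shpDir, layer = shpDir)
  print(class(vacountySF))  # an "sf" data frame, not a SpatialPolygonsDataFrame
  print(names(vacountySF))  # the attribute fields plus a geometry column
  plot(sf::st_geometry(vacountySF))
}
```

An sf object is itself a data frame, so with this route a join would be applied to the object directly rather than to an @data slot.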
Now we have the Virginia county shapefiles ready. Let’s “join” our numeric data to them so that we can plot the data on a map.
# Join vapop (attributes) to vacounty (shapefile with attributes)
# load package: dplyr
library(dplyr)
# See if the rows in the two objects match; the %in% operator identifies which
# values in one object are also contained in another
vacounty$COUNTYFP10 %in% vapop$county
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[71] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[99] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[113] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[127] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
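Scanning 134 TRUE values by eye is error-prone, so the same check can be condensed. As a minimal sketch with toy county codes standing in for vacounty$COUNTYFP10 and vapop$county, all() reduces the result to a single value and setdiff() lists any codes that fail to match.

```r
# Toy county codes standing in for vacounty$COUNTYFP10 and vapop$county
shapeCodes <- c("001", "003", "005", "007")
popCodes   <- c("001", "003", "005", "009")

# TRUE only if every shapefile county has a match in the population data
allMatched <- all(shapeCodes %in% popCodes)
allMatched

# Codes present in one object but missing from the other
missingFromPop   <- setdiff(shapeCodes, popCodes)
missingFromShape <- setdiff(popCodes, shapeCodes)
missingFromPop
missingFromShape
```

In the toy data the check returns FALSE because "007" has no population record and "009" has no polygon; with the real Virginia data above, every county matches.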
# Join the two datasets and check the first few rows of the joined result
head(left_join(vacounty@data, vapop, by=c("COUNTYFP10"="county")))
STATEFP10 COUNTYFP10 COUNTYNS10 GEOID10 NAME10
1 51 011 01497238 51011 Appomattox
2 51 017 01673638 51017 Bath
3 51 045 01673664 51045 Craig
4 51 103 01480139 51103 Lancaster
5 51 041 01480111 51041 Chesterfield
6 51 093 01702378 51093 Isle of Wight
NAMELSAD10 LSAD10 CLASSFP10 MTFCC10 CSAFP10 CBSAFP10
1 Appomattox County 06 H1 G4020 31340
2 Bath County 06 H1 G4020
3 Craig County 06 H1 G4020 40220
4 Lancaster County 06 H1 G4020
5 Chesterfield County 06 H1 G4020 40060
6 Isle of Wight County 06 H1 G4020 47260
METDIVFP10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10
1 A 863744566 3204517 +37.3707253 -078.8109404
2 A 1370512659 14049862 +38.0689876 -079.7328980
3 A 853489575 2798854 +37.4731287 -080.2317340
4 A 345115848 254201621 +37.7038306 -076.4131985
5 A 1096334108 35372995 +37.3784337 -077.5858474
6 A 817432028 122288802 +36.9014184 -076.7075687
tpop wpop bpop
1 14973 11597 3007
2 4731 4432 222
3 5190 5122 5
4 11391 7989 3184
5 316236 215954 69412
6 35270 25318 8712
# save the joined dataset
vacounty@data <- left_join(vacounty@data, vapop, by=c("COUNTYFP10"="county"))
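Row order matters in this step: the rows of vacounty@data must stay aligned with the polygons, and left_join() keeps the rows of its first argument in their original order, whereas base R's merge() sorts the result by default. The base R sketch below illustrates the same order-preserving lookup with match() on toy data; the three counties and their totals are copied from the joined output above.

```r
# Toy stand-ins: polygon attribute rows (in polygon order) and population data
shapeData <- data.frame(COUNTYFP10 = c("011", "017", "045"),
                        NAME10 = c("Appomattox", "Bath", "Craig"),
                        stringsAsFactors = FALSE)
popData <- data.frame(county = c("017", "045", "011"),
                      tpop = c(4731, 5190, 14973),
                      stringsAsFactors = FALSE)

# match() gives, for each polygon row, the position of its county in popData
idx <- match(shapeData$COUNTYFP10, popData$county)
joined <- shapeData
joined$tpop <- popData$tpop[idx]

# Row order still follows shapeData, and no county is left without a value
sameOrder <- identical(joined$COUNTYFP10, shapeData$COUNTYFP10)
hasMissing <- anyNA(joined$tpop)
joined
```

A similar anyNA() check on vacounty@data$tpop after the real join is a quick way to confirm that every polygon received a population value.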
Here is the fun part.
# Plot total population by county
# load package: tmap
library(tmap)
# qtm(): quick thematic map plot
qtm(vacounty, fill="tpop", title="Total Population")
References
- United States Census Bureau. (2022). Decennial census (2020, 2010, 2000). Census.gov. https://www.census.gov/data/developers/data-sets/decennial-census.2010.html
- Notes of a Dabbler. (2013, December 25). Exploring census and demographic data with R. https://www.r-bloggers.com/exploring-census-and-demographic-data-with-r/
Yun Tai
CLIR Postdoctoral Fellow
University of Virginia Library
September 29, 2016
Updated May 2023 to reflect changes to the US Census Bureau's APIs
For questions or clarifications regarding this article, contact statlab@virginia.edu.