Let's say we're interested in text mining the opinions of the Supreme Court of the United States. At the time of this writing, the opinions are published as PDF files at the following web page in the section titled "Opinions of the Court": https://www.supremecourt.gov/opinions/opinions.aspx. For the purposes of this introductory tutorial, we'll look at just three opinions from the 2014 term: (1) Glossip v. Gross, (2) Arizona State Legislature v. Arizona Independent Redistricting Comm'n, and (3) Michigan v. EPA. To follow along with this tutorial, you can download the opinions from this link: https://static.lib.virginia.edu/statlab/materials/data/opinions.zip. The opinions are in a zip file and will need to be extracted.
To begin, we load the pdftools package, which provides functions for extracting text from PDF files.
# install.packages("pdftools")
library(pdftools)
Next, create a vector of PDF file names using the list.files() function. The pattern argument says to only grab those files ending with "pdf":
files <- list.files(pattern = "pdf$")
NOTE: the code above only works if you have your working directory set to the folder where you downloaded and extracted the PDF files. A quick way to do this in RStudio is to go to Session...Set Working Directory.
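For example, assuming the zip file was extracted to a folder called "opinions" in your home directory (a hypothetical path; adjust it for your machine), you could also set the working directory from the console:
setwd("~/opinions")  # hypothetical location of the extracted PDF files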
The files vector contains the three PDF file names.
files
[1] "13-1314_3ea4.pdf" "14-46_bqmc.pdf" "14-7955_aplc.pdf"
We'll use this vector to automate the process of reading in the text of the PDF files.
The pdftools function for extracting text is pdf_text(). Using the lapply() function, we can apply pdf_text() to each element of the files vector and create an object called opinions.
opinions <- lapply(files, pdf_text)
This creates a list object with three elements, one for each document. The length() function verifies it contains three elements:
length(opinions)
[1] 3
Each element is a vector that contains the text of the PDF file, with one element per page. The length of each vector therefore corresponds to the number of pages in the PDF file. For example, the first vector has length 81 because the first PDF file has 81 pages. We can apply the length() function to each list element to see this:
lapply(opinions, length)
[[1]]
[1] 81
[[2]]
[1] 47
[[3]]
[1] 127
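As an optional cross-check (just a sketch), the pdftools function pdf_info() reports each document's page count directly, so we could compare:
sapply(files, function(f) pdf_info(f)$pages)  # page counts from the PDF metadata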
And we're pretty much done! The PDF files are now in R, ready to be cleaned up and analyzed. If you want to see what has been read in, you could enter the following in the console, but it's going to produce unpleasant blocks of text littered with escape characters such as \r and \n.
opinions
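To view a single page in a more readable form, we can use cat(), which renders those escape characters as actual line breaks and tabs. For example, to print the first page of the first opinion:
cat(opinions[[1]][1])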
When text has been read into R, we typically proceed to some sort of analysis. Here's a quick demo of what we could do with the tm package (tm = text mining).
First we load the tm package and then create a corpus, which is basically a database for text. Notice that instead of working with the opinions object we created earlier, we start over.
# install.packages("tm")
library(tm)
corp <- Corpus(URISource(files),
readerControl = list(reader = readPDF))
The Corpus() function creates a corpus. The first argument to Corpus() is what we want to use to create the corpus; in this case, it's the vector of PDF files. To do this, we use the URISource() function to indicate that the files vector is a URI source. URI stands for Uniform Resource Identifier; in other words, we're telling the Corpus() function that the vector of file names identifies our resources. The second argument, readerControl, tells Corpus() which reader to use to read in the text from the PDF files. That would be readPDF(), a tm function. The readerControl argument requires a list of control parameters, one of which is reader, so we enter list(reader = readPDF). Finally we save the result to an object called corp.
It turns out that the readPDF() function in the tm package actually creates a function that reads in PDF files. The documentation tells us it uses the pdftools::pdf_text() function as the default, which is the same function we used above (?readPDF).
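As a quick illustration (object names here are just for demonstration), we could generate the reader ourselves and pass it to Corpus(); since pdftools is the default engine, this is a sketch of what the call above is doing:
pdf_reader <- readPDF(engine = "pdftools")  # readPDF() returns a reader function
corp2 <- Corpus(URISource(files),
                readerControl = list(reader = pdf_reader))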
Now that we have a corpus, we can create a term-document matrix, or TDM for short. A TDM stores counts of terms for each document. The tm package provides a function to create a TDM called TermDocumentMatrix().
opinions.tdm <- TermDocumentMatrix(corp,
control =
list(removePunctuation = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
removeNumbers = TRUE,
bounds = list(global = c(3, Inf))))
The first argument is our corpus. The second argument is a list of control parameters. In our example we tell the function to clean up the corpus before creating the TDM: remove punctuation, remove stopwords (e.g., the, of, in), convert text to lowercase, stem the words, remove numbers, and only count words that appear in at least 3 documents. We save the result to an object called opinions.tdm.
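Before inspecting the TDM, a quick sanity check (a sketch; the exact numbers depend on your corpus) is to look at its dimensions, which give the number of distinct stemmed terms and the number of documents:
dim(opinions.tdm)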
To inspect the TDM and see what it looks like, we can use the inspect() function. Below we look at the first 10 terms:
inspect(opinions.tdm[1:10,])
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity : 0%
Maximal term length: 6
Weighting : term frequency (tf)
Sample :
Docs
Terms 13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
—————— 26 6 21
—decid 1 1 1
“all 1 1 2
“each 5 1 1
“in 14 3 5
“is 3 1 4
“it 6 3 8
“not 1 4 6
“on 1 1 3
“that 2 1 2
We see words preceded with double quotes and dashes even though we specified removePunctuation = TRUE. We even see a series of dashes being treated as a word. What happened? It appears the pdf_text() function preserved the unicode curly quotes and em-dashes used in the PDF files.
One way to take care of this is to manually use the removePunctuation() function with tm_map(), both functions in the tm package. The removePunctuation() function has an argument called ucp that, when set to TRUE, will look for unicode punctuation. Here's how we can use it to remove punctuation from the corpus:
corp <- tm_map(corp, removePunctuation, ucp = TRUE)
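As a small illustration of the difference (a sketch using made-up strings), removePunctuation() applied to a character vector leaves unicode punctuation in place unless ucp = TRUE:
removePunctuation("\u201cstate\u201d \u2014 court")              # curly quotes and em-dash survive
removePunctuation("\u201cstate\u201d \u2014 court", ucp = TRUE)  # unicode punctuation removed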
Now we can re-create the TDM, this time without the removePunctuation = TRUE argument.
opinions.tdm <- TermDocumentMatrix(corp,
control =
list(stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
removeNumbers = TRUE,
bounds = list(global = c(3, Inf))))
And this appears to have taken care of the punctuation problem.
inspect(opinions.tdm[1:10,])
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Docs
Terms 13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
abandon 1 1 8
abdic 1 1 1
absent 5 2 2
accept 6 4 12
accompani 1 2 2
accomplish 4 1 1
accord 12 10 13
account 1 26 8
accur 1 3 1
achiev 1 15 3
We see, for example, that the term "abandon" appears in the third PDF file 8 times. Also notice that words have been stemmed. The word "achiev" is the stemmed version of "achieve," "achieved," "achieves," and so on.
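The stemming is carried out by the SnowballC package, which tm calls behind the scenes. As a quick check (a sketch, assuming SnowballC is installed), we can stem a few words directly:
SnowballC::wordStem(c("achieve", "achieved", "achieves"))  # each should stem to "achiev"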
The tm package includes a few functions for summary statistics. We can use the findFreqTerms() function to quickly find frequently occurring terms. To find words that occur at least 100 times:
findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
[1] "also" "amend" "ant" "case" "cite" "claus" "congress"
[8] "constitut" "cost" "court" "decis" "dissent" "district" "effect"
[15] "elect" "execut" "feder" "find" "justic" "law" "major"
[22] "make" "may" "one" "opinion" "petition" "power" "reason"
[29] "requir" "see" "state" "time" "tion" "unit" "use"
To see the counts of those words, we could save the result and use it to subset the TDM. Notice we have to use as.matrix() to see the printout of the subsetted TDM.
ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
as.matrix(opinions.tdm[ft,])
Docs
Terms 13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
also 24 13 74
amend 57 9 84
ant 38 36 46
case 67 12 109
cite 52 27 78
claus 123 4 1
congress 70 43 3
constitut 190 4 81
cost 1 220 8
court 197 57 343
decis 27 41 33
dissent 77 44 124
district 90 4 81
effect 10 26 130
elect 178 1 4
execut 14 5 290
feder 77 8 28
find 9 60 54
justic 44 7 74
law 102 15 30
major 83 42 9
make 45 41 32
may 77 17 48
one 53 24 67
opinion 87 33 112
petition 3 11 127
power 98 115 8
reason 13 50 42
requir 22 40 52
see 101 66 182
state 529 25 260
time 29 19 63
tion 56 17 47
unit 63 22 38
use 37 13 140
year 22 19 92
To see the total counts for those words, we could save the matrix and apply the sum() function across the rows:
ft.tdm <- as.matrix(opinions.tdm[ft,])
sort(apply(ft.tdm, 1, sum), decreasing = TRUE)
state court see execut constitut dissent opinion
814 597 349 309 275 245 232
cost power use case elect district effect
229 221 190 188 183 175 166
cite amend law one may petition major
157 150 147 144 142 141 134
year claus justic find unit ant tion
133 128 125 123 123 120 120
make congress requir feder also time reason
118 116 114 113 111 111 105
decis
101
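As one more example (a sketch using the objects created above), we could visualize these totals with a simple bar plot of the top 10 stemmed terms:
totals <- sort(apply(ft.tdm, 1, sum), decreasing = TRUE)
barplot(head(totals, 10), las = 2, main = "Top 10 stemmed terms")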
Many more analyses are possible. But again, the main point of this tutorial was to show how to read in text from PDF files for text mining. Hopefully this provides a template to get you started.
Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2016
Updated May 14, 2019
For questions or clarifications regarding this article, contact statlab@virginia.edu.
View the entire collection of UVA Library StatLab articles, or learn how to cite.