Reading PDF Files into R for Text Mining

Let's say we're interested in text mining the opinions of the Supreme Court of the United States. At the time of this writing, the opinions are published as PDF files at the following web page in the section titled "Opinions of the Court": https://www.supremecourt.gov/opinions/opinions.aspx. For the purposes of this introductory tutorial, we'll look at just three opinions from the 2014 term: (1) Glossip v. Gross, (2) State Legislature v. Arizona Independent Redistricting Comm’n, and (3) Michigan v. EPA. To follow along with this tutorial, you can download the opinions from this link: https://static.lib.virginia.edu/statlab/materials/data/opinions.zip. The opinions are in a zip file and will need to be extracted.
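
If you prefer to stay in R, you can download and extract the zip file with base R functions. Here's a quick sketch (the destination file name opinions.zip is our choice; the PDFs extract into the current working directory):


# download the zip file of opinions and extract the three PDFs
download.file("https://static.lib.virginia.edu/statlab/materials/data/opinions.zip",
              destfile = "opinions.zip", mode = "wb")
unzip("opinions.zip")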

To begin, we load the pdftools package, which provides functions for extracting text from PDF files.


# install.packages("pdftools")
library(pdftools)

Next, create a vector of PDF file names using the list.files() function. The pattern argument is a regular expression that matches only those files ending with "pdf":


files <- list.files(pattern = "pdf$")

NOTE: the code above only works if you have your working directory set to the folder where you downloaded and extracted the PDF files. A quick way to do this in RStudio is to go to Session...Set Working Directory.
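
You can also set the working directory with code using setwd(). The path below is hypothetical; substitute the folder where you extracted the PDF files:


# set the working directory to the folder containing the PDFs
# (hypothetical path; adjust for your machine)
setwd("~/Downloads/opinions")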

The files vector contains the three PDF file names:


files


[1] "13-1314_3ea4.pdf" "14-46_bqmc.pdf"   "14-7955_aplc.pdf"

We'll use this vector to automate the process of reading in the text of the PDF files.

The pdftools function for extracting text is pdf_text(). Using lapply(), we can apply pdf_text() to each element of the files vector and save the result to an object called opinions.


opinions <- lapply(files, pdf_text)

This creates a list object with three elements, one for each document. The length() function verifies it contains three elements:


length(opinions)


[1] 3

Each element of the list is a character vector containing the text of one PDF file, with one element per page. The length of each vector therefore corresponds to the number of pages in the PDF file. For example, the first vector has length 81 because the first PDF file has 81 pages. We can apply the length() function to each element to see this:


lapply(opinions, length) 


[[1]]
[1] 81

[[2]]
[1] 47

[[3]]
[1] 127

And we're pretty much done! The PDF files are now in R, ready to be cleaned up and analyzed. If you want to see what has been read in, you could enter the following in the console, but it's going to produce unpleasant blocks of text littered with escape characters such as \r and \n.


opinions
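
A friendlier way to view a single page is to pass it to cat(), which renders the escape characters instead of printing them. For example, to view the first page of the first opinion:


# print the first page of the first opinion with \r and \n rendered
cat(opinions[[1]][1])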

When text has been read into R, we typically proceed to some sort of analysis. Here's a quick demo of what we could do with the tm package (tm = text mining).

First we load the tm package and then create a corpus, which is basically a database for text. Notice that instead of working with the opinions object we created earlier, we start over.


# install.packages("tm")
library(tm)
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))

The Corpus() function creates a corpus. Its first argument is the source of the documents, in this case the vector of PDF file names. We wrap it in the URISource() function to indicate that the files vector is a URI source, where URI stands for Uniform Resource Identifier. In other words, we're telling Corpus() that the vector of file names identifies our resources. The second argument, readerControl, tells Corpus() which reader to use to extract the text from the PDF files; that would be readPDF(), a tm function. The readerControl argument requires a list of control parameters, one of which is reader, so we enter list(reader = readPDF). Finally, we save the result to an object called corp.
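
As a quick check, we can confirm the corpus contains three documents and look at the metadata of the first one, which should include the file name as the document id:


# number of documents in the corpus
length(corp)

# metadata for the first document
meta(corp[[1]])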

It turns out that the readPDF() function in the tm package does not read PDF files itself; it creates a function that reads in PDF files. The documentation (?readPDF) tells us it uses the pdftools::pdf_text() function as the default, which is the same function we used above.
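
We can verify this for ourselves: calling readPDF() returns a function, which Corpus() then uses to read each file.


# readPDF() returns a reader function rather than reading anything itself
read_engine <- readPDF(engine = "pdftools")
class(read_engine)  # "function"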

Now that we have a corpus, we can create a term-document matrix, or TDM for short. A TDM stores counts of terms for each document. The tm package provides a function to create a TDM called TermDocumentMatrix().


opinions.tdm <- TermDocumentMatrix(corp, 
                                   control = 
                                     list(removePunctuation = TRUE,
                                          stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(3, Inf)))) 

The first argument is our corpus. The second argument is a list of control parameters. In our example we tell the function to clean up the corpus before creating the TDM: remove punctuation, remove stopwords (e.g., the, of, in, etc.), convert text to lowercase, stem the words, remove numbers, and only count terms that appear in at least 3 documents. Since our corpus contains exactly three documents, that last setting keeps only the terms that appear in all of them. We save the result to an object called opinions.tdm.
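
Before inspecting individual terms, it can be helpful to see how many terms survived the filtering. The tm package provides nTerms() and nDocs() for this (the exact term count will depend on your files):


# how many terms and documents does the TDM contain?
nTerms(opinions.tdm)
nDocs(opinions.tdm)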

To inspect the TDM and see what it looks like, we can use the inspect() function. Below we look at the first 10 terms:


inspect(opinions.tdm[1:10,]) 


<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity           : 0%
Maximal term length: 6
Weighting          : term frequency (tf)
Sample             :
        Docs
Terms    13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
  ——————               26              6               21
  —decid                1              1                1
  “all                  1              1                2
  “each                 5              1                1
  “in                  14              3                5
  “is                   3              1                4
  “it                   6              3                8
  “not                  1              4                6
  “on                   1              1                3
  “that                 2              1                2

We see words preceded with double quotes and dashes even though we specified removePunctuation = TRUE. We even see a series of dashes being treated as a word. What happened? The pdf_text() function preserved the Unicode curly quotes and em-dashes used in the PDF files, and by default tm's punctuation removal only strips ASCII punctuation.

One way to take care of this is to manually apply the removePunctuation() function with tm_map(), both functions in the tm package. The removePunctuation() function has an argument called ucp that, when set to TRUE, will look for Unicode punctuation. Here's how we can use it to remove punctuation from the corpus:


corp <- tm_map(corp, removePunctuation, ucp = TRUE)
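
To see the difference ucp makes, we can try removePunctuation() on a small made-up string containing curly quotes and an em-dash:


x <- "\u201ccost\u201d \u2014 and \u201ceffect\u201d"
removePunctuation(x)              # default may leave the Unicode punctuation behind
removePunctuation(x, ucp = TRUE)  # Unicode punctuation removed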

Now we can re-create the TDM, this time without the removePunctuation = TRUE argument.


opinions.tdm <- TermDocumentMatrix(corp, 
                                   control = 
                                     list(stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(3, Inf)))) 

And this appears to have taken care of the punctuation problem.


inspect(opinions.tdm[1:10,]) 

		
<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
            Docs
Terms        13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
  abandon                   1              1                8
  abdic                     1              1                1
  absent                    5              2                2
  accept                    6              4               12
  accompani                 1              2                2
  accomplish                4              1                1
  accord                   12             10               13
  account                   1             26                8
  accur                     1              3                1
  achiev                    1             15                3
		
	

We see, for example, that the term "abandon" appears in the third PDF file 8 times. Also notice that words have been stemmed. The word "achiev" is the stemmed version of "achieve," "achieved," "achieves," and so on.
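
Under the hood, the stemming is performed by the SnowballC package, which tm uses. We can see the Porter stemmer at work directly:


# install.packages("SnowballC")
# each of these should stem to "achiev"
SnowballC::wordStem(c("achieve", "achieved", "achieves", "achieving"))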

The tm package includes a few functions for summary statistics. We can use the findFreqTerms() function to quickly find frequently occurring terms. To find words that occur at least 100 times:


findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)


 [1] "also"      "amend"     "ant"       "case"      "cite"      "claus"     "congress" 
 [8] "constitut" "cost"      "court"     "decis"     "dissent"   "district"  "effect"   
[15] "elect"     "execut"    "feder"     "find"      "justic"    "law"       "major"    
[22] "make"      "may"       "one"       "opinion"   "petition"  "power"     "reason"   
[29] "requir"    "see"       "state"     "time"      "tion"      "unit"      "use"    

Notice the odd "terms" ant and tion in this list. These are likely fragments of words (e.g., "important," "jurisdiction") that were hyphenated across line breaks in the PDFs and split in two when the punctuation was removed. To see the counts of those words, we can save the result and use it to subset the TDM. Notice we have to use as.matrix() to see the printout of the subsetted TDM.


ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
as.matrix(opinions.tdm[ft,]) 


           Docs
Terms       13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
  also                    24             13               74
  amend                   57              9               84
  ant                     38             36               46
  case                    67             12              109
  cite                    52             27               78
  claus                  123              4                1
  congress                70             43                3
  constitut              190              4               81
  cost                     1            220                8
  court                  197             57              343
  decis                   27             41               33
  dissent                 77             44              124
  district                90              4               81
  effect                  10             26              130
  elect                  178              1                4
  execut                  14              5              290
  feder                   77              8               28
  find                     9             60               54
  justic                  44              7               74
  law                    102             15               30
  major                   83             42                9
  make                    45             41               32
  may                     77             17               48
  one                     53             24               67
  opinion                 87             33              112
  petition                 3             11              127
  power                   98            115                8
  reason                  13             50               42
  requir                  22             40               52
  see                    101             66              182
  state                  529             25              260
  time                    29             19               63
  tion                    56             17               47
  unit                    63             22               38
  use                     37             13              140
  year                    22             19               92

To see the total counts for those words, we can save the matrix and apply the sum() function to each row:


ft.tdm <- as.matrix(opinions.tdm[ft,])
sort(apply(ft.tdm, 1, sum), decreasing = TRUE)

		
    state     court       see    execut constitut   dissent   opinion 
      814       597       349       309       275       245       232 
     cost     power       use      case     elect  district    effect 
      229       221       190       188       183       175       166 
     cite     amend       law       one       may  petition     major 
      157       150       147       144       142       141       134 
     year     claus    justic      find      unit       ant      tion 
      133       128       125       123       123       120       120 
     make  congress    requir     feder      also      time    reason 
      118       116       114       113       111       111       105 
     decis 
      101
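
As an aside, base R's rowSums() function is a more concise equivalent of apply(x, 1, sum):


# same result as apply(ft.tdm, 1, sum)
sort(rowSums(ft.tdm), decreasing = TRUE)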

Many more analyses are possible. But again, the main point of this tutorial was to show how to read text from PDF files into R for text mining. Hopefully this provides a template to get you started.


Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2016
Updated May 14, 2019


For questions or clarifications regarding this article, contact statlab@virginia.edu.
