Corpus of Contemporary American English (COCA)

The Corpus of Contemporary American English (COCA) is the only large and “representative” corpus of American English. COCA is probably the most widely used corpus of English, and it is related to many other corpora of English that we have created. These corpora were formerly known as the “BYU Corpora”, and they offer unparalleled insight into variation in English.

The corpus contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the update in March 2020) TV and movie subtitles, blogs, and other web pages.

Note: The license stipulates that the data is primarily intended for use in research, not teaching. Faculty and graduate students may request access to the data for their research. Undergraduate students may not have access. If you need corpus data for undergraduate classes, please use the standard web interface for the corpora.

Read more information about formats, including databases.

Please note the restrictions on the data and see the FAQ page.

Restrictions:

  • In no case can substantial amounts of the full-text data (typically, a total of 50,000 words or more) be distributed outside the organization listed on the license agreement. For example, you cannot create a large word list or set of n-grams and then distribute it to others, and you cannot copy 70,000 words from different texts and then place them on a website where users from outside your organization would have access to the data.
  • The data cannot be placed on a network (including the Internet) unless access to the data is limited (via restricted login, password, etc.) to those from the organization listed on the license agreement. For example, it cannot be placed on another corpus site that indexes the data and then makes it available to end users, because that other corpus site would then have access to the data.
  • In addition to the full-text data itself, the second restriction above (on placing data on a network) also applies to derived frequency, collocates, n-grams, concordance, and similar data that are based on the corpus.
  • If portions of the derived data are made available to others, they cannot include substantial portions of the raw frequency of words (e.g. the word occurs 3,403 times in the corpus) or the rank order (e.g. it is the 304th most common word). (Note: it is acceptable to use the frequency data to place words and phrases in “frequency bands”, e.g. words 1-1000, 1001-3000, 3001-10,000, etc. However, there should not be more than about 20 frequency bands in your application.)
  • Any publications or products that are based on the data should contain a reference to the source of the data: https://www.corpusdata.org.
  • Note that a small, unique change will be made to each set of data, and this will serve as a “fingerprint” to identify you as the unique source of the datasets that you download. Automated Google searches are run daily to find copies of the data on the Web. If we find the data online and it is the data that was sent to you (and we will be able to determine that is the case), then you will be required to contact the administrators for that website, to have the data removed.
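
As an illustration of the “frequency bands” allowance in the restrictions above, here is a minimal Python sketch, not an official tool. It assumes you already hold a list of (word, rank) pairs derived from the licensed data. The names band_label, to_bands, BAND_UPPER_BOUNDS, and MAX_BANDS are hypothetical; the band boundaries copy the example given in the restriction, and the limit of 20 is a strict reading of “about 20”.

```python
from bisect import bisect_left

# Upper rank bound of each band. These values mirror the example in the
# restrictions (1-1000, 1001-3000, 3001-10,000); they are an assumption,
# not a requirement of the license.
BAND_UPPER_BOUNDS = [1000, 3000, 10000]

# The restrictions say "not more than about 20" bands; 20 is used here
# as a conservative cut-off.
MAX_BANDS = 20


def band_label(rank, bounds=BAND_UPPER_BOUNDS):
    """Return a coarse band label such as '1001-3000' for a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    i = bisect_left(bounds, rank)
    if i == len(bounds):
        # Everything beyond the last bound falls into one open-ended band.
        return f"{bounds[-1] + 1}+"
    lower = 1 if i == 0 else bounds[i - 1] + 1
    return f"{lower}-{bounds[i]}"


def to_bands(ranked_words, bounds=BAND_UPPER_BOUNDS):
    """Map each word to its band label, dropping the exact rank."""
    # len(bounds) closed bands plus one open-ended band.
    if len(bounds) + 1 > MAX_BANDS:
        raise ValueError("too many frequency bands")
    return {word: band_label(rank, bounds) for word, rank in ranked_words}


# The words and ranks below are invented for illustration only.
print(to_bands([("example", 304), ("sample", 2500), ("rarity", 45000)]))
```

Only the band label is kept for each word, so the output carries neither the raw frequency nor the exact rank order. Whether a given application complies with the license is still a question for the license agreement itself.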

Suggested citation:

COCA

Davies, Mark. (2008-) The Corpus of Contemporary American English (COCA). Available online at https://www.english-corpora.org/coca/.

Frequency data

Davies, Mark. (2008-) Word frequency data from The Corpus of Contemporary American English (COCA). Data available online at https://www.wordfrequency.info.

N-grams data

Davies, Mark. (2008-) N-grams data from The Corpus of Contemporary American English (COCA). Data available online at https://www.ngrams.info.

Collocates data

Davies, Mark. (2008-) Collocates data from The Corpus of Contemporary American English (COCA). Data available online at https://www.collocates.info.

 

Request access by contacting data@virginia.edu.

Last updated: 24 March 2023