[Corpora-List] Now available: Downloadable COCA and GloWbE full-text corpus data

Mark Davies Mark_Davies at byu.edu
Mon Mar 17 13:44:58 UTC 2014


At http://corpus.byu.edu/full-text/ you can now download full-text data for the following two corpora:

  *   Corpus of Contemporary American English<http://corpus.byu.edu/coca/> (COCA). 440 million words of downloadable text (190,000 separate texts). Balanced for genre: about 88 million words each of spoken, fiction, magazine, newspaper, and academic text. With the included [sources] table, you can also search by sub-genre, e.g. News-Financial or Academic-Medicine.
  *   Corpus of Global Web-Based English<http://corpus2.byu.edu/glowbe/> (GloWbE). 1.8 billion words of downloadable text (1,800,000 separate texts). Divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, which makes it a good source of very informal language.

Of course, with the full-text data from either corpus, you will have the actual corpora on your computer. As a result, you can do many things that would be difficult or impossible with the standard web interface<http://corpus.byu.edu/>, such as sentiment analysis, topic modeling, named entity recognition, advanced regex searches, creating treebanks, and so on.
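
As a rough illustration of the kind of regex search that becomes possible with local text, here is a minimal Python sketch that returns keyword-in-context hits. The function name kwic, the width parameter, and the sample sentence are all made up for this example; none of them come from the corpus data or its documentation.

```python
import re

def kwic(text, pattern, width=30):
    """Return (left context, match, right context) for each regex match."""
    results = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        results.append((left, m.group(0), right))
    return results

# Invented sample text standing in for a corpus file (an assumption).
sample = "She has been running daily. They have been talking about it."

# Find "has/have been" followed by a word ending in -ing.
hits = kwic(sample, r"\b(?:has|have) been \w+ing\b", width=10)
for left, match, right in hits:
    print(f"{left:>10} | {match} | {right}")
```

For a real corpus file you would read the text from disk instead of using an inline string, and probably process it one file at a time to keep memory use down.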

The data comes in three different formats<http://corpus.byu.edu/full-text/formats.asp> (see samples<http://corpus.byu.edu/full-text/samples.asp>): data for relational databases (info<http://corpus.byu.edu/full-text/database.asp>), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data<http://corpus.byu.edu/full-text/purchase.asp>, you purchase the rights to any and all of these formats.
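
To give a sense of how the word/lemma/PoS (vertical) format might be processed, here is a minimal Python sketch that counts lemma frequencies. It assumes one token per line with three tab-separated columns (word, lemma, PoS tag); that layout, the function name read_vertical, and the sample rows and tags are assumptions for illustration only, so check the samples page for the actual column layout before relying on it.

```python
from collections import Counter

def read_vertical(lines):
    """Yield (word, lemma, pos) tuples from tab-separated lines.

    Assumes exactly three columns per line; other lines are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        yield tuple(parts)

# Invented rows standing in for a vertical-format file (an assumption).
sample = [
    "The\tthe\tat",
    "dogs\tdog\tnn2",
    "barked\tbark\tvvd",
    "dogs\tdog\tnn2",
]

# Count how often each lemma occurs.
lemma_counts = Counter(lemma for _, lemma, _ in read_vertical(sample))
print(lemma_counts.most_common(2))
```

Because read_vertical is a generator, the same approach should work on an open file handle without loading the whole file into memory.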

Best,

Mark Davies

http://davies-linguistics.byu.edu/

