26.1306, FYI: COHA Full-Text Data: 385 Million Words, 116k Texts

The LINGUIST List via LINGUIST linguist at listserv.linguistlist.org
Mon Mar 9 19:03:36 UTC 2015


LINGUIST List: Vol-26-1306. Mon Mar 09 2015. ISSN: 1069 - 4875.

Moderators: linguist at linguistlist.org (Damir Cavar, Malgorzata E. Cavar)
Reviews: reviews at linguistlist.org (Anthony Aristar, Helen Aristar-Dry, Sara Couture)
Homepage: http://linguistlist.org

Editor for this issue: Uliana Kazagasheva <uliana at linguistlist.org>
================================================================


Date: Mon, 09 Mar 2015 15:01:06
From: Mark Davies [mark_davies at byu.edu]
Subject: COHA Full-Text Data: 385 Million Words, 116k Texts

 
This announcement is for those who are interested in historical corpora and
who may want a large dataset to work with on their own machine. The dataset is
a full-text corpus, rather than just n-grams (as with the Google Books
n-grams; see a comparison at
http://googlebooks.byu.edu/compare-googleBooks.asp).

We are pleased to announce that the Corpus of Historical American English
(COHA; http://corpus.byu.edu/coha/) is now available in downloadable full-text
format, for use on your own computer:
http://corpus.byu.edu/full-text/

COHA joins COCA and GloWbE, which have been available in downloadable
full-text format since March 2014.

The downloadable version of COHA contains 385 million words of text in more
than 115,000 separate texts, covering fiction, popular magazines, newspaper
articles, and non-fiction books from the 1810s to the 2000s (see
http://corpus.byu.edu/full-text/coha_full_text.asp).

At 385 million words, the downloadable COHA corpus is much larger than any
other structured historical corpus of English. This volume of data supports
many types of research that would not be possible with the much smaller
historical corpora of English, which contain 5-10 million words each (see
http://corpus.byu.edu/compare-smallCorpora.asp).

The corpus is available in several formats: sentence/paragraph text;
PoS-tagged and lemmatized text (one word per line); and a format for import
into a relational database. Samples of each format (3.6 million words each)
are available at the full-text website.
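
As a rough illustration of working with the one-word-per-line format, the
Python sketch below reads tagged lines and counts lemma frequencies. The
three-column, tab-separated layout (word, lemma, PoS tag), the tag values, and
the sample tokens are assumptions made for illustration only, not the
documented layout of the COHA files; please check the samples at the
full-text website for the actual columns.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str


def parse_wlp(lines: Iterable[str]) -> List[Token]:
    """Parse one-token-per-line text into Token records.

    ASSUMPTION: each line holds three tab-separated fields
    (word, lemma, PoS tag). Lines without exactly three fields
    (blank lines, text-boundary markers, etc.) are skipped.
    """
    tokens = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue
        tokens.append(Token(*fields))
    return tokens


def lemma_counts(tokens: Iterable[Token]) -> Counter:
    """Count how often each lemma occurs, ignoring case."""
    return Counter(t.lemma.lower() for t in tokens)


# Made-up sample data in the assumed layout (not taken from COHA).
sample = (
    "The\tthe\tat\n"
    "ships\tship\tnn2\n"
    "sailed\tsail\tvvd\n"
    "\n"
    "The\tthe\tat\n"
    "ship\tship\tnn1\n"
)

tokens = parse_wlp(sample.splitlines())
counts = lemma_counts(tokens)
print(counts.most_common(2))
```

Counting by lemma rather than by word form groups "ship" and "ships"
together, which is one reason a lemmatized format can be convenient for
frequency work.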

We hope that this new resource is of value to you in your research and
teaching.

Mark Davies
Brigham Young University
http://davies-linguistics.byu.edu/
http://corpus.byu.edu/
 



Linguistic Field(s): Computational Linguistics
                     Historical Linguistics
                     Lexicography
                     Text/Corpus Linguistics

Subject Language(s): English (eng)





 


