[Corpora-List] ask for very large, well-balanced corpus

Mark Davies Mark_Davies at byu.edu
Mon Jul 16 21:48:41 UTC 2012


>> Does anyone know where or how I can get a well-balanced corpus of modern English, such as BNC, but with a much larger size? I hope it can have at least 1 billion words.

It's only 450 million words, but you might try the Corpus of Contemporary American English (COCA): http://corpus.byu.edu/coca

It is divided evenly into spoken, fiction, popular magazines, newspapers, and academic, each with 90-95 million words.

It is also much more recent than the BNC. COCA has 20 million words for each year from 1990 to 2012, whereas the BNC ends in 1993.

Finally, it has the same genre balance each year, which makes it nice for looking at recent changes in English; see:

Davies, Mark. (2010) "The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English". Literary and Linguistic Computing 25: 447-65.

Best,

Mark Davies

============================================
Mark Davies
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================




From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Lushan Han [lushan1 at umbc.edu]
Sent: Monday, July 16, 2012 1:10 PM
To: corpora at uib.no
Subject: [Corpora-List] ask for very large, well-balanced corpus


Dear all,

Does anyone know where or how I can get a well-balanced corpus of modern English, such as BNC, but with a much larger size? I hope it can have at least 1 billion words. I tried to assemble a corpus from Wikipedia articles, but it turned out not to be balanced: Wikipedia contains many articles of the same type, for example on films or birds.

A Web corpus should be okay for my purpose as long as it was harvested from balanced domains.


Thanks,

Lushan Han 