[Corpora-List] ask for very large, well-balanced corpus

Lushan Han lushan1 at umbc.edu
Mon Jul 16 22:42:02 UTC 2012


COCA looks like a good one. But could I have a copy of the corpus and run
my own programs on it? The web interface does not meet my requirements.

Thanks,

Lushan Han



On Mon, Jul 16, 2012 at 5:48 PM, Mark Davies <Mark_Davies at byu.edu> wrote:

> >> Does anyone know where or how I can get a well-balanced corpus of
> modern English, such as BNC, but with a much larger size?  I hope it can
> have at least 1 billion words
>
> It's only 450 million words, but you might try: http://corpus.byu.edu/coca (COCA)
>
> It is divided evenly into spoken, fiction, popular magazines, newspapers,
> and academic, each with 90-95 million words.
>
> It is also much more recent than the BNC. COCA has 20 million words each
> year, 1990-2012 (compared to the 1993 end date of the BNC).
>
> Finally, it has the same genre balance each year, which makes it nice for
> looking at recent changes in English; see:
>
> Davies, Mark. (2011) "The Corpus of Contemporary American English as the
> First Reliable Monitor Corpus of English". Literary and Linguistic
> Computing 25: 447-65.
>
> Best,
>
> Mark Davies
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
>
>
>
>
> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Lushan
> Han [lushan1 at umbc.edu]
> Sent: Monday, July 16, 2012 1:10 PM
> To: corpora at uib.no
> Subject: [Corpora-List] ask for very large, well-balanced corpus
>
>
> Dear all,
>
> Does anyone know where or how I can get a well-balanced corpus of modern
> English, such as BNC, but with a much larger size? I hope it can have at
> least 1 billion words. I tried to assemble a corpus from Wikipedia articles
> but it turned out that such a corpus is not balanced. Wikipedia contains
> many repetitions of the same type of articles, for example, films or birds.
>
> A Web corpus should be okay for my purpose as long as it was harvested
> from balanced domains.
>
>
> Thanks,
>
> Lushan Han
>