[Corpora-List] ask for very large, well-balanced corpus

Adam Kilgarriff adam at lexmasterclass.com
Tue Jul 17 05:42:01 UTC 2012


UKWaC is bigger and available to download without bureaucracy - see
http://wacky.sslmit.unibo.it/doku.php

Others are available with a bit more bureaucracy (to cover legal concerns).

>   a well-balanced corpus

This is the million-dollar question.  No-one really knows what it means.
BNC and COCA are balanced according to their designers' opinion of what it
means (mixed with the pragmatics of what was accessible).  UKWaC and other
web-crawled corpora are balanced according to the balance of the language
as found on the web.  Which is best all depends on what you want.  (Not
that you are ever likely to find out which would have been better, since
the science of the question is in its infancy.)

BNC and COCA have the advantage that the different text types they contain
are given to you in the metadata.  For web crawls, working them out is a
big and current research question (there's great work being done in Leeds
by Serge Sharoff and Richard Sutcliffe; I've seen the talk but it's not
published yet).

Adam

On 16 July 2012 23:42, Lushan Han <lushan1 at umbc.edu> wrote:

> COCA looks like a good one. But could I have a copy of the corpus and run
> my own programs on it? The web interface cannot meet my requirement.
>
> Thanks,
>
> Lushan Han
>
>
>
> On Mon, Jul 16, 2012 at 5:48 PM, Mark Davies <Mark_Davies at byu.edu> wrote:
>
>> >> Does anyone know where or how I can get a well-balanced corpus of
>> modern English, such as BNC, but with a much larger size?  I hope it can
>> have at least 1 billion words
>>
>> It's only 450 million words, but you might try:
>> http://corpus.byu.edu/coca (COCA)
>>
>> It is divided evenly into spoken, fiction, popular magazines, newspapers,
>> and academic, each with 90-95 million words.
>>
>> It is also much more recent than the BNC. COCA has 20 million words each
>> year, 1990-2012 (compared to the 1993 end date of the BNC).
>>
>> Finally, it has the same genre balance each year, which makes it nice for
>> looking at recent changes in English; see:
>>
>> Davies, Mark. (2011) "The Corpus of Contemporary American English as the
>> First Reliable Monitor Corpus of English". Literary and Linguistic
>> Computing 25: 447-65.
>>
>> Best,
>>
>> Mark Davies
>>
>> ============================================
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> http://davies-linguistics.byu.edu/
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>>
>>
>>
>>
>> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of
>> Lushan Han [lushan1 at umbc.edu]
>> Sent: Monday, July 16, 2012 1:10 PM
>> To: corpora at uib.no
>> Subject: [Corpora-List] ask for very large, well-balanced corpus
>>
>>
>> Dear all,
>>
>> Does anyone know where or how I can get a well-balanced corpus of modern
>> English, such as BNC, but with a much larger size? I hope it can have at
>> least 1 billion words. I tried to assemble a corpus from Wikipedia articles
>> but it turned out that such a corpus is not balanced. Wikipedia contains
>> many repetitions of the same type of articles, for example, films or birds.
>>
>> A Web corpus should be okay for my purpose as long as it was harvested
>> from balanced domains.
>>
>>
>> Thanks,
>>
>> Lushan Han
>>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================

