[Corpora-List] ask for very large, well-balanced corpus

Lushan Han lushan1 at umbc.edu
Tue Jul 17 19:07:16 UTC 2012


Thank you, guys. Your responses are very helpful!

Best,

Lushan

On Tue, Jul 17, 2012 at 1:42 AM, Adam Kilgarriff <adam at lexmasterclass.com> wrote:

> UKWaC is bigger and available to download without bureaucracy - see
> http://wacky.sslmit.unibo.it/doku.php
>
> Others are available with a bit more bureaucracy (to cover legal concerns).
>
> >   a well-balanced corpus
>
> This is the million-dollar question. No one really knows what it means.
> BNC and COCA are balanced according to their designers' opinion of what it
> means (mixed with the pragmatics of what was accessible). UKWaC and other
> web-crawled corpora are balanced according to the balance of the language
> as found on the web. Which is best depends entirely on what you want. (Not
> that you are ever likely to find out which would have been better, since
> the science of the question is in its infancy.)
>
> BNC and COCA have the advantage that the different text types they contain
> are given to you in the metadata. For web crawls, working them out is a
> big and current research question (there's great work being done in Leeds
> by Serge Sharoff and Richard Sutcliffe; I've seen the talk, but it's not
> published yet).
>
> Adam
>
>  On 16 July 2012 23:42, Lushan Han <lushan1 at umbc.edu> wrote:
>
>> COCA looks like a good one. But could I have a copy of the corpus and
>> run my own programs on it? The web interface cannot meet my requirements.
>>
>> Thanks,
>>
>> Lushan Han
>>
>>
>>
>> On Mon, Jul 16, 2012 at 5:48 PM, Mark Davies <Mark_Davies at byu.edu> wrote:
>>
>>> >> Does anyone know where or how I can get a well-balanced corpus of
>>> modern English, such as BNC, but with a much larger size?  I hope it can
>>> have at least 1 billion words
>>>
>>> It's only 450 million words, but you might try:
>>> http://corpus.byu.edu/coca (COCA)
>>>
>>> It is divided evenly into spoken, fiction, popular magazines,
>>> newspapers, and academic, each with 90-95 million words.
>>>
>>> It is also much more recent than the BNC. COCA has 20 million words each
>>> year, 1990-2012 (compared to the 1993 end date of the BNC).
>>>
>>> Finally, it has the same genre balance each year, which makes it nice
>>> for looking at recent changes in English; see:
>>>
>>> Davies, Mark. (2011) "The Corpus of Contemporary American English as the
>>> First Reliable Monitor Corpus of English". Literary and Linguistic
>>> Computing 25: 447-65.
>>>
>>> Best,
>>>
>>> Mark Davies
>>>
>>> ============================================
>>> Mark Davies
>>> Professor of Linguistics / Brigham Young University
>>> http://davies-linguistics.byu.edu/
>>> ** Corpus design and use // Linguistic databases **
>>> ** Historical linguistics // Language variation **
>>> ** English, Spanish, and Portuguese **
>>> ============================================
>>>
>>>
>>>
>>>
>>> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of
>>> Lushan Han [lushan1 at umbc.edu]
>>> Sent: Monday, July 16, 2012 1:10 PM
>>> To: corpora at uib.no
>>> Subject: [Corpora-List] ask for very large, well-balanced corpus
>>>
>>>
>>> Dear all,
>>>
>>> Does anyone know where or how I can get a well-balanced corpus of modern
>>> English, such as BNC, but with a much larger size? I hope it can have at
>>> least 1 billion words. I tried to assemble a corpus from Wikipedia articles
>>> but it turned out that such a corpus is not balanced. Wikipedia contains
>>> many repetitions of the same type of articles, for example, films or birds.
>>>
>>> A Web corpus should be okay for my purpose as long as it was harvested
>>> from balanced domains.
>>>
>>>
>>> Thanks,
>>>
>>> Lushan Han
>>>
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
>
> --
> ========================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at lexmasterclass.com
> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk/>
>
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk/>
> *DANTE: a lexical database for English* <http://www.webdante.com/>
> ========================================
>
>