[Corpora-List] Boot Camp (Continued...)

Mark Davies Mark_Davies at byu.edu
Mon Aug 18 18:42:58 UTC 2008


> > Corpora used to be tiny (1 million words traditionally) ... However, John Sinclair brought them to the point at
> > which they are in excess of half a billion words of running text.
>
> Except that the damned thing is not actually publicly
> available! It consists of copyrighted text.  Damn!  Those of
> us who actually need to do things outside of its limited,
> proscribed "allowed" usage are shit-outa-luck, and are
> reduced to analyzing Wikipedia, Project Gutenberg, and
> arxiv.org. I'd like to have access to a larger body of text
> that is more varied and diverse.

How about the "Corpus of Contemporary American English" (www.americancorpus.org)?

It's at 370+ million words already, and it is recent. Currently it has texts through Dec 2007, and will be current up through June 30, 2008 in about one week (when it will be at 380+ million words). It is balanced (for each year, and therefore overall as well) across spoken, fiction, popular magazines, newspaper, and academic (cf. the very newspaper-centric BoE). It also has a very robust architecture, with many different types of searches (including full collocates searching, B.L., as well as one-click comparisons of collocates between different words or between different genres).

And it's free.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


