[Corpora-List] Boot Camp (Continued...)

Linas Vepstas linasvepstas at gmail.com
Mon Aug 18 19:01:34 UTC 2008


2008/8/18 Mark Davies <Mark_Davies at byu.edu>:
>> > Corpora used to be tiny (1 million words traditionally) ... However, John Sinclair brought them to the point at
>> > which they are in excess of half a billion words of running text.
>>
>> Except that the damned thing is not actually publicly
>> available! It consists of copyrighted text.  Damn!  Those of
>> us who actually need to do things outside of its limited,
>> proscribed "allowed" usage are shit-outa-luck, and are
>> reduced to analyzing Wikipedia, Project Gutenberg, and
>> arxiv.org. I'd like to have access to a larger body of text
>> that is more varied and diverse.
>
> How about the "Corpus of Contemporary American English" (www.americancorpus.org)?
>
>  with many different types of searches

Err.. I need access to the original source, so that I can
run it through my tools.  I suppose that exploring the
landscape by looking through a peephole in a wall
is appropriate for some approaches, but clearly fails
if what you need is not visible in the peephole.

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


