[Corpora-List] Boot Camp (Continued...)
Linas Vepstas
linasvepstas at gmail.com
Mon Aug 18 19:01:34 UTC 2008
2008/8/18 Mark Davies <Mark_Davies at byu.edu>:
>> > Corpora used to be tiny (1 million words traditionally) ... However, John Sinclair brought them to the point at
>> > which they are in excess of half a billion words of running text.
>>
>> Except that the damned thing is not actually publicly
>> available! It consists of copyrighted text. Damn! Those of
>> us who actually need to do things outside of its limited,
>> proscribed "allowed" usage are shit-outa-luck, and are
>> reduced to analyzing Wikipedia, Project Gutenberg, and
>> arxiv.org. I'd like to have access to a larger body of text
>> that is more varied and diverse.
>
> How about the "Corpus of Contemporary American English" (www.americancorpus.org)?
>
> with many different types of searches
Err.. I need access to the original source, so that I can
run it through my tools. I suppose that exploring the
landscape by looking through a peephole in a wall
is appropriate for some approaches, but clearly fails
if what you need is not visible in the peephole.
--linas
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora