[Corpora-List] Boot Camp (Continued...)

Linas Vepstas linasvepstas at gmail.com
Mon Aug 18 17:49:16 UTC 2008


2008/8/18 W. Louw <louw at mango.zw>:

> Corpora used to be tiny (1 million words traditionally) and held on heavy tapes
> that could be thrown onto desks with a shout of _habeas corpus!_ (heard less
> frequently since Gitmo). However, John Sinclair brought them to the point at
> which they are in excess of half a billion words of running text.

Except that the damned thing is not actually publicly
available! It consists of copyrighted text.  Damn!  Those of
us who actually need to do things outside of its limited,
prescribed "allowed" usage are shit-outa-luck, and are
reduced to analyzing Wikipedia, Project Gutenberg, and
arxiv.org. I'd like to have access to a larger body of text
that is more varied and diverse.

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


