[Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English

Angus Grieve-Smith grvsmth at panix.com
Thu May 12 16:37:46 UTC 2011


On 5/12/2011 11:15 AM, Mark Davies wrote:
>>> Is the corpus itself or part of it available for downloading? It would be more useful if we could process the raw text for our own purpose rather than accessing it from a web interface.
> As mentioned previously, the underlying n-grams data is freely available from Google at http://ngrams.googlelabs.com/datasets (see http://creativecommons.org/licenses/by/3.0/ re. licensing).

     When I try to use it, I get "Session expired. Click here 
<http://googlebooks.byu.edu/> to start new session."

     In theory, though, all the books are available for free from 
http://books.google.com/ .  In the Google ngram interface at 
http://ngrams.googlelabs.com/ there are links to date ranges.  If you 
click on those you will see a date range result for the search term on 
the Google Books website.  You can then click the "Plain text" link in 
the upper right-hand corner to see the OCRed text.  Then you can 
appreciate how rough some of the OCR has been.

-- 
				-Angus B. Grieve-Smith
				grvsmth at panix.com
