[Corpora-List] 155 *billion* (155,000,000,000) word corpus of American English
Angus Grieve-Smith
grvsmth at panix.com
Thu May 12 16:37:46 UTC 2011
On 5/12/2011 11:15 AM, Mark Davies wrote:
>>> Is the corpus itself or part of it available for downloading? It would be more useful if we could process the raw text for our own purpose rather than accessing it from a web interface.
> As mentioned previously, the underlying n-grams data is freely available from Google at http://ngrams.googlelabs.com/datasets (see http://creativecommons.org/licenses/by/3.0/ re. licensing).
When I try to use the web interface, I get "Session expired. Click here
<http://googlebooks.byu.edu/> to start new session."
In theory, though, all the books are available for free from
http://books.google.com/ . In the Google ngram interface at
http://ngrams.googlelabs.com/ there are links to date ranges. If you
click on one, you will see the results for the search term within that
date range on the Google Books website. You can then click the "Plain
text" link in the upper right-hand corner to see the OCRed text. Then
you can appreciate how rough some of the OCR has been.
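
For anyone who prefers to process the downloadable n-gram files
directly, here is a minimal Python sketch. It assumes each line is
tab-separated with the n-gram, year, and match count in the first three
columns (followed by page and volume counts), which is my understanding
of the 2009 release; check the dataset page before relying on it. The
function name and the sample lines are invented for illustration.

```python
import io
from collections import defaultdict


def total_counts_by_ngram(lines):
    """Sum the match count over all years for each n-gram.

    Assumed line layout (an assumption, not verified here):
    ngram <TAB> year <TAB> match_count <TAB> page_count <TAB> volume_count
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return dict(totals)


# Invented sample data standing in for a real downloaded file.
sample = io.StringIO(
    "corpus\t1990\t12\t10\t8\n"
    "corpus\t1991\t15\t11\t9\n"
    "corpora\t1990\t3\t3\t2\n"
)
totals = total_counts_by_ngram(sample)
print(totals)
```

With a real file you would pass an open file object instead of the
in-memory sample. Note that these are only n-gram counts, so this does
not give you the running text asked about above.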
--
-Angus B. Grieve-Smith
grvsmth at panix.com