[Corpora-List] 3B+ words of English from webpages

Milos Jakubicek jak at fi.muni.cz
Thu May 9 14:10:19 UTC 2013

Dear all,

Another available corpus is ClueWeb09 (see
http://lemurproject.org/clueweb09/, half a billion web pages of
English).  At Masaryk University we took this large web crawl and
performed cleaning, paragraph-level deduplication, POS-tagging and
lemmatisation (TreeTagger).  The result is a 70-billion-word English
corpus, which we have loaded into our local installation of
Sketch Engine, available via http://corpora.fi.muni.cz/.
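
For readers who have not worked with tagged corpora before, here is a
minimal sketch of reading POS-tagged, lemmatised text. It assumes one
token per line in the tab-separated form "word<TAB>tag<TAB>lemma", with
structural markup such as <p> on lines of its own. That layout, the
sample tags and the names Token and parse_tagged_lines are illustrative
assumptions, not a description of the actual corpus files.

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """One corpus position: surface form, POS tag and lemma."""
    word: str
    tag: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield a Token for every line with exactly three tab-separated fields.

    Blank lines and structural markup (e.g. "<p>") have no tabs, so they
    fail the three-field check and are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip markup and malformed lines rather than guess
        yield Token(*parts)


# Hypothetical sample input, assumed for illustration only.
sample = ["<p>", "The\tDT\tthe", "cats\tNNS\tcat", "sleep\tVBP\tsleep", "</p>"]
tokens = list(parse_tagged_lines(sample))
lemmas = [t.lemma for t in tokens]
```

Keeping the parser tolerant of non-token lines means the same function
can be run over a file that mixes markup and tokens.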

Anyone wanting access to the encoded corpus must first sign a
licence agreement with Carnegie Mellon (no fee); we will then give
them access (also free of charge).  Anyone wanting the whole dataset
should first obtain the original data and can then apply the same
pipeline (all open-source except TreeTagger, which is free to
academics) to prepare the same corpus as described at LREC last year
(see

Best regards,
Miloš Jakubíček

Natural Language Processing Centre
Faculty of Informatics, Masaryk University
Brno, Czech Republic

2013/5/3 Mark Fishel <fishel at ut.ee>:
> Dear all,
> nice resource, especially because of paragraph splitting.
> Another freely available huge (30B+ words) English corpus is based on
> Usenet: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html;
> it is, however, mostly raw text.
> Best regards,
> Mark
> On Thu, May 2, 2013 at 7:20 PM, Lushan Han <lushan1 at umbc.edu> wrote:
>> Dear all,
>> A comparison between this 3-billion-word text corpus and the LDC
>> Gigaword corpus can be found at
>> http://swoogle.umbc.edu/SimService/top_similarity.html .
>> Happy news for those looking for a large and free text corpus. I had been
>> looking for such a corpus but failed to find one meeting my requirements,
>> so I created my own.
>> Best regards,
>> Lushan
>> On Thu, May 2, 2013 at 1:02 PM, Craig Pfeifer <craig.pfeifer at gmail.com>
>> wrote:
>>> Original announcement:
>>> http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
>>> Please direct all inquiries to Dr. Tim Finin:
>>> http://www.csee.umbc.edu/~finin/
>>> The UMBC WebBase corpus is a dataset of high quality English paragraphs
>>> containing over three billion words derived from the Stanford WebBase
>>> project’s February 2007 Web crawl. Compressed, its size is about 13GB. We
>>> have found it useful for building statistical language models that
>>> characterize English text found on the Web.
>>> The February 2007 Stanford WebBase crawl is one of their largest
>>> collections and contains 100 million web pages from more than 50,000
>>> websites. The Stanford WebBase project did an excellent job in extracting
>>> textual content from HTML tags but there are still many instances of text
>>> duplications, truncated texts, non-English texts and strange characters.
>>> We processed the collection to remove undesired sections and produce high
>>> quality English paragraphs. We detected paragraphs using heuristic rules and
>>> only retained those whose length was at least two hundred characters. We
>>> eliminated non-English text by checking the first twenty words of a
>>> paragraph to see if they were valid English words. We used the percentage of
>>> punctuation characters in a paragraph as a simple check for typical text. We
>>> removed duplicate paragraphs using a hash table. The result is a corpus with
>>> approximately three billion words of good quality English.
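
The filtering steps described in the paragraph above can be sketched
roughly as follows. This is a minimal illustration under stated
assumptions, not the UMBC code: the 200-character minimum and the
twenty-word window come from the announcement, while the two ratio
thresholds, the toy vocabulary and all function names are assumed here
for the sake of the example.

```python
import string
from typing import Iterable, Iterator, Set

MIN_CHARS = 200          # from the announcement
WORDS_TO_CHECK = 20      # from the announcement
MIN_ENGLISH_RATIO = 0.8  # assumed threshold, not given in the announcement
MAX_PUNCT_RATIO = 0.2    # assumed threshold, not given in the announcement


def looks_english(paragraph: str, vocabulary: Set[str]) -> bool:
    """Check the first WORDS_TO_CHECK words against a word list."""
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:WORDS_TO_CHECK]]
    words = [w for w in words if w]
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph: str) -> float:
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(1 for ch in paragraph if ch in string.punctuation) / len(paragraph)


def clean_paragraphs(paragraphs: Iterable[str],
                     vocabulary: Set[str]) -> Iterator[str]:
    """Yield paragraphs that pass all filters, dropping exact duplicates."""
    seen: Set[str] = set()  # a Python set is a hash table
    for paragraph in paragraphs:
        text = paragraph.strip()
        if len(text) < MIN_CHARS:
            continue
        if not looks_english(text, vocabulary):
            continue
        if punctuation_ratio(text) > MAX_PUNCT_RATIO:
            continue
        if text in seen:
            continue
        seen.add(text)
        yield text


# Toy data, assumed for illustration only.
vocab = {"the", "corpus", "contains", "many", "english",
         "paragraphs", "of", "web", "text"}
good = "The corpus contains many English paragraphs of web text. " * 5
short = "Too short."
foreign = "Der Korpus enthält viele deutsche Absätze aus dem Netz. " * 5
kept = list(clean_paragraphs([good, short, foreign, good], vocab))
```

Storing whole paragraphs in the set is the simplest choice; for a crawl
of this size one would more likely store fixed-size digests instead,
trading a tiny collision risk for much lower memory use.
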
>>> The dataset has been used in several projects. If you use the dataset,
>>> please refer to it by citing the following paper, which describes it and its
>>> use in a system that measures the semantic similarity of short text
>>> sequences.
>>> Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan
>>> Weese, UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems, Proc. 2nd
>>> Joint Conference on Lexical and Computational Semantics, Association for
>>> Computational Linguistics, June 2013.
>>> ______________
>>> craig.pfeifer at gmail.com
>>> _______________________________________________
>>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>> Corpora mailing list
>>> Corpora at uib.no
>>> http://mailman.uib.no/listinfo/corpora
