[Corpora-List] 3B+ words of English from webpages

Milos Jakubicek jak at fi.muni.cz
Thu May 9 14:10:19 UTC 2013


Dear all,

another available corpus is ClueWeb09 (see
http://lemurproject.org/clueweb09/; half a billion English web
pages).  At Masaryk University we took this large web crawl and
performed cleaning, paragraph-level deduplication, POS tagging and
lemmatisation (with TreeTagger), yielding a 70-billion-word English
corpus, which we have installed in our local installation of
Sketch Engine, available via http://corpora.fi.muni.cz/.
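
For the curious, the paragraph-level deduplication step amounts to
hashing each paragraph and keeping only its first occurrence. A
minimal exact-match sketch in Python (the actual tools are the ones
described in the LREC paper linked below):

import hashlib
import sys

seen = set()

def paragraphs(stream):
    # Treat blank lines as paragraph boundaries.
    buf = []
    for line in stream:
        if line.strip():
            buf.append(line)
        elif buf:
            yield "".join(buf)
            buf = []
    if buf:
        yield "".join(buf)

for para in paragraphs(sys.stdin):
    # Keep only the first occurrence of each paragraph.
    digest = hashlib.md5(para.encode("utf-8")).digest()
    if digest not in seen:
        seen.add(digest)
        sys.stdout.write(para + "\n")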

Anyone wanting access to the encoded corpus first has to sign a
licence agreement with Carnegie Mellon (no fee); we shall then give
them access, also free of charge. Anyone wanting the whole dataset
should first obtain the original data and then apply the same
pipeline (all open source except TreeTagger, which is free for
academic use) to prepare the same corpus as described at LREC last
year (see
http://www.lrec-conf.org/proceedings/lrec2012/pdf/1047_Paper.pdf).

Best regards,
Miloš Jakubíček

Natural Language Processing Centre
Faculty of Informatics, Masaryk University
Brno, Czech Republic

2013/5/3 Mark Fishel <fishel at ut.ee>:
> Dear all,
>
> Nice resource, especially because of the paragraph splitting.
>
> Another huge (30B+ words), freely available English corpus is based on
> Usenet: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html;
> it is, however, mostly raw text.
>
> Best regards,
> Mark
>
> On Thu, May 2, 2013 at 7:20 PM, Lushan Han <lushan1 at umbc.edu> wrote:
>> Dear all,
>>
>> A comparison between this 3-billion-word text corpus and the LDC
>> Gigaword corpus can be found at
>> http://swoogle.umbc.edu/SimService/top_similarity.html .
>>
>> Happy news for those looking for a large, free text corpus. I had been
>> looking for such a corpus but failed to find one meeting my requirements,
>> so I created my own.
>>
>>
>> Best regards,
>>
>> Lushan
>>
>>
>> On Thu, May 2, 2013 at 1:02 PM, Craig Pfeifer <craig.pfeifer at gmail.com>
>> wrote:
>>>
>>> Original announcement:
>>>
>>> http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
>>>
>>> Please direct all inquiries to Dr. Tim Finin:
>>> http://www.csee.umbc.edu/~finin/
>>>
>>> The UMBC WebBase corpus is a dataset of high quality English paragraphs
>>> containing over three billion words derived from the Stanford WebBase
>>> project’s February 2007 Web crawl. Compressed, its size is about 13GB. We
>>> have found it useful for building statistical language models that
>>> characterize English text found on the Web.
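>>>
>>> As a toy illustration of that use (not the actual models; the file
>>> name is a placeholder), one can estimate bigram probabilities from
>>> the corpus paragraphs:
>>>
>>> import collections
>>> import re
>>>
>>> unigrams = collections.Counter()
>>> bigrams = collections.Counter()
>>> with open("webbase_corpus.txt", encoding="utf-8") as f:  # placeholder path
>>>     for paragraph in f:
>>>         tokens = re.findall(r"[a-z']+", paragraph.lower())
>>>         unigrams.update(tokens)
>>>         bigrams.update(zip(tokens, tokens[1:]))
>>>
>>> def bigram_prob(w1, w2):
>>>     # Maximum-likelihood estimate of P(w2 | w1), without smoothing.
>>>     return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0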
>>>
>>> The February 2007 Stanford WebBase crawl is one of their largest
>>> collections and contains 100 million web pages from more than 50,000
>>> websites. The Stanford WebBase project did an excellent job of extracting
>>> textual content from the HTML, but there are still many instances of
>>> duplicated text, truncated text, non-English text and strange characters.
>>>
>>> We processed the collection to remove undesired sections and produce high
>>> quality English paragraphs. We detected paragraphs using heuristic rules and
>>> only retained those whose length was at least two hundred characters. We
>>> eliminated non-English text by checking the first twenty words of a
>>> paragraph to see if they were valid English words. We used the percentage of
>>> punctuation characters in a paragraph as a simple check for typical text. We
>>> removed duplicate paragraphs using a hash table. The result is a corpus with
>>> approximately three billion words of good-quality English.
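>>>
>>> A rough Python sketch of those filters (the 10% punctuation
>>> threshold, the 80% English-word cutoff and the wordlist file are
>>> placeholders; the exact values are not given above):
>>>
>>> import hashlib
>>> import re
>>> import string
>>>
>>> english_words = set(w.strip().lower() for w in open("words.txt"))  # placeholder wordlist
>>> seen = set()
>>>
>>> def keep(paragraph):
>>>     # Retain only paragraphs of at least two hundred characters.
>>>     if len(paragraph) < 200:
>>>         return False
>>>     # Check the first twenty words against an English wordlist.
>>>     words = re.findall(r"[A-Za-z']+", paragraph)[:20]
>>>     if not words:
>>>         return False
>>>     if sum(w.lower() in english_words for w in words) / len(words) < 0.8:  # assumed cutoff
>>>         return False
>>>     # Reject paragraphs with an unusually high share of punctuation.
>>>     if sum(c in string.punctuation for c in paragraph) / len(paragraph) > 0.1:  # assumed threshold
>>>         return False
>>>     # Drop exact duplicates via a hash table.
>>>     digest = hashlib.md5(paragraph.encode("utf-8")).digest()
>>>     if digest in seen:
>>>         return False
>>>     seen.add(digest)
>>>     return True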
>>>
>>> The dataset has been used in several projects. If you use the dataset,
>>> please refer to it by citing the following paper, which describes it and its
>>> use in a system that measures the semantic similarity of short text
>>> sequences.
>>>
>>> Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Jonathan
>>> Weese, UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems, Proc. 2nd
>>> Joint Conference on Lexical and Computational Semantics, Association for
>>> Computational Linguistics, June 2013.
>>>
>>>
>>> ______________
>>> craig.pfeifer at gmail.com
>>>

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

