[Corpora-List] 3B+ words of English from webpages

Mark Fishel fishel at ut.ee
Fri May 3 08:00:49 UTC 2013


Dear all,

Nice resource, especially because of the paragraph splitting.

Another huge (30B+ words), freely available English corpus is based on
Usenet: http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html;
it is, however, mostly raw text.

Best regards,
Mark

On Thu, May 2, 2013 at 7:20 PM, Lushan Han <lushan1 at umbc.edu> wrote:
> Dear all,
>
> A comparison between this 3-billion-word text corpus and the LDC Gigaword
> corpus can be found at
> http://swoogle.umbc.edu/SimService/top_similarity.html .
>
> Happy news for those looking for a large and free text corpus. I had been
> looking for such a corpus but failed to find one meeting my requirements, so
> I created my own.
>
>
> Best regards,
>
> Lushan
>
>
> On Thu, May 2, 2013 at 1:02 PM, Craig Pfeifer <craig.pfeifer at gmail.com>
> wrote:
>>
>> Original announcement:
>>
>> http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
>>
>> Please direct all inquiries to Dr. Tim Finin:
>> http://www.csee.umbc.edu/~finin/
>>
>> The UMBC WebBase corpus is a dataset of high quality English paragraphs
>> containing over three billion words derived from the Stanford WebBase
>> project’s February 2007 Web crawl. Compressed, its size is about 13GB. We
>> have found it useful for building statistical language models that
>> characterize English text found on the Web.
>>
>> The February 2007 Stanford WebBase crawl is one of their largest
>> collections and contains 100 million web pages from more than 50,000
>> websites. The Stanford WebBase project did an excellent job of extracting
>> textual content from HTML, but there are still many instances of duplicated
>> text, truncated text, non-English text and strange characters.
>>
>> We processed the collection to remove undesired sections and produce
>> high-quality English paragraphs. We detected paragraphs using heuristic
>> rules and only retained those whose length was at least two hundred
>> characters. We eliminated non-English text by checking whether the first
>> twenty words of a paragraph were valid English words. We used the
>> percentage of punctuation characters in a paragraph as a simple check for
>> typical text. We removed duplicate paragraphs using a hash table. The
>> result is a corpus of approximately three billion words of good-quality
>> English.
>>
>> The dataset has been used in several projects. If you use the dataset,
>> please refer to it by citing the following paper, which describes it and its
>> use in a system that measures the semantic similarity of short text
>> sequences.
>>
>> Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan
>> Weese, UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems, Proc. 2nd
>> Joint Conference on Lexical and Computational Semantics, Association for
>> Computational Linguistics, June 2013.
>>
>>
>> ______________
>> craig.pfeifer at gmail.com
>>

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

