[Corpora-List] 3B+ words of English from webpages

Lushan Han lushan1 at umbc.edu
Thu May 2 17:20:12 UTC 2013


Dear all,

A comparison between this 3 billion words text corpus and LDC gigawords
corpus can be found at
http://swoogle.umbc.edu/SimService/top_similarity.html .
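
If you would rather script the comparison than use the web page, the
similarity service behind that demo also answers plain HTTP GET
requests. Below is a minimal Python sketch; the GetSimilarity endpoint
and its operation/phrase1/phrase2 parameters are recalled from the
service's API notes and should be treated as assumptions, so check the
page above if a request fails.

    import urllib.parse
    import urllib.request

    # Minimal client for the UMBC SimService demo. The endpoint path and
    # parameter names are assumptions recalled from the service's API notes.
    def umbc_similarity(phrase1, phrase2):
        base = "http://swoogle.umbc.edu/SimService/GetSimilarity"
        query = urllib.parse.urlencode(
            {"operation": "api", "phrase1": phrase1, "phrase2": phrase2})
        with urllib.request.urlopen(base + "?" + query) as resp:
            # Expected reply: a bare similarity score (an assumption
            # about the response format).
            return float(resp.read().decode("utf-8").strip())

    print(umbc_similarity("large text corpus", "big collection of documents"))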

Happy news for those looking for a large, free text corpus. I had been
looking for such a corpus but failed to find one that met my
requirements, so I created my own.


Best regards,

Lushan


On Thu, May 2, 2013 at 1:02 PM, Craig Pfeifer <craig.pfeifer at gmail.com> wrote:

> Original announcement:
>
> http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/
>
> Please direct all inquiries to Dr. Tim Finin:
> http://www.csee.umbc.edu/~finin/
>
> The UMBC WebBase corpus <http://ebiq.org/r/351> is a dataset of high
> quality English paragraphs containing over three billion words derived from
> the Stanford WebBase project’s <http://bit.ly/WebBase> February 2007 Web
> crawl. Compressed, its size is about 13GB. We have found it useful for
> building statistical language models that characterize English text found
> on the Web.
>
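
As one concrete example of the language-model use mentioned above, the
sketch below streams a compressed dump of the corpus and accumulates
unigram and bigram counts. The file layout (gzip'd text files, one
paragraph per line) is an assumption about the distribution format,
not something the announcement specifies.

    import glob
    import gzip
    from collections import Counter

    # Count unigrams and bigrams over the corpus. The glob pattern and
    # one-paragraph-per-line layout are assumptions; adjust to the real
    # archive structure after unpacking.
    unigrams, bigrams = Counter(), Counter()
    for path in glob.glob("webbase_all/*.txt.gz"):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for paragraph in f:
                tokens = paragraph.lower().split()
                unigrams.update(tokens)
                bigrams.update(zip(tokens, tokens[1:]))

    print(unigrams.most_common(10))

From these counts a simple bigram model is just
count(w1, w2) / count(w1), with smoothing added as needed.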
> The February 2007 Stanford WebBase crawl is one of their largest
> collections and contains 100 million web pages from more than 50,000
> websites. The Stanford WebBase project did an excellent job of extracting
> textual content from HTML, but there are still many duplicated passages,
> truncated texts, non-English texts, and strange characters.
>
> We processed the collection to remove undesired sections and produce high
> quality English paragraphs. We detected paragraphs using heuristic rules
> and retained only those whose length was at least two hundred characters.
> We eliminated non-English text by checking the first twenty words of a
> paragraph to see if they were valid English words. We used the percentage
> of punctuation characters in a paragraph as a simple check for well-formed
> running text. We removed duplicate paragraphs using a hash table. The result is a
> corpus with approximately three billion words of good quality English.
>
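
The heuristics in the paragraph above translate almost directly into
code. The following is an illustrative reconstruction, not the original
pipeline: the 200-character minimum and the 20-word English check are
stated above, while the dictionary, the match threshold, and the
punctuation cutoff are stand-in assumptions.

    import hashlib

    # Stand-in dictionary; the real pipeline would use a full English word list.
    ENGLISH_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that",
                     "it", "for", "was", "on", "are", "with", "as"}
    PUNCTUATION = set(".,;:!?\"'()-")

    def keep_paragraph(paragraph, seen_hashes):
        if len(paragraph) < 200:                     # stated length rule
            return False
        first20 = paragraph.lower().split()[:20]     # stated English check
        hits = sum(1 for w in first20
                   if w.strip(".,;:!?\"'()-") in ENGLISH_WORDS)
        if hits < 3:                                 # assumed threshold
            return False
        punct_ratio = sum(c in PUNCTUATION for c in paragraph) / len(paragraph)
        if punct_ratio > 0.10:                       # assumed cutoff
            return False
        digest = hashlib.md5(paragraph.encode("utf-8")).hexdigest()
        if digest in seen_hashes:                    # hash-table deduplication
            return False
        seen_hashes.add(digest)
        return True

Here seen_hashes is an ordinary Python set playing the role of the hash
table mentioned above; at three billion words one would likely swap it
for a disk-backed store.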
> The dataset has been used in several projects. If you use the dataset,
> please refer to it by citing the following paper, which describes it and
> its use in a system that measures the semantic similarity of short text
> sequences.
>
>    Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan
>    Weese, UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems
>    <http://ebiquity.umbc.edu/paper/html/id/621/>, Proc. 2nd Joint
>    Conference on Lexical and Computational Semantics, Association for
>    Computational Linguistics, June 2013.
>    (bibtex: <http://ebiquity.umbc.edu/paper/bibtex/id/621/UMBC-EBIQUITY-CORE-Semantic-Textual-Similarity-Systems>)
>
>
> ______________
> craig.pfeifer at gmail.com
>