[Corpora-List] 3B+ words of English from webpages

Craig Pfeifer craig.pfeifer at gmail.com
Thu May 2 17:02:27 UTC 2013


Original announcement:
http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/

Please direct all inquiries to Dr. Tim Finin:
http://www.csee.umbc.edu/~finin/

The UMBC WebBase corpus <http://ebiq.org/r/351> is a dataset of high
quality English paragraphs containing over three billion words derived from
the Stanford WebBase project’s <http://bit.ly/WebBase> February 2007 Web
crawl. Compressed, its size is about 13GB. We have found it useful for
building statistical language models that characterize English text found
on the Web.

The February 2007 Stanford WebBase crawl is one of their largest
collections and contains 100 million web pages from more than 50,000
websites. The Stanford WebBase project did an excellent job of extracting
textual content from HTML tags, but many instances of duplicated text,
truncated text, non-English text and strange characters remain.

We processed the collection to remove undesired sections and produce high
quality English paragraphs. We detected paragraphs using heuristic rules
and only retained those whose length was at least two hundred characters.
We eliminated non-English text by checking the first twenty words of a
paragraph to see if they were valid English words. We used the percentage
of punctuation characters in a paragraph as a simple check for typical
text. We removed duplicate paragraphs using a hash table. The result is a
corpus with approximately three billion words of good quality English.
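
The filtering steps above can be sketched roughly as follows. This is only
an illustration, not the code used to build the corpus. The 200-character
minimum and the twenty-word check come from the description above. The
vocabulary set, the two ratio thresholds and the function names are
assumptions made for the example, and a Python set of SHA-1 digests stands
in for the hash table.

```python
import hashlib
import string

MIN_CHARS = 200          # from the announcement: at least two hundred characters
LANG_CHECK_WORDS = 20    # from the announcement: first twenty words are checked
MIN_ENGLISH_RATIO = 0.8  # assumption: the announcement gives no threshold
MAX_PUNCT_RATIO = 0.15   # assumption: the announcement gives no threshold


def looks_english(paragraph, vocabulary):
    """Check the first twenty words against a set of known English words."""
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:LANG_CHECK_WORDS]]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph):
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(1 for ch in paragraph if ch in string.punctuation) / len(paragraph)


def filter_paragraphs(paragraphs, vocabulary):
    """Yield paragraphs that pass the length, language, punctuation
    and duplicate checks, in their original order."""
    seen = set()  # digests of paragraphs already emitted
    for paragraph in paragraphs:
        text = " ".join(paragraph.split())  # normalize whitespace
        if len(text) < MIN_CHARS:
            continue
        if not looks_english(text, vocabulary):
            continue
        if punctuation_ratio(text) > MAX_PUNCT_RATIO:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Storing fixed-size digests rather than the paragraphs themselves keeps the
duplicate check small in memory, which matters at the scale of a 100 million
page crawl. A real run would also need a much larger word list than any toy
vocabulary.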

The dataset has been used in several projects. If you use the dataset,
please refer to it by citing the following paper, which describes it and
its use in a system that measures the semantic similarity of short text
sequences.

   Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan
   Weese, UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems
   <http://ebiquity.umbc.edu/paper/html/id/621/>,
   Proc. 2nd Joint Conference on Lexical and Computational Semantics,
   Association for Computational Linguistics, June 2013.
   (bibtex <http://ebiquity.umbc.edu/paper/bibtex/id/621/UMBC-EBIQUITY-CORE-Semantic-Textual-Similarity-Systems>)


______________
craig.pfeifer at gmail.com