<div dir="ltr">Dear all,<div><br></div><div style>A comparison between this three-billion-word text corpus and the LDC Gigaword corpus can be found at <a href="http://swoogle.umbc.edu/SimService/top_similarity.html">http://swoogle.umbc.edu/SimService/top_similarity.html</a>.</div>

<div style><br></div><div style>Happy news for those looking for a large, free text corpus. I had been looking for such a corpus but failed to find one that met my requirements, so I created my own.</div><div style><br>

</div><div style><br></div><div style>Best regards,</div><div style><br></div><div style>Lushan</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, May 2, 2013 at 1:02 PM, Craig Pfeifer <span dir="ltr">&lt;<a href="mailto:craig.pfeifer@gmail.com" target="_blank">craig.pfeifer@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Original announcement:<div><a href="http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/" target="_blank">http://ebiquity.umbc.edu/blogger/2013/05/01/umbc-webbase-corpus-of-3b-english-words/</a></div>


<div><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px"><br></span></div><div><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px">Please direct all inquiries to Dr. Tim Finin:</span></div>


<div><a href="http://www.csee.umbc.edu/~finin/" target="_blank">http://www.csee.umbc.edu/~finin/</a><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px"><br></span></div>

<div>

<span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px"><br></span></div><div><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px">The </span><a href="http://ebiq.org/r/351" style="font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px;color:rgb(51,153,255);text-decoration:none" target="_blank">UMBC WebBase corpus</a><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px"> </span><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px">is a dataset of high quality English paragraphs containing over three billion words derived from the</span><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px"> </span><a href="http://bit.ly/WebBase)" style="font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px;color:rgb(51,153,255);text-decoration:none" target="_blank">Stanford WebBase project’s</a><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px"> </span><span style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px;line-height:16px">February 2007 Web crawl. Compressed, its size is about 13GB. We have found it useful for building statistical language models that characterize English text found on the Web.</span><br>


</div><div><p style="font-size:12px;color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;line-height:16px">The February 2007 Stanford WebBase crawl is one of their largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job of extracting textual content from HTML, but there are still many instances of duplicated text, truncated text, non-English text and strange characters.</p>


<p style="font-size:12px;color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;line-height:16px">We processed the collection to remove undesired sections and produce high-quality English paragraphs. We detected paragraphs using heuristic rules and retained only those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good-quality English.</p>
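<p style="font-size:12px;color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;line-height:16px">The filtering steps described above could be sketched roughly as follows. This is not the code used to build the corpus, only a minimal illustration under stated assumptions. The 200-character minimum and the twenty-word window come from the description; the two ratio thresholds, the <code>vocabulary</code> word set and all function names are hypothetical choices made for the example.</p>
<pre>
```python
import hashlib
import string

MIN_CHARS = 200          # from the description: keep paragraphs of at least 200 characters
FIRST_WORDS = 20         # from the description: inspect the first twenty words
MIN_ENGLISH_RATIO = 0.8  # assumption: share of inspected words that must be known
MAX_PUNCT_RATIO = 0.15   # assumption: upper bound on the share of punctuation characters


def looks_english(paragraph, vocabulary):
    """Return True if most of the first FIRST_WORDS words appear in `vocabulary`."""
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:FIRST_WORDS]]
    words = [w for w in words if w]
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph):
    """Return the fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    punct = sum(1 for ch in paragraph if ch in string.punctuation)
    return punct / len(paragraph)


def filter_paragraphs(paragraphs, vocabulary):
    """Keep long, English-looking, typical, previously unseen paragraphs."""
    seen = set()  # a Python set is a hash table; digests keep memory use modest
    kept = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if MIN_CHARS > len(text):
            continue  # too short
        if not looks_english(text, vocabulary):
            continue  # probably not English
        if punctuation_ratio(text) > MAX_PUNCT_RATIO:
            continue  # too much punctuation to be typical text
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier paragraph
        seen.add(digest)
        kept.append(text)
    return kept
```
</pre>
<p style="font-size:12px;color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;line-height:16px">In this sketch, <code>filter_paragraphs(paragraphs, vocabulary)</code> takes any iterable of paragraph strings and a set of lowercase English words, and returns the surviving paragraphs in their original order. Hashing catches exact duplicates only; near-duplicates would need a different technique.</p>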


<p style="font-size:12px;color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;line-height:16px">The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.</p>


<ul style="color:rgb(51,51,51);font-family:Arial,Helvetica,Verdana,sans-serif;font-size:12px">Lushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield and Johnathan Weese, <a href="http://ebiquity.umbc.edu/paper/html/id/621/" style="color:rgb(51,153,255);text-decoration:none" target="_blank">UMBC EBIQUITY-CORE: Semantic Textual Similarity Systems</a>, Proc. 2nd Joint Conference on Lexical and Computational Semantics, Association for Computational Linguistics, June 2013. (<a href="http://ebiquity.umbc.edu/paper/bibtex/id/621/UMBC-EBIQUITY-CORE-Semantic-Textual-Similarity-Systems" style="color:rgb(51,153,255);text-decoration:none" target="_blank">bibtex</a>)</ul>


</div><div><br clear="all"><div>______________<br><a href="mailto:craig.pfeifer@gmail.com" target="_blank">craig.pfeifer@gmail.com</a></div>

</div></div>


<br></blockquote></div><br></div>