<div class="gmail_quote">Hi liling,<br><br>There are a few surveys of Wikipedia for NLP, but many of the applications have not focussed on the full text as a corpus, and I know of no survey that specifically discusses that use. However, some of my work is relevant:<br>


<br>My honours research sought to use Wikipedia data as training data for Named Entity Recognition (see references at <a href="http://schwa.org/projects/resources/wiki/Wikiner" target="_blank">http://schwa.org/projects/resources/wiki/Wikiner</a>).<br>


<br>Comparing the results of Wikipedia-trained models with those of models trained on news corpora uncovered differences in the text and entity distributions. We discuss these in our <a href="http://www.aclweb.org/anthology-new/E/E09/E09-1070.pdf" target="_blank">EACL 2009 paper</a> (with more data in the <a href="http://www.joelnothman.com/downloads/honsthesis.pdf" target="_blank">thesis</a>) and, from a different perspective, when evaluating NER models on Wikipedia text (<a href="http://www.joelnothman.com/downloads/PeoplesWeb02.pdf" target="_blank">PeoplesWeb 2009</a>).<div>


<br></div><div>Apart from obvious genre features and topic distribution issues (e.g. a long, heavy tail of localities, albums, minor personalities; lots of mathematics), the most notable issues for our work were:</div><div>


<ul><li>a near-absence of abbreviations (and honorifics like "Mr");</li><li>large quantities of boilerplate text, especially for localities, often manually modified.</li></ul><div>I hope that helps get you started.</div>


<div><br></div><div>Cheers,</div><div><br></div><div>Joel</div><div><div class="h5"><br>On Wed, Mar 13, 2013 at 12:26 PM, liling tan &lt;<a href="mailto:alvations@gmail.com" target="_blank">alvations@gmail.com</a>&gt; wrote:<br>

> Dear all,<br>

><br>> Wikipedia dumps have been a popular source of texts for NLP due to their<br>

> availability and sheer size.<br>><br>> I would like to ask whether anyone has conducted a quantitative or qualitative<br>> survey on <br>><br>> how useful these dumps are to NLP and <br>> what issues surface when using Wikipedia dumps as corpora.<br>


><br>><br>> Regards,<br>> liling<br>><br></div></div></div>

</div><br>