Hi Joel,

Thanks for the link to WikiNER. The comparison between the Wikipedia and gold-standard corpora is really nice =)
It would be nice to see that comparison for other tasks too.

Regards,
liling

On Wed, Mar 13, 2013 at 1:06 PM, Joel Nothman <jnothman@student.usyd.edu.au> wrote:
Hi liling,

There are a few surveys of Wikipedia for NLP, but many of the applications have not focussed on the full text as a corpus, and I know of no survey that particularly discusses that. However, some of my work is relevant:
My honours research sought to use Wikipedia data as training for Named Entity Recognition (see references at http://schwa.org/projects/resources/wiki/Wikiner).
Comparing our results on Wikipedia-trained models to those trained on news corpora uncovered differences in the text and entity distributions, which we discuss in our EACL 2009 paper (http://www.aclweb.org/anthology-new/E/E09/E09-1070.pdf), and with more data in the thesis (http://www.joelnothman.com/downloads/honsthesis.pdf), and from a different perspective when evaluating NER models on Wikipedia text (PeoplesWeb 2009, http://www.joelnothman.com/downloads/PeoplesWeb02.pdf).
Apart from obvious genre features and topic distribution issues (e.g. a long, heavy tail of localities, albums, minor personalities; lots of mathematics), the most notable issues for our work were:
- a near-absence of abbreviations (and honorifics like "Mr"); a quick way to check this is sketched below;
- large quantities of boilerplate text, especially for localities, often manually modified.
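As a minimal sketch of that check (assuming you have already extracted plain text from a dump into a hypothetical one-sentence-per-line file, wiki_text.txt):

    from collections import Counter

    # Honorifics and abbreviations that are common in newswire
    # but comparatively rare in Wikipedia prose.
    MARKERS = {"Mr", "Mr.", "Mrs.", "Ms.", "Dr.", "Jr.", "Inc.", "Corp."}

    token_count = 0
    marker_counts = Counter()
    with open("wiki_text.txt", encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            token_count += len(tokens)
            marker_counts.update(t for t in tokens if t in MARKERS)

    # Report frequencies per million tokens so they can be compared
    # against a news corpus processed the same way.
    for marker, n in marker_counts.most_common():
        print(marker, round(n * 1e6 / token_count, 2), "per million tokens")

Running the same count over a newswire corpus such as the CoNLL-2003 training data should make the difference in distributions quite visible.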
I hope that helps get you started.

Cheers,

Joel

On Wed, Mar 13, 2013 at 12:26 PM, liling tan <alvations@gmail.com> wrote:
> Dear all,
>
> Wikipedia dumps have been a popular source of texts for NLP due to their
> availability and sheer size.
>
> I would like to ask whether anyone has conducted a quantitative or
> qualitative survey on:
>
> how useful these dumps are to NLP, and
> what issues surface when using Wikipedia dumps as corpora.
>
> Regards,
> liling