[Corpora-List] Fwd: Qualitative / Quantitative survey of Wikipedia dumps as Corpora

Wed Mar 13 05:07:57 UTC 2013

Hi liling,

There are a few surveys of Wikipedia for NLP, but many of the applications
have not focussed on the full text as a corpus, and I know of no survey
that specifically discusses its use as one. However, some of my work is
relevant:

My honours research sought to use Wikipedia data as training data for Named
Entity Recognition (see references at
http://schwa.org/projects/resources/wiki/Wikiner).

Comparing our results on Wikipedia-trained models to those trained on news
corpora uncovered differences in the text and entity distributions, which
we discuss in our EACL 2009 paper
<http://www.aclweb.org/anthology-new/E/E09/E09-1070.pdf> (and with more data
in the thesis <http://www.joelnothman.com/downloads/honsthesis.pdf>), and
from a different perspective when evaluating NER models on Wikipedia text
(PeoplesWeb 2009 <http://www.joelnothman.com/downloads/PeoplesWeb02.pdf>).

Apart from obvious genre features and topic distribution issues (e.g. a
long, heavy tail of localities, albums, minor personalities; lots of
mathematics), the most notable issues for our work were:

   - a near-absence of abbreviations (and honorifics like "Mr");
   - large quantities of boilerplate text, especially for localities, often
   manually modified.
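
If you want a quick check of the abbreviation point on your own samples, a
rough sketch along these lines may help. It is not the method from the
papers above: the abbreviation list, the function name and the
whitespace tokenisation are all assumptions made for illustration, and the
two sample strings are invented.

```python
import re
from collections import Counter

# Hypothetical list of honorifics and abbreviations (an assumption for this
# sketch; extend it to suit the corpus being compared).
ABBREVIATIONS = ("Mr", "Mrs", "Ms", "Dr", "Prof", "Inc", "Corp", "Ltd")

# Match a listed item as a whole word, with an optional trailing full stop.
_PATTERN = re.compile(r"\b(" + "|".join(ABBREVIATIONS) + r")\b\.?")


def abbreviation_rate(text):
    """Return (occurrences per 1,000 whitespace tokens, per-item counts)."""
    n_tokens = len(text.split())
    counts = Counter(m.group(1) for m in _PATTERN.finditer(text))
    rate = 1000.0 * sum(counts.values()) / n_tokens if n_tokens else 0.0
    return rate, counts


if __name__ == "__main__":
    # Invented sample sentences, standing in for real corpus samples.
    news_like = "Mr. Smith met Dr Jones at Acme Corp. on Monday."
    wiki_like = "John Smith is an Australian politician who met Jones in Sydney."
    for name, sample in (("news-like", news_like), ("wiki-like", wiki_like)):
        rate, counts = abbreviation_rate(sample)
        print(name, round(rate, 1), dict(counts))
```

Running the same function over comparably sized samples of news text and
Wikipedia text should give a crude picture of the gap; a proper comparison
would need a real tokeniser and a fuller abbreviation list.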

I hope that helps get you started.

Cheers,

Joel

On Wed, Mar 13, 2013 at 12:26 PM, liling tan <alvations at gmail.com> wrote:
> Dear all,
>
> Wikipedia dumps have been popular source of texts for NLP due to its
> availability and the sheer size.
>
> I would like to ask whether anyone had conducted quantitative or
> qualitative survey on
>
> how useful are these dumps to NLP and
> what are the issues that will surface when using wikipedia dumps as
> corpora.
>
>
> Regards,
> liling
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>