[Corpora-List] Qualitative / Quantitative survey of Wikipedia dumps as Corpora

liling tan alvations at gmail.com
Fri Mar 22 01:38:29 UTC 2013


Hi Joel,

Thanks for the link to WikiNER. The comparison between Wikipedia and gold
corpora is really nice =)
It would be nice to see that comparison for other tasks too.

Regards,
liling

On Wed, Mar 13, 2013 at 1:06 PM, Joel Nothman
<jnothman at student.usyd.edu.au> wrote:

> Hi liling,
>
> There are a few surveys of Wikipedia for NLP, but many of the applications
> have not focussed on the full text as a corpus, and I know of no survey
> that particularly discusses that. However, some of my work is relevant:
>
> My honours research sought to use Wikipedia data as training for Named
> Entity Recognition (see references at
> http://schwa.org/projects/resources/wiki/Wikiner).
>
> Comparing our results on Wikipedia-trained models to those trained on news
> corpora uncovered differences in the text and entity distributions, which
> we discuss in our EACL 2009 paper
> <http://www.aclweb.org/anthology-new/E/E09/E09-1070.pdf> (and with more
> data in the thesis <http://www.joelnothman.com/downloads/honsthesis.pdf>),
> and from a different perspective when evaluating NER models on Wikipedia
> text (PeoplesWeb 2009
> <http://www.joelnothman.com/downloads/PeoplesWeb02.pdf>).
>
> Apart from obvious genre features and topic distribution issues (e.g. a
> long, heavy tail of localities, albums, minor personalities; lots of
> mathematics), the most notable issues for our work were:
>
>    - a near-absence of abbreviations (and honorifics like "Mr");
>    - large quantities of boilerplate text, especially for localities,
>    often manually modified.
>
> I hope that helps get you started.
>
> Cheers,
>
> Joel
>
> On Wed, Mar 13, 2013 at 12:26 PM, liling tan <alvations at gmail.com> wrote:
> > Dear all,
> >
> > Wikipedia dumps have been a popular source of texts for NLP due to their
> > availability and sheer size.
> >
> > I would like to ask whether anyone has conducted a quantitative or
> > qualitative survey on
> >
> > how useful these dumps are to NLP and
> > what issues surface when using Wikipedia dumps as corpora.
> >
> >
> > Regards,
> > liling
> >
> > _______________________________________________
> > UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> > Corpora mailing list
> > Corpora at uib.no
> > http://mailman.uib.no/listinfo/corpora
> >
>