[Corpora-List] Extracting text from Wikipedia articles

Irina Temnikova irina.temnikova at gmail.com
Fri Aug 27 17:52:03 UTC 2010


Dear CORPORA mailing list members,

Does anyone know of a tool for extracting text specifically from Wikipedia
articles, other than the general-purpose tools for extracting text from
HTML pages?

I only need the title and the text, without any of the standard elements
present in every Wikipedia article (such as "From Wikipedia, the free
encyclopedia", "This article is about ...", [edit], the list of
languages, "Main article:", "Categories:") and without the "Contents",
"See also", "References", "Notes" and "External links" sections.
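
To make the requirement above concrete, here is a minimal sketch of the
kind of filtering meant, written in Python. It is not an existing tool, and
everything in it is an assumption made for illustration: the function name
extract_title_and_text, the idea that the input is the plain-text lines of
one article with the title on the first non-empty line, and the idea that
every section heading ends with "[edit]". It does not attempt to remove the
list of languages.

```python
import re

# Phrases taken from the request above; the matching rules are assumptions.
DROP_PREFIXES = (
    "From Wikipedia, the free encyclopedia",
    "This article is about",
    "Main article:",
    "Categories:",
)
SKIP_SECTIONS = {"See also", "References", "Notes", "External links"}
EDIT_MARK = "[edit]"
# Assumed shape of a table-of-contents entry, e.g. "1 History", "2.1 Early life".
TOC_ENTRY = re.compile(r"^\d+(\.\d+)*\s+\S")


def extract_title_and_text(lines):
    """Return (title, body_lines) for the plain-text lines of one article.

    Assumes the first non-empty line is the title and that every section
    heading ends with "[edit]" (hypothetical input conventions).
    """
    title = None
    body = []
    skipping = False  # True while inside a section listed in SKIP_SECTIONS
    in_toc = False    # True while inside the "Contents" box
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if title is None:
            title = line
            continue
        if line == "Contents":
            in_toc = True
            continue
        if in_toc and TOC_ENTRY.match(line):
            continue
        in_toc = False  # first non-entry line ends the "Contents" box
        if line.endswith(EDIT_MARK):
            heading = line[: -len(EDIT_MARK)].strip()
            skipping = heading in SKIP_SECTIONS
            if not skipping:
                body.append(heading)  # keep the heading, drop the marker
            continue
        if skipping or line.startswith(DROP_PREFIXES):
            continue
        body.append(line)
    return title, body
```

Called on the lines of one article, this returns the title and a list of
the remaining text lines; headings of the sections that are kept stay in the
list, without the [edit] marker.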

Can you give me any suggestions?

Thank you very much in advance,

Irina

Irina Temnikova

PhD Student in Computational Linguistics
Editorial Assistant for the Journal of Natural Language Engineering
Research Group in Computational Linguistics

Research Institute of Information and Language Processing
University of Wolverhampton, UK


-- 
If you want to build a ship, don't drum up the men to gather wood, divide
the work and give orders. Instead, teach them to yearn for the vast and
endless sea. (Antoine de Saint-Exupery)
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora