[Corpora-List] Extracting text from Wikipedia articles
Irina Temnikova
irina.temnikova at gmail.com
Fri Aug 27 17:52:03 UTC 2010
Dear CORPORA mailing list members,
Do any of you know of any tool for extracting text specifically from
Wikipedia articles, besides those for extracting text from HTML pages?
I only need the title and the text, without any of the formal elements
present in every Wikipedia article (such as "From Wikipedia, the free
encyclopedia", "This article is about ...", [edit], the list of
languages, "Main article:", "Categories:") and without "Contents", "See also",
"References", "Notes" and "External links".
Can you give me any suggestions?
Thank you very much in advance,
Irina
Irina Temnikova
PhD Student in Computational Linguistics
Editorial Assistant for the Journal of Natural Language Engineering
Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton, UK
--
If you want to build a ship, don't drum up the men to gather wood, divide
the work and give orders. Instead, teach them to yearn for the vast and
endless sea. (Antoine de Saint-Exupery)