[Corpora-List] Extracting text from Wikipedia articles

Torsten Zesch zesch at tk.informatik.tu-darmstadt.de
Wed Sep 1 13:22:19 UTC 2010


Hi Irina,

The Java Wikipedia Library (JWPL) includes a parser for MediaWiki syntax that lets you, among other things, access the plain text of a Wikipedia article:
http://www.ukp.tu-darmstadt.de/software/jwpl/
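
For illustration only, here is a minimal Python sketch of the kind of section filtering described in the question quoted below. It is not JWPL and does not use its API; it assumes the input is raw MediaWiki markup in which sections are introduced by heading lines such as "== See also ==". The names SKIP_SECTIONS, HEADING and strip_sections are hypothetical, and the list of section titles is taken from the question itself. It only drops whole sections; templates, links and other markup would still need a real parser.

```python
import re

# Section titles to drop, taken from the list in the original question (assumption:
# titles are compared case-insensitively against the heading text).
SKIP_SECTIONS = {"contents", "see also", "references", "notes", "external links"}

# Matches a MediaWiki heading line such as "== See also ==" or "=== Sub ===".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def strip_sections(wikitext):
    """Return wikitext with the unwanted sections and their subsections removed."""
    kept = []
    skip_level = None  # heading level of the section currently being skipped
    for line in wikitext.splitlines():
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            title = m.group(2).strip().lower()
            if skip_level is not None and level > skip_level:
                continue  # subsection nested inside a skipped section
            # A heading at the same or a higher level ends any skipped section.
            skip_level = level if title in SKIP_SECTIONS else None
            if skip_level is not None:
                continue  # drop the heading of an unwanted section
        elif skip_level is not None:
            continue  # body line of an unwanted section
        kept.append(line)
    return "\n".join(kept)
```

Calling strip_sections on the markup of a single article returns the remaining text with headings such as "== History ==" left in place, so the article title would have to be taken from the page metadata rather than from the markup.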

-Torsten

From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Irina Temnikova
Sent: Friday, 27 August 2010 19:52
To: corpora at uib.no
Subject: [Corpora-List] Extracting text from Wikipedia articles

Dear CORPORA mailing list members,

Do any of you know of any tool for extracting text specifically from Wikipedia articles, besides those for extracting text from HTML pages?

I only need the title and the text, without any of the formal elements present in every Wikipedia article (such as "From Wikipedia, the free encyclopedia", "This article is about ...", [edit], the list of languages, "Main article:", "Categories:") and without "Contents", "See also", "References", "Notes" and "External links".

Can you give me any suggestions?

Thank you very much in advance,

Irina



Irina Temnikova
PhD Student in Computational Linguistics
Editorial Assistant for the Journal of Natural Language Engineering
Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton, UK

--
If you want to build a ship, don't drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea. (Antoine de Saint-Exupery)