[Corpora-List] Extracting text from Wikipedia articles
Torsten Zesch
zesch at tk.informatik.tu-darmstadt.de
Wed Sep 1 13:22:19 UTC 2010
Hi Irina,
The Java Wikipedia Library (JWPL) includes a parser for MediaWiki syntax that, among other things, gives you access to the plain text of a Wikipedia article:
http://www.ukp.tu-darmstadt.de/software/jwpl/
-Torsten
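For readers who want a feel for what "plain text from MediaWiki syntax" involves, here is a minimal sketch in Python. It is not JWPL's API (JWPL is a Java library with its own, far more complete parser); it is a standalone, regex-based approximation that assumes the input is raw wikitext rather than rendered HTML. The names wikitext_to_plain, strip_inline_markup and SKIPPED_SECTIONS are hypothetical, chosen for illustration only, and the skipped-section list simply mirrors the sections named in the question below. It does not handle nested templates, tables, or multi-line references.

```python
import re

# Sections to drop together with their bodies (assumption: taken from the
# list in the original question; adjust as needed).
SKIPPED_SECTIONS = {"see also", "references", "notes", "external links"}

# Matches wikitext headings such as "== Title ==" or "=== Title ===".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def strip_inline_markup(line: str) -> str:
    """Reduce one line of wikitext to plain text (simplified, no nesting)."""
    line = re.sub(r"<ref[^>]*/>", "", line)           # <ref name="x" />
    line = re.sub(r"<ref[^>]*>.*?</ref>", "", line)   # <ref>...</ref> on one line
    line = re.sub(r"\{\{[^{}]*\}\}", "", line)        # non-nested {{templates}}
    line = re.sub(r"\[\[(?:Category|File|Image):[^\]]*\]\]", "", line)
    # [[target|label]] -> label, [[target]] -> target
    line = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", line)
    # [http://url label] -> label
    line = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", line)
    line = re.sub(r"'{2,}", "", line)                 # ''italic'' / '''bold'''
    return re.sub(r"\s{2,}", " ", line).strip()


def wikitext_to_plain(wikitext: str) -> str:
    """Return body text only: no headings, no skipped sections, no markup."""
    kept = []
    skip_level = None  # heading level of the section currently being skipped
    for raw in wikitext.splitlines():
        match = HEADING.match(raw.strip())
        if match:
            level, title = len(match.group(1)), match.group(2)
            if skip_level is not None and level > skip_level:
                continue  # subsection of a skipped section: stay in skip mode
            skip_level = level if title.lower() in SKIPPED_SECTIONS else None
            continue  # heading lines themselves are not emitted
        if skip_level is not None:
            continue
        text = strip_inline_markup(raw)
        if text:
            kept.append(text)
    return "\n".join(kept)
```

Calling wikitext_to_plain(raw_wikitext) on the source of an article returns the remaining body lines joined by newlines. For real corpora a full parser such as the one in JWPL is the safer choice, since regular expressions cannot cope with nested templates or tables.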
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of Irina Temnikova
Sent: Friday, 27 August 2010 19:52
To: corpora at uib.no
Subject: [Corpora-List] Extracting text from Wikipedia articles
Dear CORPORA mailing list members,
Do any of you know of any tool for extracting text specifically from Wikipedia articles, besides those for extracting text from HTML pages?
I only need the title and the text, without any of the formal elements present in every Wikipedia article (such as "From Wikipedia, the free encyclopedia", "This article is about ...", [edit], the list of languages, "Main article:", "Categories:") and without "Contents", "See also", "References", "Notes" and "External links".
Can you give me any suggestions?
Thank you very much in advance,
Irina
Irina Temnikova
PhD Student in Computational Linguistics
Editorial Assistant for the Journal of Natural Language Engineering
Research Group in Computational Linguistics
Research Institute of Information and Language Processing
University of Wolverhampton, UK
--
If you want to build a ship, don't drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea. (Antoine de Saint-Exupery)
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora