[Corpora-List] Extracting text from Wikipedia articles

Sven Hartrumpf Sven.Hartrumpf at FernUni-Hagen.de
Sat Aug 28 18:47:59 UTC 2010


Hi all.

Fri, 27 Aug 2010 18:52:03 +0100, irina.temnikova wrote:
> Do any of you know of any tool for extracting text specifically from Wikipedia articles,
> besides those for extracting text from HTML pages?

We built a tool for this, with the additional requirement that headings and
paragraph starts remain marked up in the output. So far we have tested it only
on the German Wikipedia (dewiki-20100603-pages-articles.xml); sample results
can be seen here:

http://ki220.fernuni-hagen.de/wikipedia/de/20100603/
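
For readers who want to try something similar, here is a minimal sketch of the
general idea: stream the pages out of a pages-articles XML dump and reduce the
wikitext to plain text while keeping headings and paragraph starts marked. It
is not the tool described above. The function names (iter_articles, mark_up,
strip_inline), the <hN> and <p> markers, and the small subset of wiki markup
handled (headings, internal links, bold/italic quotes) are all assumptions made
for illustration. Templates, tables, references and the like are left
untouched.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helpers for illustration; only a small subset of wikitext is handled.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")      # == Title ==
LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")    # [[target|label]] or [[target]]
QUOTES = re.compile(r"'{2,}")                            # ''italic'' / '''bold'''


def strip_inline(text):
    """Replace internal links by their label and drop bold/italic quotes."""
    text = LINK.sub(r"\1", text)
    return QUOTES.sub("", text)


def mark_up(wikitext):
    """Return plain text with headings as <hN>...</hN> and paragraph starts as <p>."""
    out = []
    at_paragraph_start = True
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current paragraph.
            at_paragraph_start = True
            continue
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            out.append(f"<h{level}>{strip_inline(m.group(2))}</h{level}>")
            at_paragraph_start = True
        elif at_paragraph_start:
            out.append("<p>" + strip_inline(line))
            at_paragraph_start = False
        else:
            out.append(strip_inline(line))
    return "\n".join(out)


def iter_articles(dump):
    """Yield (title, wikitext) pairs from a dump file path or file object."""
    title = None
    for _event, elem in ET.iterparse(dump):
        tag = elem.tag.rsplit("}", 1)[-1]  # ignore the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            yield title, elem.text or ""
        elif tag == "page":
            elem.clear()  # drop the finished page's content to limit memory use
```

Usage would be along the lines of
`for title, text in iter_articles("dewiki-20100603-pages-articles.xml"): print(mark_up(text))`.
Streaming with iterparse avoids loading the whole dump into memory at once.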

Greetings
Sven
