[Corpora-List] Extracting text from Wikipedia articles
Sven Hartrumpf
Sven.Hartrumpf at FernUni-Hagen.de
Sat Aug 28 18:47:59 UTC 2010
Hi all.
Fri, 27 Aug 2010 18:52:03 +0100, irina.temnikova wrote:
> Do any of you know of any tool for extracting text specifically from Wikipedia articles,
> besides those for extracting text from HTML pages?
We built a tool for this, with the additional requirement that headings and
paragraph starts remain marked up in the output. So far we have tested it only
on the German Wikipedia (dewiki-20100603-pages-articles.xml); sample results
can be seen here:
http://ki220.fernuni-hagen.de/wikipedia/de/20100603/
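For readers who want a feel for what "text with headings and paragraph starts still marked" might look like, here is a rough Python sketch. It is not our tool and makes no claim to match its output format: the function names (strip_inline, extract), the <hN>/<p> markers, and the small set of wikitext constructs handled are all assumptions chosen for illustration. It works on the wikitext of a single article and ignores tables, references, nested templates, and everything else that makes real dumps hard.

```python
import re

# Wikitext heading line, e.g. "== Geschichte ==" (2 to 6 equals signs on each side).
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")


def strip_inline(text):
    """Remove a few common inline wikitext constructs (far from complete)."""
    # Drop simple, non-nested templates such as {{Hauptartikel|X}}.
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def extract(wikitext):
    """Return a list of output lines; headings become <hN>...</hN> and each
    paragraph start is marked with <p> (hypothetical markers, assumed here)."""
    out = []
    new_paragraph = True
    for line in wikitext.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current paragraph.
            new_paragraph = True
            continue
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            out.append(f"<h{level}>{strip_inline(m.group(2))}</h{level}>")
            new_paragraph = True
            continue
        text = strip_inline(line)
        if not text:
            # The line held only markup that was stripped away.
            continue
        if new_paragraph:
            out.append("<p>" + text)
            new_paragraph = False
        else:
            # Continuation of the current paragraph.
            out[-1] += " " + text
    return out


if __name__ == "__main__":
    sample = (
        "'''Hagen''' ist eine [[Stadt]] in [[Nordrhein-Westfalen|NRW]].\n"
        "Zweite Zeile.\n"
        "\n"
        "== Geschichte ==\n"
        "{{Hauptartikel|X}}\n"
        "Text hier.\n"
    )
    for row in extract(sample):
        print(row)
```

To apply something like this to a full dump, one would still have to iterate over the page elements of the XML file and pass each article's wikitext to extract; that part is left out here.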
Greetings
Sven
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora