[Corpora-List] Extracting text from Wikipedia articles

Trevor Jenkins trevor.jenkins at suneidesis.com
Fri Aug 27 18:10:48 UTC 2010


On Fri, 27 Aug 2010, Irina Temnikova <irina.temnikova at gmail.com> wrote:

> Do any of you know of any tool for extracting text specifically from
> Wikipedia articles, besides those for extracting text from HTML pages?
>
> I only need the title and the text, without any of the formal elements
> present in every Wikipedia article (such as "From Wikipedia, the free
> encyclopedia", "This article is about ..", [edit], the list of
> languages, "Main article:", "Categories:") and without "Contents", "See also",
> "References", "Notes" and "External links".

Your requirements are rather specific. But as (the English-language)
Wikipedia uses a consistent markup scheme in which those formal elements are
named (either by explicit id or by implicit class names in attributes), you
might be able to strip out just the textual content by running an XSLT
stylesheet processor over the downloaded files and deleting the junk you
don't want.
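
To make the idea concrete, here is a rough sketch of the same filtering step.
It is not an XSLT stylesheet: it swaps in Python's standard html.parser
module, which needs no extra software, but applies the same principle of
dropping elements by id or class. The id and class names below (firstHeading,
siteSub, toc, catlinks, p-lang, editsection, dablink) are assumptions about
the page markup and should be checked against the files you actually
downloaded. The names ArticleText and extract are made up for illustration.
The sketch also assumes well-formed markup in which every non-void element is
explicitly closed.

```python
from html.parser import HTMLParser

# Assumed names; verify them against your own downloaded pages.
SKIP_IDS = {"siteSub", "toc", "catlinks", "p-lang"}
SKIP_CLASSES = {"editsection", "dablink"}
SKIP_TAGS = {"head", "script", "style"}
# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ArticleText(HTMLParser):
    """Collect the title and body text, skipping unwanted elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0      # > 0 while inside an unwanted element
        self.in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a skipped one
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if (tag in SKIP_TAGS or attrs.get("id") in SKIP_IDS
                or classes & SKIP_CLASSES):
            self.skip_depth = 1
        elif attrs.get("id") == "firstHeading":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        if self.skip_depth:
            return
        target = self.title_parts if self.in_title else self.text_parts
        target.append(data)


def extract(html):
    """Return (title, text) with whitespace runs collapsed to single spaces."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join("".join(parser.text_parts).split())
    return title, text
```

Calling extract() on the contents of one downloaded page gives back the title
and the remaining text as two strings. Dropping whole sections such as "See
also" or "References" is not covered here, since that means discarding
everything from a given heading up to the next one; the set of names to skip
would also need extending for your own data.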

Regards, Trevor

<>< Re: deemed!



_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora