[Corpora-List] Extracting text from Wikipedia articles
Trevor Jenkins
trevor.jenkins at suneidesis.com
Fri Aug 27 18:10:48 UTC 2010
On Fri, 27 Aug 2010, Irina Temnikova <irina.temnikova at gmail.com> wrote:
> Do any of you know of any tool for extracting text specifically from
> Wikipedia articles, besides those for extracting text from HTML pages?
>
> I only need the title and the text, without any of the formal elements
> present in every Wikipedia article (such as "From Wikipedia, the free
> encyclopedia", "This article is about ..", [edit], the list of
> languages,"Main article:","Categories:") and without "Contents", "See also",
> "References", "Notes" and "External links".
Your requirements are rather specific. But as (the English-language)
Wikipedia uses a consistent markup scheme, with those formal elements named
(either by explicit id or by implicit class names in attributes), you might
be able to strip out just the textual content by running an XSLT stylesheet
processor over the downloaded files and deleting the junk you don't want.
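
As a rough illustration of the same idea (copy everything through, skip
the named junk), here is a minimal sketch in Python rather than XSLT,
using only the standard library. The id and class names (siteSub, toc,
catlinks, p-lang, editsection, dablink) are assumptions about the page
markup and should be checked against the files actually downloaded. It
also assumes the input parses as well-formed XML; pages that declare HTML
entities such as &nbsp; may need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Assumed markers for the formal elements; verify against real pages.
JUNK_IDS = {"siteSub", "toc", "catlinks", "p-lang"}
JUNK_CLASSES = {"editsection", "dablink"}


def is_junk(elem):
    """Return True if the element carries an assumed junk id or class."""
    if elem.get("id") in JUNK_IDS:
        return True
    classes = set(elem.get("class", "").split())
    return bool(classes & JUNK_CLASSES)


def collect_text(elem, parts):
    """Append the text of elem and its descendants, skipping junk subtrees."""
    if is_junk(elem):
        return
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        collect_text(child, parts)
        # Text after a child belongs to the parent, so keep it even
        # when the child itself was skipped as junk.
        if child.tail:
            parts.append(child.tail)


def extract_text(xhtml):
    """Parse a well-formed XHTML string and return its non-junk text."""
    root = ET.fromstring(xhtml)
    parts = []
    collect_text(root, parts)
    # Collapse runs of whitespace left behind by the removed elements.
    return " ".join("".join(parts).split())


# Hypothetical fragment in the style of an article page.
SAMPLE = """<div id="content">
<h1>Corpus linguistics</h1>
<div id="siteSub">From Wikipedia, the free encyclopedia</div>
<p>Corpus linguistics studies language <span class="editsection">[edit]</span> through samples.</p>
<div id="catlinks">Categories: Linguistics</div>
</div>"""

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

Dropping whole sections such as "See also" or "References" would need an
extra rule keyed on the heading text, since those are marked by their
headings rather than by a wrapper with its own id.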
Regards, Trevor
<>< Re: deemed!