[Corpora-List] Extracting text from Wikipedia articles
Roman Klinger
roman.klinger at scai.fraunhofer.de
Fri Aug 27 18:15:03 UTC 2010
Hi Irina ;-),
On 08/27/2010 07:52 PM, Irina Temnikova wrote:
> Dear CORPORA mailing list members,
>
> Do any of you know of any tool for extracting text specifically from
> Wikipedia articles, besides those for extracting text from HTML pages?
>
> I only need the title and the text, without any of the formal elements
> present in every Wikipedia article (such as "From Wikipedia, the free
> encyclopedia", "This article is about ..", [edit], the list of
> languages, "Main article:", "Categories:") and without "Contents", "See
> also", "References", "Notes" and "External links".
>
> Can you give me any suggestions?
Users can add arbitrary HTML code to Wikipedia articles. If you want to
interpret that (to get the plain text), you could use the text-based web
browser lynx, which can dump a rendered page to a text file. That works
quite well, but it is an HTML extraction method, which you excluded.
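
A rough sketch of the lynx route, in Python, in case it helps. The lynx
options -dump (print the rendered page) and -nolist (omit the numbered
link list) are real. Everything else is an assumption for illustration:
the function names, the filter lists (taken from the elements named in
the question), and the guess that headings show up in the dump as lines
such as "See also[edit]". The real dump layout may differ, so treat the
filter as a starting point only.

```python
import shutil
import subprocess

# Headings after which the rest of the page is dropped (from the question).
STOP_HEADINGS = {"See also", "References", "Notes", "External links"}
# Line prefixes treated as Wikipedia boilerplate (from the question).
SKIP_PREFIXES = (
    "From Wikipedia, the free encyclopedia",
    "This article is about",
    "Main article:",
    "Categories:",
)


def lynx_command(url):
    """Build the lynx call: -dump renders to text, -nolist drops the link list."""
    return ["lynx", "-dump", "-nolist", url]


def dump_page(url):
    """Return the rendered text of a page; requires lynx on the PATH."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed")
    result = subprocess.run(
        lynx_command(url), capture_output=True, text=True, check=True
    )
    return result.stdout


def strip_boilerplate(dump):
    """Heuristic filter (an assumption, not a tested Wikipedia cleaner)."""
    kept = []
    for line in dump.splitlines():
        cleaned = line.replace("[edit]", "").rstrip()
        text = cleaned.strip()
        if text in STOP_HEADINGS:
            break  # drop this heading and everything after it
        if text == "Contents" or text.startswith(SKIP_PREFIXES):
            continue  # drop a single boilerplate line
        kept.append(cleaned)
    return "\n".join(kept)
```

Usage would be something like strip_boilerplate(dump_page(url)) for each
article URL. Note that this only drops the "Contents" heading line, not
the table-of-contents entries below it.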
Another approach, which a colleague pointed me to and told me works (I
have not tried it myself), is described here:
http://evanjones.ca/software/wikipedia2text.html
Best,
Roman
--
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora