[Corpora-List] Extracting text from Wikipedia articles

Roman Klinger roman.klinger at scai.fraunhofer.de
Fri Aug 27 18:15:03 UTC 2010


Hi Irina ;-),

On 08/27/2010 07:52 PM, Irina Temnikova wrote:
> Dear CORPORA mailing list members,
>
> Do any of you know of any tool for extracting text specifically from
> Wikipedia articles, besides those for extracting text from HTML pages?
>
> I only need the title and the text, without any of the formal elements
> present in every Wikipedia article (such as "From Wikipedia, the free
> encyclopedia", "This article is about ..", [edit], the list of
> languages,"Main article:","Categories:") and without "Contents", "See
> also", "References", "Notes" and "External links".
>
> Can you give me any suggestions?

Wikipedia editors can embed arbitrary HTML in an article. If you want 
that markup interpreted (to get the plain text), you could use the 
text-based web browser lynx, which can dump a rendered page to a text 
file. That works quite well, but it is an HTML extraction method, which 
you excluded.
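
A rough sketch of how the lynx route could look from Python, in case it 
helps. The function names and the filter lists are my own assumptions, 
built only from the elements named in the question; the exact layout of 
a lynx dump depends on the lynx version and the Wikipedia skin, so the 
filtering is a heuristic and not a tested recipe. It does not try to 
remove the "Contents" box or the list of languages.

```python
import shutil
import subprocess

# Headings after which everything is dropped (list taken from the question).
# Assumption: these sections only occur at the end of an article.
TRAILING_SECTIONS = {"See also", "References", "Notes", "External links"}

# Line prefixes treated as formal elements (also taken from the question).
BOILERPLATE_PREFIXES = (
    "From Wikipedia, the free encyclopedia",
    "This article is about",
    "Main article:",
    "Categories:",
)


def dump_with_lynx(url):
    """Render `url` with lynx and return the dumped plain text.

    Uses lynx's -dump option (write the rendered page to stdout) and
    -nolist (suppress the list of links appended to the dump).
    """
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx does not seem to be installed")
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", url],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def strip_boilerplate(dump):
    """Heuristically drop formal elements from a dumped article."""
    kept = []
    for line in dump.splitlines():
        # Assumption: edit links show up as the literal string "[edit]".
        text = line.strip().replace("[edit]", "").strip()
        if text in TRAILING_SECTIONS:
            break  # discard this heading and everything after it
        if text.startswith(BOILERPLATE_PREFIXES):
            continue
        kept.append(text)
    return "\n".join(kept).strip()
```

Something like strip_boilerplate(dump_with_lynx(article_url)) would then 
give the title and body text, where article_url is whatever article you 
want to process. A body line that consists only of the word "Notes" 
would cut the text short, so the result is worth checking by hand.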

Another approach, which a colleague pointed me to and told me works (I 
have not tried it myself), is described here:
http://evanjones.ca/software/wikipedia2text.html

Best,
  Roman

-- 
Roman Klinger
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Department of Bioinformatics
Schloss Birlinghoven
D-53754 Sankt Augustin
Tel.: +49-2241-14-2360
Fax.: +49-2241-14-4-2360
email: roman.klinger at scai.fhg.de
http://www.scai.fraunhofer.de/klinger.html

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


