[Corpora-List] Extract plain text from Wikipedia dump XML format

Motaz SAAD motaz.saad at inria.fr
Fri Jun 22 09:03:26 UTC 2012


Hello, 

You can search Google for the wiki2plaintext script. You can find it in Perl and Python.
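
If you would rather see the idea than hunt for a script, here is a minimal Python sketch of the same kind of conversion. It is not the wiki2plaintext script itself and not part of wikiprep; the function names `strip_markup` and `iter_articles` are my own, and the regular expressions only cover the most common markup (templates, links, refs, bold/italic quotes, headings). Tables, nested file captions and other constructs are not handled, so treat it as a starting point only.

```python
import re
import xml.etree.ElementTree as ET


def strip_markup(wikitext):
    """Rough wikitext-to-plain-text conversion (illustrative, not complete)."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)   # HTML comments
    text = re.sub(r"<ref[^>]*?/>", "", text)                      # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Remove templates from the inside out so nested {{...}} disappear too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop file, image and category links entirely (simple, non-nested case).
    text = re.sub(r"\[\[(?:File|Image|Category):[^\[\]]*\]\]", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # [http://example.org label] -> label
    text = re.sub(r"\[https?://\S+\s+([^\]]*)\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                             # bold / italic quotes
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.MULTILINE)
    text = re.sub(r"<[^>]+>", "", text)                           # leftover HTML tags
    return text.strip()


def iter_articles(xml_source):
    """Yield (title, plain_text) pairs from a dump, one page at a time.

    xml_source may be a file name or a binary file object.
    """
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        # Dump elements carry an XML namespace; keep only the local tag name.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        title = text = None
        for child in elem.iter():
            name = child.tag.rsplit("}", 1)[-1]
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text
        yield title, strip_markup(text or "")
        elem.clear()  # release the page content once it has been processed
```

To use it, loop over `iter_articles("dump.xml")`, where `dump.xml` stands for whatever your dump file is called. If the dump is still bz2-compressed, passing `bz2.open(path)` from the standard library instead of a file name should work as well, since `iterparse` accepts a file object.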

Best,
Motaz

----- Original Message -----

> From: "Rahma Sellami" <rahma.sellami at gmail.com>
> To: corpora at uib.no
> Sent: Wednesday, June 20, 2012 7:46:05 PM
> Subject: [Corpora-List] Extract plain text from Wikipedia dump XML
> format

> Hello,

> I downloaded a Wikipedia dump in XML format, and I want to remove the
> Wikipedia markup to extract the plain text.
> I found the tool wikiprep and installed it, but I do not know which
> script removes the Wikipedia markup.

> Thanks --

> RAHMA Sellami

> PhD Computer Science Student
> http://sites.google.com/site/rahmasellami/

> Faculty of Economic Sciences and management of Sfax
> ANLP Research Group
> http://sites.google.com/site/anlprg

> MIRACL Laboratory
> www.miracl.rnu.tn

> Email: rahma.sellami at gmail.com

> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora