[Corpora-List] Extract plain text from Wikipedia dump XML format

Rahma Sellami rahma.sellami at gmail.com
Wed Jun 20 17:46:05 UTC 2012


Hello,

I downloaded a Wikipedia dump in XML format, and I want to remove the
Wikipedia markup to extract the plain text.
I found the tool wikiprep and installed it, but I do not know which of its
scripts removes the Wikipedia markup.
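
To make the goal concrete, here is a rough sketch in Python of the kind of
processing I am after. It is not how wikiprep works. It assumes the usual
MediaWiki export layout (<page> elements containing <title> and <text>), the
names local_name, strip_markup and iter_pages are made up for illustration,
and the regular expressions only cover simple markup (no nested templates,
no tables).

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def strip_markup(wikitext):
    """Rough wikitext cleanup; nested templates and tables are NOT handled."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)  # HTML comments
    text = re.sub(r"<ref[^>]*/>", "", text)                      # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)  # paired refs
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                   # non-nested templates
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)                            # bold/italic quotes
    # == Heading == -> Heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    return text.strip()


def iter_pages(source):
    """Yield (title, plain_text) per <page> from a dump path or binary file object."""
    title, body = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            body = elem.text or ""
        elif name == "page":
            yield title, strip_markup(body)
            title, body = None, ""
            elem.clear()  # release the finished page; dumps are large
```

Something like "for title, text in iter_pages(path_to_dump): ..." would then
give one plain-text article at a time, where path_to_dump stands for the
uncompressed XML file.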

Thanks
-- 

RAHMA Sellami
PhD Computer Science Student
http://sites.google.com/site/rahmasellami/
Faculty of Economic Sciences and Management of Sfax
ANLP Research Group
http://sites.google.com/site/anlprg

MIRACL Laboratory
www.miracl.rnu.tn

Email: rahma.sellami at gmail.com