[Corpora-List] Extract plain text from Wikipedia dump XML format

Nasrin Baratali nasrin.baratali at gmail.com
Fri Jun 22 12:05:08 UTC 2012


Hello,

There is an earlier post on the Corpora List on a similar topic. You can find
it here:
http://mailman.uib.no/public/corpora/2010-September/011285.html

I am working with a Wikipedia dump and have found that the following tool is
also suitable:
code.google.com/p/wikixmlj/
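
As a rough illustration of the task itself (this is not wikixmlj, and the
helper names iter_pages and strip_markup are made up for this sketch), the
dump can be streamed page by page with Python's standard library and the
markup removed with a few regular expressions. The sample XML and its
namespace string are placeholders, and the regexes only cover the most common
constructs (references, templates, links, bold/italic, headings); tables and
other wikitext would need a proper parser.

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that the export format puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a dump, one page at a time."""
    title, text = None, ""
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to save memory


def strip_markup(wikitext):
    """Very rough wikitext-to-text conversion (assumption: simple pages)."""
    text = re.sub(r"<ref[^>]*/>", "", wikitext)  # self-closing references
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)
    text = re.sub(r"<[^>]+>", "", text)  # remaining HTML-like tags
    # Remove innermost {{templates}} repeatedly so nested ones go too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)  # headings
    return text.strip()


# Tiny made-up sample standing in for a real dump file.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <page>
    <title>Example</title>
    <revision>
      <text>'''Example''' is a [[test page|page]] about [[Wikipedia]].{{citation needed}}</text>
    </revision>
  </page>
</mediawiki>"""

for title, wikitext in iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(f"{title}: {strip_markup(wikitext)}")
# prints "Example: Example is a page about Wikipedia."
```

For a real dump, an open file (for example one returned by bz2.open on the
compressed download) can be passed to iter_pages in place of the BytesIO
object.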

Regards,

Nasrin Baratalipour,
Natural Language and text Processing Laboratory (http://ece.ut.ac.ir/NLP),
School of Electrical and Computer Engineering,
College of Engineering, University of Tehran, Tehran, Iran


On Wed, Jun 20, 2012 at 10:16 PM, Rahma Sellami <rahma.sellami at gmail.com> wrote:

> Hello,
>
> I downloaded a Wikipedia dump in XML format, and I want to remove the
> Wikipedia markup to extract the plain text.
> I found the tool wikiprep and installed it, but I do not know which
> script removes the Wikipedia markup.
>
> Thanks
> --
>
> RAHMA Sellami
> PhD Computer Science Student
> http://sites.google.com/site/rahmasellami/
> Faculty of Economic Sciences and management of Sfax
> ANLP Research Group
> http://sites.google.com/site/anlprg
>
> MIRACL Laboratory
> www.miracl.rnu.tn
>
> Email: rahma.sellami at gmail.com
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>