Hello,<div><br></div><div>On the Corpora list, there is another post on a similar topic. You can find it here:</div><div><a href="http://mailman.uib.no/public/corpora/2010-September/011285.html" target="_blank">http://mailman.uib.no/public/corpora/2010-September/011285.html</a></div>
<div><br></div><div>I am working on a Wikipedia dump and found that the following tool is also suitable:</div><div><a href="http://code.google.com/p/wikixmlj/" target="_blank">code.google.com/p/wikixmlj/</a></div><div><br></div>
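<div>To illustrate the general idea of getting plain text out of the dump, here is a minimal Python sketch. It is not how wikixmlj or wikiprep work internally; it only assumes the usual MediaWiki export layout (page, title, revision, text elements). The function names <code>iter_pages</code>, <code>strip_wiki_markup</code> and <code>local_name</code> are hypothetical, and the cleanup handles only templates, simple links and bold/italic quotes, so references, tables and file links would still need a dedicated parser.</div>

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def strip_wiki_markup(text):
    """Very rough wikitext cleanup (an illustrative assumption, not a full parser)."""
    # Remove templates such as {{Infobox ...}}, innermost first so nesting works.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # [[target|label]] becomes label, [[target]] becomes target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Drop the quote runs used for bold and italic.
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def iter_pages(source):
    """Yield (title, plain_text) pairs from a dump given as a path or file object."""
    title, text = None, ""
    # Stream the file instead of loading it whole; dumps are very large.
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, strip_wiki_markup(text)
            title, text = None, ""
            elem.clear()  # release the children of the finished page
```

<div>Calling <code>iter_pages</code> with the path to the uncompressed dump and looping over the result gives one title and one roughly cleaned text per page.</div>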
<div>Regards,</div><div><span style><br></span></div><div><span style>Nasrin Baratalipour,</span></div><div><span style>Natural Language and text Processing Laboratory (</span><a href="http://ece.ut.ac.ir/NLP" target="_blank" style>http://ece.ut.ac.ir/NLP</a><span style>),</span></div>
<div><span style>School of Electrical and Computer Engineering,</span></div><div><span style>College of Engineering, University of Tehran, Tehran, Iran</span></div><div><br></div><div><br><div class="gmail_quote">On Wed, Jun 20, 2012 at 10:16 PM, Rahma Sellami <span dir="ltr"><<a href="mailto:rahma.sellami@gmail.com" target="_blank">rahma.sellami@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<div><br><div><div>I downloaded a Wikipedia dump in XML format, and I want to remove the Wikipedia markup to extract the plain text.</div>
<div>I found the tool <font face="georgia, serif">wikiprep </font>and installed it, but I do not know which script removes the Wikipedia markup.</div>
<div><br></div><div>Thanks</div>-- <br><div dir="ltr" style="text-align:left"><span></span><span></span></div><div dir="ltr"><br></div><div dir="ltr">RAHMA Sellami<br><div style="text-align:left"><span style="font-family:arial,helvetica,sans-serif;border-collapse:collapse">PhD Computer Science Student</span></div>
<div><font face="arial, helvetica, sans-serif"><span style="border-collapse:collapse"><a href="http://sites.google.com/site/rahmasellami/" target="_blank">http://sites.google.com/site/rahmasellami/</a></span></font></div>
<div><font face="arial, helvetica, sans-serif"><span style="border-collapse:collapse"><a href="http://sites.google.com/site/rahmasellami/" target="_blank"></a><br></span></font>Faculty of Economic Sciences and management of Sfax<br>
ANLP Research Group<br><a href="http://sites.google.com/site/anlprg" target="_blank">http://sites.google.com/site/anlprg</a><br><br>MIRACL Laboratory<br><a href="http://www.miracl.rnu.tn" target="_blank">www.miracl.rnu.tn</a><br>
<br>Email: <a href="mailto:rahma.sellami@gmail.com" target="_blank">rahma.sellami@gmail.com</a></div></div><br>
</div></div>
<br>_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><br></div>