<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div>I can address the Hindi part; I could have sworn I had done it for Perso-Arabic too, but can't now find evidence of that:</div><div><br></div><div>I think the standard thing to do would be to use IBM's ICU: <a href="http://www-01.ibm.com/software/globalization/icu/index.jsp">http://www-01.ibm.com/software/globalization/icu/index.jsp</a>; I see there are also Python bindings to ICU at <a href="http://pyicu.osafoundation.org/">http://pyicu.osafoundation.org/</a>. But when I had this task in 2003 I couldn't deal with writing code using that library, and the ICU 'uconv' binary (which I otherwise highly recommend, and which could be enough for your purposes) wasn't going to work either. So I wrote up a Perl system myself, and it is apparently still in use by the project I wrote it for (<a href="http://dsal.uchicago.edu/dictionaries/">http://dsal.uchicago.edu/dictionaries/</a>). It's pretty tailored to the particular needs of that project, so it may or may not be useful to you.</div><div><br></div><div>Those needs are, basically, to convert a bunch of dictionary data, a mixture of Indic headwords and runons with English text, where the Indic stuff is either in ISCII or in UTF-8. The system gives you the option of replacing any of the Indic stuff with a transliteration -- either a diacriticalized romanization in UTF-8, or that same roman but with SGML entities for the diacritics, or a dumbed-down, normalized 7-bit ASCII of the sort a user might try searching on (or the Indic itself in whichever of the UTF-8/ISCII encodings it was not initialized with). 
ISCII (<a href="http://varamozhi.sourceforge.net/iscii91.pdf">http://varamozhi.sourceforge.net/iscii91.pdf</a>) encodes ten different Indic scripts with a single character set -- for your purposes, Hindi = Devanagari.</div><div><br></div><div>The system, Obliterator, comes with the Philologic software distribution, at <a href="http://philologic.uchicago.edu/philologic3/distribution/philologic-v3.1.t2.tar.gz">http://philologic.uchicago.edu/philologic3/distribution/philologic-v3.1.t2.tar.gz</a> , in the subdirectory goodies/Obliterator-new. There are a couple of sample scripts and some embedded POD documentation. Philologic is licensed under the AGPL; Obliterator is under the Perl Artistic License (I think because it inherited the license of some code I adapted).</div><div><br></div><div>Whether it will be more or less useful than uconv depends on whether you're just encoding straight, running text; for mixed text the Python bindings to ICU may well serve you better now -- and will handle Arabic with the same level of effort, instead of requiring a whole, separate implementation. 
But if Obliterator is the kind of thing you like, it's the kind of thing you like.</div><div><br></div><div>(The clever name: "Ob-" because it uses objects, "-literator" because it transliterates (and transcodes)).</div><div><br></div><div>Yours,</div><div><br></div><div>Orion</div><div><br></div><div><br></div><div><div>On Jan 19, 2009, at 8:23 AM, WHITELOCK, Pete wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"> <div> <p><font face="Arial">Can anyone recommend software for Romanizing Hindi text and/or Arabic (with vowel marks)?</font> </p><p><span lang="en-gb"><font size="2" face="Arial">Pete Whitelock</font></span> <br><span lang="en-gb"><font size="2" face="Arial">Data and Resources Development Manager</font></span> <br><span lang="en-gb"><font size="2" face="Arial">Reference Department</font></span> <br><span lang="en-gb"><font size="2" face="Arial">Academic Division</font></span> <br><span lang="en-gb"><font size="2" face="Arial">Oxford University Press</font></span> </p> </div> _______________________________________________<br>Corpora mailing list<br><a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br><a href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a><br></blockquote></div><br></body></html>