[Corpora-List] Hindi and Arabic Romanization Software

Orion Montoya orion at mdcclv.com
Mon Jan 19 17:32:48 UTC 2009


I can address the Hindi part. I could have sworn I had done the same
for Perso-Arabic too, but I can't now find any evidence of that.

I think the standard thing to do would be to use IBM's ICU
(http://www-01.ibm.com/software/globalization/icu/index.jsp); I see
there are also Python bindings to ICU (http://pyicu.osafoundation.org/).
But when I had this task in 2003 I couldn't deal with writing code
using that library, and the ICU 'uconv' binary (which I otherwise
highly recommend, and which could be enough for your purposes) wasn't
going to work either.  So I wrote a Perl system myself, and it is
apparently still in use by the project I wrote it for
(http://dsal.uchicago.edu/dictionaries/).  It's pretty tailored to the
particular needs of that project, so it may or may not be useful to you.

Those needs are, basically, to convert a bunch of dictionary data: a
mixture of Indic headwords and run-ons with English text, where the
Indic stuff is either in ISCII or in UTF-8.  It gives you the option
of replacing any of the Indic stuff with a transliteration: either a
diacriticalized romanization in UTF-8, or that same roman but with
SGML entities for the diacritics, or a dumbed-down, normalized 7-bit
ASCII of the sort a user might try searching on.  (It can also give
you the Indic itself in whichever of the UTF-8/ISCII encodings it was
not initialized with.)  ISCII
(http://varamozhi.sourceforge.net/iscii91.pdf) encodes ten different
Indic scripts with a single character set; for your purposes, Hindi =
Devanagari.
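
To make those three output forms concrete, here is a minimal Python
sketch. It is not Obliterator's code (that is Perl), and everything in
it is an assumption made for illustration: the function names are
hypothetical, the lookup tables cover only a handful of Devanagari
letters, numeric character references stand in for the project's SGML
entities, and Hindi's deletion of word-final "a" is not handled.

```python
import unicodedata

# Deliberately tiny tables: just enough Devanagari for the sample entry.
CONSONANTS = {"\u0939": "h", "\u0928": "n", "\u0926": "d",
              "\u0915": "k", "\u0930": "r", "\u0938": "s"}
VOWEL_SIGNS = {"\u093e": "\u0101",   # aa sign -> a with macron
               "\u093f": "i",        # i sign
               "\u0940": "\u012b",   # ii sign -> i with macron
               "\u0947": "e", "\u094b": "o"}
VIRAMA = "\u094d"  # suppresses the inherent "a" of a consonant


def romanize(text):
    """Replace the Devanagari in mixed text; leave everything else alone."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGNS:      # explicit vowel sign follows
                out.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:         # bare consonant, no vowel
                i += 1
            else:                       # inherent vowel
                out.append("a")
        else:
            out.append(ch)              # English text, punctuation, etc.
        i += 1
    return "".join(out)


def to_char_refs(roman):
    """Same romanization, with non-ASCII letters as numeric references."""
    return roman.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_ascii(roman):
    """Search-friendly 7-bit form: decompose, then drop the diacritics."""
    decomposed = unicodedata.normalize("NFD", roman)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    entry = "\u0939\u093f\u0928\u094d\u0926\u0940 (n.) the Hindi language"
    roman = romanize(entry)
    print(roman)
    print(to_char_refs(roman))
    print(to_ascii(roman))
```

The point of the design is that only the Indic runs are touched, so the
English glosses pass through unchanged in all three forms.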

It comes with the Philologic software distribution
(http://philologic.uchicago.edu/philologic3/distribution/philologic-v3.1.t2.tar.gz),
in the subdirectory goodies/Obliterator-new.  There are a couple of
sample scripts and some POD documentation embedded.  Philologic is
licensed AGPL; Obliterator is under the Perl Artistic License (I think
because it inherited the license of some code I adapted).

Whether it will be more or less useful than uconv depends on whether
you're just converting straight, running text.  For mixed text, the
Python bindings to ICU may well serve you better now, and they will
handle Arabic with the same level of effort instead of requiring a
whole separate implementation.  But if Obliterator is the kind of
thing you like, it's the kind of thing you like.
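
For the ICU route, a rough sketch of what the Python side might look
like, assuming the PyICU package is installed. The wrapper name
icu_romanize is hypothetical; Transliterator.createInstance and
transliterate are PyICU calls, and "Devanagari-Latin" and "Arabic-Latin"
are ICU transform IDs. I have not checked its output against your data,
so treat it as a starting point only.

```python
try:
    from icu import Transliterator  # provided by the PyICU package
except ImportError:                 # PyICU missing; icu_romanize() says so
    Transliterator = None


def icu_romanize(text, transform_id="Any-Latin"):
    """Romanize text with an ICU transform such as "Devanagari-Latin"."""
    if Transliterator is None:
        raise RuntimeError("PyICU is required for this sketch")
    return Transliterator.createInstance(transform_id).transliterate(text)


if __name__ == "__main__" and Transliterator is not None:
    # One Hindi word and one vocalized Arabic word, same call for both.
    print(icu_romanize("\u0939\u093f\u0928\u094d\u0926\u0940", "Devanagari-Latin"))
    print(icu_romanize("\u0643\u0650\u062a\u064e\u0627\u0628", "Arabic-Latin"))
```

Only the transform ID changes between the two scripts, which is the
"same level of effort" point above.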

(The clever name: "Ob-" because it uses objects, "-literator" because
it transliterates (and transcodes).)

Yours,

Orion


On Jan 19, 2009, at 8:23 AM, WHITELOCK, Pete wrote:

>
> Can anyone recommend software for Romanizing Hindi text and/or  
> Arabic (with vowel marks)?
>
> Pete Whitelock
> Data and Resources Development Manager
> Reference Department
> Academic Division
> Oxford University Press
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
