[Corpora-List] Hindi and Arabic Romanization Software
Orion Montoya
orion at mdcclv.com
Mon Jan 19 17:32:48 UTC 2009
I can address the Hindi part, even though I could have sworn I had
done it for Perso-Arabic too, but can't now find evidence of that:
I think the standard thing to do would be to use IBM's ICU
(http://www-01.ibm.com/software/globalization/icu/index.jsp); I see
there are also Python bindings to ICU (http://pyicu.osafoundation.org/).
But when I had this task in 2003 I couldn't deal with writing code
using that library, and the ICU 'uconv' binary (which I otherwise
highly recommend, and which could be enough for your purposes) wasn't
going to work either. So I wrote up a Perl system myself, and it is
apparently still in use by the project I wrote it for
(http://dsal.uchicago.edu/dictionaries/). It's pretty tailored to the
particular needs of that project, so
it may or may not be useful to you.
Those needs are, basically, to convert a bunch of dictionary data, a
mixture of Indic headwords and runons with English text, where the
Indic stuff is either in ISCII or in UTF-8. It gives you the option
of replacing any of the Indic stuff with a transliteration -- either a
diacriticalized romanization in UTF-8, or that same roman but with
SGML entities for the diacritics, or a dumbed/normalized 7-bit ASCII
of the sort a user might try searching on (or the Indic itself in
whichever of UTF/ISCII encodings it was not initialized with). ISCII
(http://varamozhi.sourceforge.net/iscii91.pdf) encodes ten different
Indic scripts with a single character set -- for your purposes,
Hindi = Devanagari.
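To make the three roman output forms concrete, here is a minimal sketch in Python (not Obliterator's Perl, and not its actual API). It assumes you already have the diacriticalized romanization in UTF-8 and derives the other two forms from it. The function names are hypothetical, and numeric character references stand in for whatever SGML entity names the project really used.

```python
import unicodedata


def to_entities(text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference.

    Assumption: numeric references (&#NNN;) are a stand-in here; the
    real project may well have used named SGML entities instead.
    """
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


def to_ascii(text: str) -> str:
    """Reduce a diacriticalized romanization to searchable 7-bit ASCII."""
    # NFD splits a letter like 'ī' into 'i' plus a combining macron.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Discard anything still outside ASCII, then lowercase for matching.
    return stripped.encode("ascii", "ignore").decode("ascii").lower()


roman = "hindī śabdakośa"
print(to_entities(roman))
print(to_ascii(roman))
```

The ASCII form is lossy on purpose: it is meant to match what a user without a diacritics keyboard would type into a search box, not to round-trip back to the Indic.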
It comes with the Philologic software distribution, at
http://philologic.uchicago.edu/philologic3/distribution/philologic-v3.1.t2.tar.gz
(in the subdirectory goodies/Obliterator-new). There are a couple
of sample scripts and some POD documentation embedded. Philologic is
licensed AGPL; Obliterator is under the Perl Artistic License (I think
because it inherited the license of some code I adapted).
Whether it will be more or less useful than uconv depends on whether
you're just encoding straight, running text; for mixed text the Python
bindings to ICU may well serve you better now -- and will handle
Arabic with the same level of effort, instead of requiring a whole,
separate implementation. But if Obliterator is the kind of thing you
like, it's the kind of thing you like.
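For comparison, the ICU route through the Python bindings might look roughly like the sketch below. It assumes the third-party PyICU package is installed and that the ICU build provides the "Devanagari-Latin", "Arabic-Latin" and "Latin-ASCII" transforms; the romanize() helper is a hypothetical name for illustration, not part of PyICU.

```python
try:
    import icu  # PyICU, a third-party package; not in the standard library
except ImportError:
    icu = None


def romanize(text: str, rules: str = "Devanagari-Latin") -> str:
    """Transliterate text using an ICU transform ID (hypothetical helper)."""
    if icu is None:
        raise RuntimeError("PyICU is not installed")
    # A compound ID such as "A-B; B-C" chains transforms left to right.
    return icu.Transliterator.createInstance(rules).transliterate(text)


if icu is not None:
    # Hindi with diacritics, then reduced to plain ASCII.
    print(romanize("हिन्दी"))
    print(romanize("हिन्दी", "Devanagari-Latin; Latin-ASCII"))
    # Arabic with vowel marks goes through the same call.
    print(romanize("كِتَاب", "Arabic-Latin"))
```

Switching scripts is only a change of transform ID, which is the "same level of effort" point above. Handling mixed English and Indic text would still need some logic of your own to decide which spans to pass through.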
(The clever name: "Ob-" because it uses objects, "-literator" because
it transliterates (and transcodes)).
Yours,
Orion
On Jan 19, 2009, at 8:23 AM, WHITELOCK, Pete wrote:
>
> Can anyone recommend software for Romanizing Hindi text and/or
> Arabic (with vowel marks)?
>
> Pete Whitelock
> Data and Resources Development Manager
> Reference Department
> Academic Division
> Oxford University Press
>
>
>
> Oxford University Press (UK) Disclaimer
>
> This message is confidential. You should not copy it or disclose its
> contents to anyone. You may use and apply the information for the
> intended purpose only. OUP does not accept legal responsibility for
> the contents of this message. Any views or opinions presented are
> those of the author only and not of OUP. If this email has come to
> you in error, please delete it, along with any attachments. Please
> note that OUP may intercept incoming and outgoing email
> communications.
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora