Unicode SEA script find/replace utility?

Christopher Fynn cfynn at gmx.net
Sat Jun 5 12:36:38 UTC 2004


navako wrote:

>
> I've been searching around for some combination of software and
> (apple-)script to allow transliteration *between* SEA Unicode scripts
> --of course, the same setup could easily enough convert romanized text
> to and from SEA text.
> Has anyone found a decent solution for this?  In my case, I'm
> interested in switching Pali text between Khmer and Burmese
> (theoretically, I would also want to try to get it working with
> Sinhalese) scripts.  I'm sure someone must have a comparable problem
> switching between modern Thai and Khom (or modern Khmer).
> I'm using a Mac (OS10), but any information for any platform would be
> welcome.
> Eisel M.

Dear Eisel

There is a transliteration feature in the specification for AAT
fonts which, in theory at least, should allow you to build an AAT font
with glyphs for all the target scripts you are interested in, and
switch the display of Pali data encoded with the Unicode characters of
one script to the glyphs of another. While I think this would be an
elegant solution, since it leaves the underlying data intact, it would
be quite a lot of work to implement, and it would only be useful on OS X
and other systems that support AAT fonts and features. There is no
similar feature in OpenType.

It should be reasonably straightforward for someone to write a Perl or
Python script to transcode text files between these scripts - a more
sophisticated script might transcode HTML or XML files on the fly
depending on the user's preference. It would probably also be nice to
have a search feature that would allow you to search or compare strings
of Pali text no matter which script's characters were used to encode the
text. If you need to convert RTF, MS Word or other proprietary-format
files from one encoding to another while retaining formatting etc., it
gets more complicated.
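To give an idea of what such a script would look like, here is a minimal
Python sketch. It maps only the first five consonants of the traditional
Indic alphabet (ka, kha, ga, gha, nga), which happen to sit at the start
of both the Khmer (U+1780..) and Myanmar (U+1000..) blocks; a usable
table would need the full consonant, vowel-sign and virama inventory of
each script, so treat this purely as an illustration of the technique:

```python
# Khmer and Myanmar letters for ka..nga; both blocks list these
# consonants in the traditional Indic order, so they pair up directly.
KHMER = "\u1780\u1781\u1782\u1783\u1784"
MYANMAR = "\u1000\u1001\u1002\u1003\u1004"

# str.maketrans builds a per-character translation table.
KHMER_TO_MYANMAR = str.maketrans(KHMER, MYANMAR)
MYANMAR_TO_KHMER = str.maketrans(MYANMAR, KHMER)

def khmer_to_myanmar(text: str) -> str:
    """Replace each mapped Khmer letter with its Myanmar counterpart."""
    return text.translate(KHMER_TO_MYANMAR)

def myanmar_to_khmer(text: str) -> str:
    """The inverse mapping, Myanmar back to Khmer."""
    return text.translate(MYANMAR_TO_KHMER)

# A round trip leaves the text unchanged.
sample = "\u1780\u1781"  # "ka kha" in Khmer letters
assert myanmar_to_khmer(khmer_to_myanmar(sample)) == sample
```

A real converter would read a mapping table from a file rather than
hard-coding it, and would need special handling wherever the two
scripts' spelling conventions for Pali diverge rather than mapping
character for character.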

SIL have something called TECkit, described as "a low-level toolkit
intended to be used by other applications that need to perform encoding
conversions (e.g., when importing legacy data into a Unicode-based
application). The primary component of the TECkit package is therefore a
library that performs conversions; this is the 'TECkit engine.' The
engine relies on mapping tables in a specific binary format (for which
documentation is available); there is a compiler that creates such
tables from a human-readable mapping description (a simple text file)."
It is available at:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=TECkit
You will also find some documents that might be useful at the same site:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=Conversion

While TECkit and similar utilities are primarily designed to convert
from legacy encodings to Unicode, it should be reasonably
straightforward to adapt this sort of thing to convert text encoded
using one Unicode script block to characters in another Unicode block.
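The same block-to-block mapping idea also gives you the
script-independent search mentioned above: normalise every string to one
canonical script before comparing. A sketch, again covering only the
ka..nga consonants and (arbitrarily) using Khmer as the canonical form:

```python
# Map Myanmar ka..nga onto the corresponding Khmer letters; text already
# in Khmer passes through translate() unchanged.
MYANMAR_TO_KHMER = str.maketrans("\u1000\u1001\u1002\u1003\u1004",
                                 "\u1780\u1781\u1782\u1783\u1784")

def canonical(text: str) -> str:
    """Reduce text to the canonical (Khmer) representation."""
    return text.translate(MYANMAR_TO_KHMER)

def same_pali(a: str, b: str) -> bool:
    """True if a and b spell the same text, whichever script each uses."""
    return canonical(a) == canonical(b)

# Khmer "ka" and Myanmar "ka" compare equal under the canonical form.
assert same_pali("\u1780", "\u1000")
```

A fuller version would fold in every supported script, and substring
search would work the same way, by canonicalising both the query and the
text being searched.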

ICU (http://oss.software.ibm.com/icu/) also provides routines in C and
Java to do character conversion.

Another resource which may be useful is Unicode Technical Report #22
(UTR22), which specifies the Character Mapping Markup Language
(CharMapML), "an XML format for the interchange of mapping data for
character encodings": http://www.unicode.org/reports/tr22/


- Chris



More information about the Sealang-l mailing list