[Lexicog] UNICODE
Mike Maxwell
maxwell at LDC.UPENN.EDU
Tue Sep 13 16:59:04 UTC 2005
Jimm GoodTracks wrote:
> ...The key is to have easy and reliable application software programs
> available, in order to do the conversions without it becoming a chore.
Agreed. The worst case is an undocumented hacked font for a writing system
with more characters than fit into 8 bits, as is the case for many Indic
languages. (Such fonts either encode a single character in more than one
byte, which is also what Unicode encodings such as UTF-8 do, or encode
pieces of the glyphs in single bytes and then combine these pieces, rather
as if 'b' were composed out of 'l' and 'o', and 'd' were composed out of
'o' and 'l'.) Figuring out how these fonts work can be a major task.
The next worst case, and the one we had for Yoruba, is where there is a
hacked 8-bit font, and all the characters fit into the 8 bits. Then you
just create a text with all 256 code points (0 through 255), and eyeball
it to figure out how to represent each character in Unicode. In this case,
the encoding converter is little more than a table.
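To make the "little more than a table" point concrete, here is a minimal
sketch in Python. The byte values in the table are invented for an imaginary
hacked font; they are not the mapping of any real Yoruba font, and the
fallback of reading unmapped bytes as Latin-1 is likewise an assumption.

```python
# Hypothetical table for an imaginary hacked 8-bit font. The byte values are
# made up for illustration; a real table comes from eyeballing the font.
LEGACY_TO_UNICODE = {
    0xE0: "\u1EB9",        # e with dot below
    0xE1: "\u1ECD",        # o with dot below
    0xE2: "\u1E63",        # s with dot below
    0xE3: "\u1EB9\u0301",  # e with dot below, followed by combining acute
}


def convert(data: bytes) -> str:
    """Map each legacy byte through the table.

    Bytes missing from the table are assumed to be plain Latin-1.
    """
    return "".join(LEGACY_TO_UNICODE.get(b, chr(b)) for b in data)


if __name__ == "__main__":
    print(convert(b"\xe1k\xe1"))
```

Note that one legacy byte may map to more than one Unicode code point, which
is why the table values are strings rather than single characters.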
The difficulty for me lay in figuring out the correct order for the stacked
diacritics, i.e. interpreting the normalization standard. (I got it wrong,
and when I got it right, the font didn't work :-).) Note that this is not
a problem for all languages, since many writing systems don't have stacked
diacritics.
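As an illustration of the ordering question (not a record of what I actually
ran), Python's standard unicodedata module can show what the normalization
forms do with a dot below and an acute on the same base letter. The choice of
'e' with those two marks is just an example.

```python
import unicodedata

# The same visual character typed with its two marks in either order.
ACUTE_THEN_DOT = "e\u0301\u0323"  # e, combining acute, combining dot below
DOT_THEN_ACUTE = "e\u0323\u0301"  # e, combining dot below, combining acute


def code_points(text: str) -> list:
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def normalize(text: str, form: str = "NFC") -> str:
    """Apply a Unicode normalization form (NFC, NFD, ...)."""
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # Marks are sorted by canonical combining class: below (220) before above (230).
    print(unicodedata.combining("\u0323"), unicodedata.combining("\u0301"))
    print(code_points(normalize(ACUTE_THEN_DOT, "NFD")))
    print(code_points(normalize(ACUTE_THEN_DOT, "NFC")))
```

Both input orders normalize to the same sequence, with the dot below placed
before the acute; under NFC the base and the dot below are then composed into
a single code point and the acute stays as a combining mark. Whether a given
font renders that sequence correctly is a separate matter.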
The easy case is when you've used a standard 8-bit font, in which case
there is almost certain to be an encoding converter already available, such
as iconv or SIL's converter from the old Doulos IPA fonts to Unicode IPA.
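For the easy case, a short sketch using Python's built-in codecs, on the
assumption that the data is in Windows-1252 (the actual encoding depends on
your data). On the command line, `iconv -f CP1252 -t UTF-8` does the same job.

```python
def legacy_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a standard 8-bit encoding and re-encode as UTF-8.

    The default of cp1252 is only an assumption for this example.
    """
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    print(legacy_to_utf8(b"caf\xe9").decode("utf-8"))
```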
The other consideration is when you've used different fonts in different
places, e.g. in different fields in your dictionary. If some code points
are used differently in those fonts, then you need to run encoding
conversion over some fields (such as fields which contain data in the
target language) and not others (such as English gloss fields). The SIL
encoding converters are generally set up to allow this, particularly where
the differences are represented as SFM-marked (Shoebox style) fields.
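The per-field idea can be sketched in a few lines of Python. This is not how
the SIL converters are implemented; it is only an illustration. The marker
names (\lx for the headword, \ge for the English gloss) and the one-entry toy
table are assumptions, and fields wrapped over several lines are not handled.

```python
# Toy table: in the imaginary hacked font, byte 0xE1 displays as o with dot
# below, while in the gloss font the same byte is an ordinary a-acute.
TOY_TABLE = {"\xe1": "\u1ECD"}


def convert_value(value: str) -> str:
    """Convert one field value through the toy table."""
    return "".join(TOY_TABLE.get(ch, ch) for ch in value)


def convert_fields(lines, markers):
    """Convert only the fields whose SFM marker is listed in markers."""
    out = []
    for line in lines:
        marker, sep, value = line.partition(" ")
        if marker.startswith("\\") and marker[1:] in markers:
            out.append(marker + sep + convert_value(value))
        else:
            out.append(line)  # leave all other fields untouched
    return out


if __name__ == "__main__":
    record = ["\\lx \xe1k\xe1", "\\ge \xe1 la carte"]
    for line in convert_fields(record, {"lx"}):
        print(line)
```

Here the same code point is rewritten in the target-language field and left
alone in the gloss field.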
Mike Maxwell
CASL (formerly LDC)