
William J Poser wjposer at LDC.UPENN.EDU
Tue Feb 26 23:09:01 UTC 2008


Regarding the Apachean characters that "are not directly supported" by
Unicode, I can't speak for Mia but when I've heard such things before
it usually means that Unicode does not provide a single codepoint for
the character. For example, Unicode includes lower case <a> with an acute
accent (U+00E1) and lower case <a> with a subscript hook (U+0105) (the
subscript hook is called "ogonek" in Unicode-ese), but does not provide
a single codepoint for lower case <a> with both an acute accent and a
subscript hook, as would be used in Navajo for a high-toned nasalized /a/.
That doesn't mean that you can't get such a character in Unicode.
The sequence U+00E1 U+0328 (lower case a with acute accent followed by
combining ogonek) should be rendered as a lower case a with acute accent
and subscript hook. The problems in such cases are that (a) the rendering
software and font may not do this properly and (b) operations such as
searching and sorting have to know that this character is represented by
a sequence of two codepoints in order to handle it properly.
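
To make point (b) a bit more concrete, here is a minimal Python sketch
using the standard unicodedata module. As the example letter it takes
lower case a with acute accent plus combining ogonek (U+00E1 U+0328); the
same reasoning should apply to any accented base letter plus a combining
mark. The helper name same_text is made up for illustration. The point is
that the same letter can be spelled with different codepoint sequences, so
a naive comparison fails unless both sides are normalized first.

```python
import unicodedata

# Two spellings of the same letter (example: a with acute accent and ogonek).
s1 = "\u00e1\u0328"   # a-with-acute, then combining ogonek
s2 = "\u0105\u0301"   # a-with-ogonek, then combining acute

def same_text(x, y):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)

print(s1 == s2)           # naive codepoint-by-codepoint comparison
print(same_text(s1, s2))  # comparison after canonical normalization

# Even after normalization there is no single codepoint for this letter.
print(len(unicodedata.normalize("NFC", s1)))
```

Search and sort routines would need to normalize in this way (or use a
library that does) before comparing strings.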

It is possible in principle to request the addition of codepoints for
such compound characters to Unicode. However, the Unicode Consortium is
not thrilled by such requests. As I understand it, they don't like to
clutter things up by encoding additional characters unnecessarily. In the
cases in which they have done so, the motivation was reportedly consistency
with previous character sets. (That is, if an existing encoding had a single
codepoint for a character, Unicode also has a single codepoint for it in
order to simplify conversion between the older encoding and Unicode.)

Rendering software is getting pretty sophisticated and more and more
programs are adopting one of the sophisticated libraries to do their
rendering, so my impression is that failure to combine combining characters
properly is a problem that will disappear fairly soon. The processing
problems posed by what we want to think of as single characters that
are represented by two or more codepoints are a bit more difficult. There
are libraries that deal with these things, but they make simple/naive
processing harder.
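
As a rough illustration of why naive processing breaks, the Python sketch
below counts and reverses a string both codepoint by codepoint and cluster
by cluster. The clusters function is my own simplification (a base character
plus any combining marks that follow it), not a full implementation of
Unicode's segmentation rules, and the test string is an arbitrary example
rather than a real word.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it.

    A simplification: real grapheme segmentation has more rules than this.
    """
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new cluster
    return out

# Arbitrary example: a + combining ogonek + combining acute, then t.
word = "a\u0328\u0301t"

print(len(word))            # counts codepoints
print(len(clusters(word)))  # counts what a reader sees as letters

naive = word[::-1]                           # marks end up attached to the t
careful = "".join(reversed(clusters(word)))  # marks stay with the a
print(naive == careful)
```

Anything that indexes, truncates, or reverses strings one codepoint at a
time runs into the same issue.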

Bill



More information about the Ilat mailing list