forum

James Crippen jcrippen at GMAIL.COM
Wed Feb 27 08:16:53 UTC 2008


On Tue, Feb 26, 2008 at 1:09 PM, William J Poser <wjposer at ldc.upenn.edu> wrote:
> Regarding the Apachean characters that "are not directly supported" by
>  Unicode, I can't speak for Mia but when I've heard such things before
>  it usually means that Unicode does not provide a single codepoint for
>  the character.

A similar situation exists for several Northwest Coast languages,
particularly the one I work with, Tlingit. With these languages the
issue is the combining macron below U+0331, in text ˍ, which was
developed back in the Bad Old Days of typewriters, when backspace and
overstrike were a convenient way of extending the Latin alphabet. The
diacritic is available in a few precomposed characters in the Latin
Extended Additional range (U+1E00 to U+1EFF), namely with B/b, D/d,
K/k, L/l, N/n, R/r, T/t, and Z/z, as well as h (but not H!). For
Tlingit the popular orthography requires the combining macron under
both G/g and X/x as well as K/k. Since the former two pairs aren't
precomposed, most fonts display them unacceptably badly, if they
include the (admittedly obscure) U+0331 diacritic at all.
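
To make the precomposed/non-precomposed split concrete, here is a minimal
sketch using Python's standard unicodedata module. It applies NFC
normalization to a base letter followed by U+0331 and checks whether the
result collapses to a single codepoint. The function name compose_report
is my own, purely for illustration.

```python
import unicodedata

def compose_report(base_letters, mark="\u0331"):
    """Map each base letter to (NFC form, True if it composed to one codepoint)."""
    report = {}
    for base in base_letters:
        # NFC composes base + mark only if Unicode defines a precomposed form.
        composed = unicodedata.normalize("NFC", base + mark)
        report[base] = (composed, len(composed) == 1)
    return report

report = compose_report("KkGgXxHh")
for base, (composed, precomposed) in report.items():
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in composed)
    print(base, codepoints, "precomposed" if precomposed else "combining sequence")
```

K/k and h should come out as single codepoints, while G/g, X/x, and H
remain two-codepoint combining sequences, which is exactly where font
support tends to fall apart.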

I tested all of the fonts in a default Windows XP installation and
found that only MS Sans Serif and Tahoma support the
aforementioned characters, and they display U+0331 halfway between the
intended glyph and the following glyph. This is probably a font
problem but it could also be the renderer, I'm not absolutely certain.
Lucida Sans Unicode gets U+0331 correct, but lacks the precomposed
U+1E34 and U+1E35 (K/k with line below). The fonts from SIL (Doulos,
Charis, and Gentium) all work correctly, but their line heights are
unpleasantly large for some people and they don't work well as system
fonts.

Apple does a much better job of this in Mac OS X 10.5, supporting the
diacritic in a number of fonts. They did a particularly good job with
Lucida Grande, which is the standard system font. In addition,
Helvetica, Courier, Geneva, American Typewriter, and Bradley Hand work
as expected. I had little or no problem with creating a keyboard
layout and using it for my daily work.

Unfortunately, I can't reliably use Unicode-encoded Tlingit in email
or documents which I intend to share with others, since I can't ensure
that the characters will be even close to viewable for them. As it
stands, all the Tlingit-writing computer users I know instead use
underline markup to fulfill their needs, but this of course breaks in
any sort of operation that doesn't preserve markup, like copy-paste
from web browsers, for example.

Another, thornier problem I've encountered is how to deal with the
unfortunate combination of U+0331 macron-below and the Latin small
letter g. All the fonts I've seen display the
diacritic below the descender. The original intention of the
orthography designers was clearly to have an underscore overstruck on
the descender of the g, as can be seen in a few Tlingit works out
there. In my experience the diacritic is often rendered in such a way
that it's invisible, chopped off by the following line.

Unicode provides a character that, on the face of it, should solve
this problem, namely U+01E5 Latin small letter g with stroke. IIRC in
the Unicode standards documents it appears with the stroke through the
descender, and hence looks something like what Tlingit users expect.
However, not only does this character have even less support in the
font world (several of the aforementioned fonts lack this character),
font developers have also placed the stroke through the right stem of
the letter, or even through the upper bowl. Its capital counterpart
U+01E4 has a stroke through the stubby arm of the G, rendering it
unacceptable as a proper case pair.

To cope with all of this, in my documents I've chosen to use U+1E21
Latin small letter g with macron (above) as the lowercase form, and
U+0047 U+0331 (G macron-below) as the uppercase form. This of course
breaks case pairing, but it displays properly in most cases, avoids
the disappearing diacritic, and works well enough for my purposes. I'm
not afraid of transcoding all of my documents at some point in the
future, but I wouldn't wish that on anyone else so I haven't
promulgated a keyboard layout for either OS to anyone yet.
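
For anyone curious what the broken case pairing means in practice, here
is a rough Python sketch. The helper names tlingit_upper, tlingit_lower,
and to_macron_below are hypothetical, not part of any existing tool, and
the sketch assumes the text is in NFC (so the lowercase form really is
stored as the single codepoint U+1E21).

```python
LOWER_G = "\u1e21"    # g with macron above, my lowercase stand-in
UPPER_G = "G\u0331"   # G followed by combining macron below

def tlingit_upper(text):
    """Uppercase text, mapping U+1E21 to G + U+0331 instead of U+1E20."""
    return UPPER_G.join(part.upper() for part in text.split(LOWER_G))

def tlingit_lower(text):
    """Lowercase text, mapping G + U+0331 back to U+1E21."""
    return LOWER_G.join(part.lower() for part in text.split(UPPER_G))

def to_macron_below(text):
    """One possible future transcoding: replace U+1E21 with g + U+0331."""
    return text.replace(LOWER_G, "g\u0331")

# The built-in case mapping pairs U+1E21 with U+1E20 (G with macron above),
# which is not the uppercase form used in my documents.
default_upper = LOWER_G.upper()
custom_upper = tlingit_upper(LOWER_G + "aa")
round_trip = tlingit_lower(custom_upper)
```

Any software that relies on the standard case mappings (search, sorting,
title-casing) would need special handling along these lines, which is one
reason I've kept the convention to myself.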

A thought I had was to design an OpenType font family that had
alternate forms for U+01E4 and U+01E5 which had the proper shape for
the Tlingit orthography. It's a great idea, and although I could
probably hack this into a free font family myself, there's no way that
I'll find the time to actually do so. I thought of asking SIL to try
implementing it, but never made a coherent proposal.

>  It is possible in principle to request the addition of codepoints for
>  such compound characters to Unicode. However, the Unicode Consortium is
>  not thrilled by such requests. As I understand it, they don't like to
>  clutter things up by encoding additional characters unnecessarily. In the
>  cases in which they have done so, the motivation was reportedly
>  consistency with previous character sets. (That is, if an existing
>  encoding had a single codepoint for a character, Unicode also has a
>  single codepoint for it in order to simplify conversion between the older
>  encoding and Unicode.)

This rationale of theirs originally didn't bother me, but with the
huge increase in codespace with their additional surrogates, and with
the recent addition of a set of mathematical alphabets in italic,
bold, bold italic, script, blackletter, blackboard bold, sans serif,
bold sans, oblique sans, bold oblique sans, fixed width, bold greek,
italic greek, bold italic greek, sans bold greek, and sans bold italic
greek, I no longer comprehend their resistance to additional
precomposed Latin forms.

Anyway, that's my rant. I'm working on a paper on Tlingit
orthographies which addresses these issues and more. Hopefully I will
be able to present it at the LSA summer meeting, to which my
department (University of Hawai'i at Mānoa) has graciously offered to
send me. This discussion has reminded me that I need to finish the
abstract and paper for it, and figure out some way to sensibly and
coherently explain these sorts of Unicode problems to linguists out
there developing orthographies.

Aatlein gunalchéesh yee yei jinéiyi,
James Crippen


