Unicode (was: Re: ED MEDIA 2003 (conference))

Eric Brunner-Williams in Portland Maine brunner at NIC-NAA.NET
Sat Nov 9 20:14:46 UTC 2002


Oki Heiki,

So you've seen my charset for modern siksika (no diacriticals):

        siksika.charset ::
                {a,h,i,k,m,n,o,p,s,t,w,y:`,acute-vowel}

Here's one for Frantz:

        vowels: {a,i,o}
        diphthongs: {ai,ao}
        combining diacriticals pitch accent: {',_} (ACUTE, UNDERSCORE)
        semivowels: {w,y}
        consonants: {m,n,s,p,t,k,h}

        {{a,i,o},{ai,ao},{',_},{w,y},{m,n,s,p,t,k,h}}

For Abenaki:
        abenaki.charset ::
                {a,c,d,e,i,j,k,l,m,n,o,p,s,t,u,w,z,8:',^}

Now, the rules of the game are that if you use A-Za-z and something else,
you can't ask for 52 code points plus enough more for your something else.
Here is the most common bit of code:

        int isascii(int c)
        {
                return (c & ~0x7F) == 0;
        }
A simple 7-bit test to determine whether an integer is (possibly) a character.

Suggest how to implement isfoo(c), where "foo" is the encoding used for
the Siksika (or other) character set. This will allow you to know if the
code point (integer c) is a character in your encoding.
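One way to take up that suggestion: a minimal sketch of issiksika(c) for the
modern Siksika charset given above, assuming a hypothetical 8-bit encoding
that keeps the twelve letters and the glottal-stop mark at their ASCII
positions; the slot chosen for the acute (pitch-accented) vowel, 0xE1 here,
is an assumption of this sketch, not a standard assignment.

```c
#include <string.h>

#define SIKSIKA_ACUTE 0xE1      /* assumed code point for acute vowels */

/* Test whether integer c is a code point in the (assumed) Siksika encoding. */
int issiksika(int c)
{
        static const char base[] = "ahikmnopstwy`";

        if ((c & ~0xFF) != 0)           /* not an 8-bit code point at all */
                return 0;
        if (c == SIKSIKA_ACUTE)
                return 1;
        return c != '\0' && strchr(base, c) != NULL;
}
```

The same shape works for the Frantz and Abenaki sets: one table of member
code points, one membership test.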

Next, how do you propose to implement character comparison, so that you
can lexically sort two strings?
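A sketch of what such a comparison might look like, assuming the alphabet
order given above (a h i k m n o p s t w y `); the weight table is
hypothetical, and a real collation would still have to decide where the
acute vowels sort relative to their plain forms — exactly the kind of
decision ASCII-order strcmp() never has to make.

```c
#include <string.h>

/* Collation weight of code point c under an assumed Siksika alphabet order;
 * code points outside the charset sort before everything else. */
static int weight(int c)
{
        const char *order = "ahikmnopstwy`";
        const char *p = (c != '\0') ? strchr(order, c) : NULL;

        return p ? (int)(p - order) : -1;
}

/* Compare s1 and s2 in assumed Siksika collation order, strcmp()-style. */
int siksika_strcmp(const char *s1, const char *s2)
{
        while (*s1 && *s2) {
                int w1 = weight((unsigned char)*s1);
                int w2 = weight((unsigned char)*s2);

                if (w1 != w2)
                        return w1 - w2;
                s1++;
                s2++;
        }
        return weight((unsigned char)*s1) - weight((unsigned char)*s2);
}
```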

These are just the tips of an iceberg.

Fundamentally, a code-set dependent, albeit universally valid, mechanism
has several potential weaknesses:

        o the code-set may be under-specified (it is, and incrementally
          adding characters and having private use areas does not negate
          this weakness, only mitigate it, and then only partially),
        o the universal character of the mechanism may promote specific
          default policies (it does, ASCII gets preferential treatment,
          and treating similar glyphs as unified characters is preferential
          to printer vendors, not character collators, in Chinese, Korean,
          Japanese, and Vietnamese),
        o the mechanism may be invalid for some code-points or collections
          of code-points (it is, see my earlier question about the two errors
          in Tsalagi).

I once was in the Unicode Technical Committee, but now I see some real hard
problems in Unicode-as-implemented, in Unicode-the-theory, and in the folks
that make up the Unicadettes and their approach to glyphs, characters, and
to indigenous intellectual property rights.

I never met anyone who spoke any Algonquin language when I lived in Germany.

Kitakitamatsino,
Eric


