forum

William J Poser wjposer at LDC.UPENN.EDU
Thu Feb 28 22:41:23 UTC 2008


Mia wrote:
>Also, there is the cultural-linguistic intersection: the internal
>numeric sequence may not be what is wanted for the language;
>for example, in Apachean, the glottal has to sort first.

Exactly. Unicode understands this full well. There is no assumption
that the order of Unicode codepoints is the order in which they
should sort (beyond the fact that this is what you will get if
you sort naively). Unless it just happens that the numerical order
is what you need, in order to sort correctly you must use software
that can be given a specification of the sort order to use.
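
To make that concrete, here is a minimal sketch in Python of sorting with an explicit order specification. The alphabet below is hypothetical (glottal stop first, then a-z) and the words are made up; it is not the actual Apachean sort order, and a real project would more likely use a full collation library.

```python
# Hypothetical alphabet, for illustration only: glottal stop first, then a-z.
ALPHABET = "\u0294" + "abcdefghijklmnopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def sort_key(word):
    # Rank each character by its position in ALPHABET; characters not
    # listed there sort after all listed ones, by codepoint.
    return [(RANK.get(ch, len(ALPHABET)), ord(ch)) for ch in word]

# Made-up words; "\u0294" is LATIN LETTER GLOTTAL STOP.
words = ["bana", "\u0294ana", "ana"]

naive = sorted(words)                   # codepoint order: glottal stop last
tailored = sorted(words, key=sort_key)  # specified order: glottal stop first
```

With the naive sort the glottal-stop word ends up last, since U+0294 is numerically higher than the ASCII letters; with the key function it comes first.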

>I don't know if "insert" puts the characters into ascending sequence.
>I believe that the codes are stored as they are inserted. 

I have the impression that two sorts of ordering are getting conflated
here. One sort of ordering is the ordering used when sorting (or in
Unicode terms, "collating"). This is language-specific. The other sort
of ordering is the order in which certain codepoints must appear in
Unicode, e.g. the fact that a combining character such as an acute
accent must follow the base character, such as an <a>, and the fact
that in normalized Unicode text the combining characters must appear
in a certain order. This second kind of ordering is defined by
Unicode. It is NOT language-specific. From the Unicode point of
view, it is up to Apaches to decide whether glottal stop should sort
first, but it is not up to them to decide on the order in which
combining characters occur.

However, it is okay if your input software produces combining
characters in a non-canonical order, so long as the text is normalized
before you do things like sorting or comparing strings.
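
As a rough illustration, here is a sketch using Python's standard unicodedata module. The example sequence (an <a> with acute accent and ogonek, typed in two different orders) is my own choice for the demonstration, not something from the earlier messages.

```python
import unicodedata

ACUTE = "\u0301"   # COMBINING ACUTE ACCENT
OGONEK = "\u0328"  # COMBINING OGONEK

# The same letter entered with its combining characters in two orders.
typed_one_way = "a" + ACUTE + OGONEK
typed_other_way = "a" + OGONEK + ACUTE

# As raw codepoint sequences the two strings differ.
raw_equal = typed_one_way == typed_other_way

# Normalization puts the combining characters into canonical order,
# which is determined by each character's combining class.
nfd_one = unicodedata.normalize("NFD", typed_one_way)
nfd_other = unicodedata.normalize("NFD", typed_other_way)
nfc_one = unicodedata.normalize("NFC", typed_one_way)
nfc_other = unicodedata.normalize("NFC", typed_other_way)

# Combining classes, looked up rather than assumed.
classes = (unicodedata.combining(ACUTE), unicodedata.combining(OGONEK))
```

After normalization the two inputs are identical, so sorting or comparing them gives the same result whichever order the input software produced.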

Bill



More information about the Ilat mailing list