
Thu Feb 28 22:34:10 UTC 2008

Hi Mia,

On 29/02/2008, Mia Kalish <MiaKalish at learningforpeople.us> wrote:
> Hi, Andrew,
>
> You wrote:
>  This is a software issue, collation sorting routines should be able to
>  work on multiple characters, not just single characters.
>
>  If data is normalised, and you have software that has properly
>  implemented Unicode collation and allows you to specify language
>  specific collation, it should be possible to sort a letter that
>  includes a combining diacritic correctly, after all some languages
>  need to be able to sort digraphs and trigraphs correctly as well.
>
>  Its a software limitation. Not a Unicode issue.
>
> ------------------------------------------------------------------
>
>  I don't think this is a multiple character issue, I think it is a sequencing
>  issue. I don't know if "insert" puts the characters into ascending sequence.
>  I believe that the codes are stored as they are inserted. Also, there is the
>  cultural-linguistic intersection: the internal numeric sequence may not be
>  what is wanted for the language; for example, in Apachean, the glottal has
>  to sort first. So overall, I think the concept "correctly" is culturally
>  dependent.
>  .....................................

I'm not quite sure what you are expressing. If you could give me a use
case, I might understand better.

But for me I distinguish between

1) the alphabet (if it's an alphabet) and orthography of a language I'm
working with.
2) the Unicode character sequences I need to represent each letter or
character in the language. This may be one to one, or one to many.
3) how the data is typed, i.e. input into a document. The characters are
stored in logical (Unicode) order. Some keyboard layouts act
character by character, so I need to type the characters in the order
I want them stored in the data.

But in some languages I might want to type visually, i.e. the order in
which a character visually appears and the order in which it is stored
may be different.

E.g. in S'gaw Karen ... the medial ra always follows a consonant in
Unicode data, but visually the medial displays in front of and wraps
around the consonant.

So some users expect to type the medial before the consonant. A
keyboard designed for visual input would reorder the characters into
the proper Unicode sequence after the character sequence has been
typed.
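
A rough sketch of that reordering step in Python, just to make the idea
concrete. This is not how any particular keyboard is implemented. The
function names are made up for the illustration, and the consonant test
is a simplifying assumption (basic Myanmar consonants U+1000..U+1021
only, ignoring the Karen-specific letters).

```python
MEDIAL_RA = "\u103c"  # MYANMAR CONSONANT SIGN MEDIAL RA


def is_consonant(ch):
    # Simplifying assumption: only the basic Myanmar consonant range.
    return "\u1000" <= ch <= "\u1021"


def visual_to_logical(typed):
    """Move a medial ra typed before a consonant to after it."""
    out = []
    i = 0
    while i < len(typed):
        nxt = typed[i + 1] if i + 1 < len(typed) else ""
        if typed[i] == MEDIAL_RA and nxt and is_consonant(nxt):
            # Typed visually: medial first, consonant second.
            # Stored logically: consonant first, medial second.
            out.append(nxt)
            out.append(MEDIAL_RA)
            i += 2
        else:
            out.append(typed[i])
            i += 1
    return "".join(out)
```

The sketch assumes its input was typed in visual order. A real keyboard
has to handle many more cases.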

In some African languages, you may have a sequence open-e, combining
macron below, combining diaeresis. Obviously on a simple keyboard some
users may type the macron below first, and others may type it second,
after the diaeresis.

With a simple keyboard layout, you have no control over this. With a
more sophisticated keyboard you can reorder the sequences as they type.

One Keyman keyboard we developed for Dinka (Sudan) handles breathy long
vowels. Breathiness is indicated by a diaeresis, and length is indicated
by doubling a vowel, so a long breathy 'a' would be "ää". But you
can't get "aä" or "äa" as sequences, only "aa" or "ää".

So the keyboard allows you to type the long "a", i.e. "aa" and then
the diaeresis key which would insert the diaeresis over both
characters producing "ää".
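
The logic is roughly the following, written as a Python sketch rather
than as actual Keyman rules. The function name and the vowel set are
assumptions for the illustration, and so is the behaviour for a single
(short) vowel, which I haven't described above.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS
# Assumed vowel set for the sketch: a e i o, open-e, open-o.
VOWELS = set("aeio\u025b\u0254")


def press_diaeresis(buffer):
    """Return the text buffer after the diaeresis key is pressed."""
    if (len(buffer) >= 2 and buffer[-1] in VOWELS
            and buffer[-1] == buffer[-2]):
        # Long vowel: put the diaeresis over both characters.
        vowel = buffer[-1]
        result = buffer[:-2] + (vowel + DIAERESIS) * 2
    elif buffer and buffer[-1] in VOWELS:
        # Short vowel (assumed behaviour): mark the single vowel.
        result = buffer + DIAERESIS
    else:
        # Nothing to attach the diaeresis to: leave the text alone.
        result = buffer
    return unicodedata.normalize("NFC", result)
```

With this, typing "aa" and then the diaeresis key gives "ää", and the
mixed sequences "aä" and "äa" are never produced by that key.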

4) before we store data, in our projects we prefer to normalise it,
especially for HTML or XML documents.

5) The next trick, at least for collation and matching, is to implement
language specific routines.
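
To illustrate point 4 with the open-e example from earlier: a small
Python sketch using the standard unicodedata module. The two variable
names are only for the illustration. Whichever order the two combining
marks were typed in, normalisation puts them into the same canonical
order, so the stored data matches.

```python
import unicodedata

OPEN_E = "\u025b"        # LATIN SMALL LETTER OPEN E
MACRON_BELOW = "\u0331"  # COMBINING MACRON BELOW
DIAERESIS = "\u0308"     # COMBINING DIAERESIS

# The same letter typed by two users in two different orders.
macron_first = OPEN_E + MACRON_BELOW + DIAERESIS
diaeresis_first = OPEN_E + DIAERESIS + MACRON_BELOW

# As typed, the two strings are different code point sequences.
print(macron_first == diaeresis_first)

# After normalisation they are identical.
nfc_a = unicodedata.normalize("NFC", macron_first)
nfc_b = unicodedata.normalize("NFC", diaeresis_first)
print(nfc_a == nfc_b)
```

This only works because one mark sits below the letter and the other
above it. Marks that attach in the same position are not reordered by
normalisation.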

Hope that makes sense.

Point 5 is the hard part, because it relies on software developers
implementing all of the Unicode standard, not just parts of it, and
providing mechanisms for you to define language specific collation
routines and to identify what the valid characters are for that
language.

In a locale definition in CLDR you can identify digraphs and trigraphs
as individual letters. You can use the same mechanism to indicate a
base character and one or more combining characters should be treated
as one letter.

That was done in the Yoruba locale for instance.
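
To show the idea only, here is a hand-rolled Python sketch that treats
the digraph "gb" and the dotted letters as single letters when sorting.
It does not use CLDR data or the Unicode Collation Algorithm. The names
ALPHABET, RANK and collation_key are made up for the illustration, tone
marks are not handled properly, and the sample strings are just test
strings.

```python
import unicodedata

# Yoruba alphabet order, with "gb" as one letter (NFC code points).
ALPHABET = ["a", "b", "d", "e", "\u1eb9", "f", "g", "gb", "h", "i",
            "j", "k", "l", "m", "n", "o", "\u1ecd", "p", "r", "s",
            "\u1e63", "t", "u", "w", "y"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
MAX_LEN = max(len(letter) for letter in ALPHABET)


def collation_key(word):
    """Split a word into letters (longest match) and rank each one."""
    text = unicodedata.normalize("NFC", word).lower()
    key = []
    i = 0
    while i < len(text):
        for size in range(MAX_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in RANK:
                key.append(RANK[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character: sort it after every known letter.
            key.append(len(ALPHABET) + ord(text[i]))
            i += 1
    return key


words = ["gbo", "go", "ga", "ha"]
print(sorted(words))                     # plain code point order
print(sorted(words, key=collation_key))  # "gb" sorts after "g"
```

A plain sort puts "gbo" between "ga" and "go". With the key function,
"gbo" comes after both, because "gb" is a separate letter following
"g".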

The issue is whether software developers use or pay attention to this data.

The mechanisms exist in Unicode and CLDR to handle everything we need
to be handled.

The issue is implementation: whether these features are included, and
whether we can define and customise our own language rules.

I'd suggest that there are large parts of the Unicode standard not
implemented in most software.

Hope this makes sense.

Just my take on it.

Few applications are really suitable for lesser used languages.

Andrew
-- 
Andrew Cunningham
Vicnet Research and Development Coordinator
State Library of Victoria
Australia

andrewc at vicnet.net.au
lang.support at gmail.com