forum

Andrew Cunningham lang.support at GMAIL.COM
Wed Feb 27 11:07:51 UTC 2008


The only use of CGJ (Combining Grapheme Joiner) with Latin script that
I'm aware of is at the national library in Germany, to distinguish
between the umlaut and the trema.

Umlaut is represented by <U+0308>
Trema is represented by <U+034F U+0308>
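
To illustrate why the CGJ matters here, a minimal Python sketch,
assuming the standard unicodedata module and using the base letter "a"
purely as an example (the variable names are mine): under NFC the plain
sequence composes to a single precomposed character, while the sequence
with U+034F is left untouched, so the two stay distinguishable.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

# Example base letter "a"; any letter with a precomposed form behaves alike.
umlaut = "a" + DIAERESIS        # umlaut: <U+0308>
trema = "a" + CGJ + DIAERESIS   # trema:  <U+034F U+0308>

umlaut_nfc = unicodedata.normalize("NFC", umlaut)
trema_nfc = unicodedata.normalize("NFC", trema)


def codepoints(text):
    """Return the code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(codepoints(umlaut_nfc))  # composed to the precomposed letter
print(codepoints(trema_nfc))   # CGJ blocks composition
```

The CGJ has combining class 0 and takes part in no composition, which
is why the second sequence survives normalization unchanged.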

A better way of handling digraphs would be through locale definitions.
The collation tables should include the relevant digraphs, so that
sorting can be handled.
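
As a rough sketch of what digraph-aware collation amounts to, here is a
hand-rolled Python sort key for the Welsh alphabet. This is not how a
real locale definition or collation table is written, only an
illustration of the idea; the names WELSH_ALPHABET, welsh_letters and
welsh_sort_key are made up for the example, and accented vowels are not
handled.

```python
# Welsh alphabet in dictionary order, with digraphs as single letters.
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
    "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
    "t", "th", "u", "w", "y",
]
RANK = {letter: i for i, letter in enumerate(WELSH_ALPHABET)}
DIGRAPHS = {letter for letter in WELSH_ALPHABET if len(letter) == 2}


def welsh_letters(word):
    """Split a word into Welsh letters, greedily preferring digraphs."""
    word = word.lower()
    letters = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            letters.append(pair)
            i += 2
        else:
            letters.append(word[i])
            i += 1
    return letters


def welsh_sort_key(word):
    """Sort key: one rank per Welsh letter; unknown characters sort last."""
    return [RANK.get(letter, len(WELSH_ALPHABET)) for letter in welsh_letters(word)]


words = ["chwech", "cwm", "dydd", "cath"]
print(sorted(words))                      # plain code point order
print(sorted(words, key=welsh_sort_key))  # "ch" sorts after every "c..." word
```

The greedy split treats every n+g, r+h and so on as a digraph, so it
gets words wrong where the two letters merely sit next to each other
(the n and g in Bangor, for instance).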

The only time I'd be inclined to use CGJ is when you need to
distinguish an occurrence of two letters that isn't a digraph from an
actual digraph. This was discussed as one possibility for the
romanization of Harari, i.e. to distinguish between tha (as two
Ethiopic characters, t+ha) and tha (one Ethiopic character).

On 27/02/2008, Cunliffe D J (AT) <djcunlif at glam.ac.uk> wrote:
> Hello All,
>
>  Just a small contribution to this fascinating insight into language
>  diversity, from the Welsh language, Cymraeg.
>
>  A particular challenge faced when dealing with Welsh are the digraph
>  letters, each of which is composed of two characters - ch, dd, ff, ng,
>  ll, ph, rh, th.
>
>  The Welsh Language Board suggests that the 'Combining Grapheme Joiner'
>  can be used to "stick" the two characters together. They note that this
>  is fairly obscure!
>
>  Do any other languages face this problem, has the 'Combining Grapheme
>  Joiner' actually been built into any applications?
>
>  There are a number of interesting issues around sort orders, how to sort
>  Welsh words and English words together (differently for different
>  audiences) and character counts. If you are interested, there is an
>  excellent document discussing these issues and wider issues around the
>  design of bilingual software, from the Welsh Language Board:
>  http://www.bwrdd-yr-iaith.org.uk/cynnwys.php?pID=109&langID=2&nID=2063
>
>  Cheers,
>
>  Daniel.
>
>
>
>  -----Original Message-----
>  From: Indigenous Languages and Technology
>  [mailto:ILAT at LISTSERV.ARIZONA.EDU] On Behalf Of William J Poser
>  Sent: 27 February 2008 00:15
>  To: ILAT at LISTSERV.ARIZONA.EDU
>  Subject: Re: [ILAT] forum
>
>
> Andrew,
>
>  I agree except that it DOES matter whether a character is available
>  precomposed. The problem of multiple representations is indeed solved
>  by the use of normalization, though it is taking a while for
>  normalization libraries to become available for all languages and
>  for all software that should be using them to use them. But even
>  with normalization,
>  it is an additional pain to process text in which some characters
>  require two or three codepoints while some require only one. Not that
>  it can't be done, but it makes life more difficult.
>
>  Bill
>


-- 
Andrew Cunningham
Vicnet Research and Development Coordinator
State Library of Victoria
Australia

andrewc at vicnet.net.au
lang.support at gmail.com


