[Lexicog] Tone languages in Toolbox/Shoebox

Mike Maxwell maxwell at LDC.UPENN.EDU
Sat Apr 8 16:15:56 UTC 2006


neduchi at netscape.net wrote:
> My first language, Igbo, is also a tone language. The identified tones 
> are High, Low, and Downstep; marked with < ´ >, < `>, and < - > 
> respectively. I have not found ANY software that can handle a 
> combination of the tone marks with specific alphabets as single units. 
> I have explored some options and found out the following:
> 
> (1) Unicode:
> The Unicode angle has been closed. It is no longer possible to add such 
> combined symbols/characters as 'new' Unicode characters.
> 
> (2) Fonts and Operating Systems:
> You can use specific Unicode-based fonts, like Doulos, CODE2000, 
> Gentium etc. to achieve this character combination. It works fine on 
> Windows OS with Service Pack 2; Microsoft programs on the system also 
> handle the combined characters as single units. But that's not of much 
> help to the linguist: other software not produced by Microsoft does not 
> work that well.
> 
> (3) Font Solution:
> Since the Unicode angle has been blocked, the buck has actually been 
> passed to those involved in "Font Design". It is now the font 
> developers that are to design the identified characters as 
> "pre-composed" characters. A font like "Doulos" should then contain 
> these "pre-composed" characters. That's when Shoebox, Toolbox, or any 
> other software can handle them the way we linguists would like it.
> 
> On the other hand, since Doulos has now been made available under the 
> OpenFont License, 
> (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL), maybe 
> some font designers might be in a position to add the identified 
> characters of African languages as "pre-composed" characters. That 
> should be the permanent solution to your problem and MY problem.

I don't have any answer for the original poster's problem, but I am 
puzzled by the above discussion of composed vs. non-composed characters.

You are correct that Unicode will probably not be adding pre-composed 
characters.  That is, while some combinations of diacritic + base 
character already exist, combinations beyond those will probably need to 
be represented in Unicode as a sequence of two or more characters.

First: I am surprised that you have had problems with representing the 
characters in Igbo as pre-composed characters, but maybe I don't 
understand what you're saying.  There are certainly code points 
(characters) in Unicode for pre-composed vowel + grave accent, vowel + 
acute accent, and vowel + macron.  Or are you needing to represent these 
vowel + accent together with a dot under the vowels (as in Yoruba)?
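
To make the distinction concrete, here is a minimal sketch, assuming 
Python 3 and its standard unicodedata module.  NFC normalization 
composes a base + diacritic sequence wherever Unicode defines a 
pre-composed code point and otherwise leaves a sequence.  The helper 
name compose and the sample letters are only illustrative.

```python
import unicodedata

# Combining marks (names are illustrative constants, not a library API).
GRAVE, ACUTE, MACRON, DOT_BELOW = "\u0300", "\u0301", "\u0304", "\u0323"

def compose(base, *marks):
    """Return the NFC form of a base letter followed by combining marks."""
    return unicodedata.normalize("NFC", base + "".join(marks))

def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

samples = {
    "e + acute": compose("e", ACUTE),
    "e + grave": compose("e", GRAVE),
    "a + macron": compose("a", MACRON),
    "e + dot below + acute": compose("e", DOT_BELOW, ACUTE),
}

for label, text in samples.items():
    # One code point means a pre-composed character exists; more than
    # one means the combination must remain a sequence.
    print(f"{label}: {code_points(text)}")
```

The first three come back as a single code point each.  The vowel with 
both a dot below and an acute accent comes back as two code points 
(U+1EB9 followed by U+0301), because Unicode has no pre-composed 
character for that combination.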

Second: Unicode-aware software ought to be able to handle 
non-precomposed characters just fine.  More on this below.

Third: As for adding pre-composed characters to fonts where the 
pre-composed characters are not already in Unicode (I believe that's what 
you're suggesting in (3)), that is not likely to happen, and would in my 
opinion be a bad idea if it did happen.  The reason for developing 
Unicode was to define a universal (well, global) way to represent 
characters in writing systems.  Adding pre-composed characters to what 
was otherwise a Unicode font would create a non-standard encoding, which 
would cause problems for anyone trying to read your data who didn't 
happen to have your font.  The same would be true if you added 
pre-composed characters to some sort of 8-bit encoding (analogous to all 
the ISO 8859-N encodings running around, or even worse, to the 
proprietary and undocumented Indic fonts, which are truly a mess).

There is one exception to this: Unicode sets aside an area for 
user-defined characters (the Private Use Area, PUA).  I believe this is 
intended for characters that simply cannot be represented by combining 
other Unicode characters--for Klingon, say.  Again, while font designers 
_could_ use this for characters that are not defined as pre-composed in 
Unicode, the designers of such fonts are unlikely to do so, because they 
don't want to get caught up in creating and supporting non-standards, 
and because as far as they are concerned, any software worth its salt 
ought to support un-composed characters.
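
If you want to check whether existing data already depends on Private 
Use Area code points, something like the following Python sketch could 
help.  It assumes that the general category "Co" reported by the 
standard unicodedata module is an adequate test; the function names are 
hypothetical.

```python
import unicodedata

def is_private_use(ch):
    """True if ch has Unicode general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

def find_private_use(text):
    """Return (index, code point) pairs for private-use characters in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_private_use(ch)
    ]

# U+E000 is the first code point of the Private Use Area in the BMP.
sample = "ab\ue000c"
print(find_private_use(sample))
```

Any hit means the text can only be read correctly with the particular 
font (or private agreement) that assigned a meaning to that code point.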

So let me come back to the problems you say you're having using Unicode 
with non-pre-composed characters.  Most Unicode-capable software, 
whether produced by Microsoft or not, should be capable of dealing with 
non-composed characters.  Is there some specific software that doesn't 
deal with non-composed characters that you need for linguistic work? 
(Besides Toolbox--I'll have to let someone else address how Toolbox 
works with this.)
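
As a rough illustration of what "dealing with non-composed characters" 
involves, the Python sketch below groups each base character with the 
combining marks that follow it, so that a vowel plus its tone mark is 
treated as one unit.  This is a simplification under stated assumptions: 
it relies on unicodedata.combining() returning a non-zero class for the 
marks involved, and it is not a full implementation of Unicode grapheme 
cluster segmentation.  The function name units is hypothetical.

```python
import unicodedata

def units(text):
    """Group each base character with the combining marks that follow it."""
    result = []
    for ch in text:
        # combining() returns the canonical combining class; it is
        # non-zero for marks such as U+0300 (grave) and U+0323 (dot below).
        if result and unicodedata.combining(ch):
            result[-1] += ch
        else:
            result.append(ch)
    return result

# "o" + dot below + acute, then "m", then "a": five code points, three units.
word = "o\u0323\u0301ma"
print(len(word), len(units(word)))
```

Software that counts, sorts, or moves the cursor by such units rather 
than by individual code points will treat the sequence as a single 
character from the user's point of view.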

    Mike Maxwell


 