[Lexicog] Tone languages in Toolbox/Shoebox
Mike Maxwell
maxwell at LDC.UPENN.EDU
Sat Apr 8 16:15:56 UTC 2006
neduchi at netscape.net wrote:
> My first language, Igbo, is also a tone language. The identified tones
> are High, Low, and Downstep; marked with < ´ >, < `>, and < - >
> respectively. I have not found ANY software that can handle a
> combination of the tone marks with specific alphabets as single units.
> I have explored some options and found out the following:
>
> (1) Unicode:
> The Unicode angle has been closed. It is no longer possible to add such
> combined symbols/characters as 'new' Unicode character.
>
> (2) Fonts and Operating Systems:
> You can use specific Unicode-based fonts, like Doulos, CODE2000,
> Gentium etc. to achieve this character combination. It works fine on
> Windows OS with Service Pack 2; Microsoft programs on the system also
> handle the combined characters as single units. But that's not of much
> help to the linguist: other software not produced by Microsoft does not
> work that well.
>
> (3) Font Solution:
> Since the Unicode angle has been blocked, the buck has actually been
> passed to those involved in "Font Design". It is now the font
> developers that are to design the identified characters as
> "pre-composed" characters. A font like "Doulos" should then contain
> these "pre-composed" characters. That's when Shoebox, Toolbox, or any
> other software can handle them the way we linguists would like it.
>
> On the other hand, since Doulos has now been made available under the
> OpenFont License,
> (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL), maybe
> some font designers might be in the position to add the identified
> characters of African languages as "pre-composed" characters. That
> should be the permanent solution to your problem and MY problem.
I don't have any answer for the original poster's problem, but I am
puzzled by the above discussion of composed vs. non-composed characters.
You are correct that Unicode will probably not be adding pre-composed
characters. That is, while some combinations of diacritic + base
character already exist, combinations beyond those will probably need to
be represented in Unicode as a sequence of two or more characters.
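
To make the distinction concrete, here is a minimal Python sketch using only the standard unicodedata module. The sample strings are illustrative assumptions chosen for this example, not taken from the original poster's data. It shows one combination that has a pre-composed code point and one that can only be written as a sequence:

```python
import unicodedata

def describe(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# "e" + COMBINING GRAVE ACCENT: a pre-composed code point exists (U+00E8).
e_grave = "e\u0300"

# "o" + COMBINING DOT BELOW + COMBINING ACUTE ACCENT: Unicode has a
# pre-composed o-with-dot-below (U+1ECD), but none that also carries the acute.
o_dot_acute = "o\u0323\u0301"

# NFC composes as far as pre-composed code points allow, and no further.
composed_e = unicodedata.normalize("NFC", e_grave)
composed_o = unicodedata.normalize("NFC", o_dot_acute)

print(describe(composed_e))  # ['U+00E8']
print(describe(composed_o))  # ['U+1ECD', 'U+0301']
```

The second result is the case under discussion: the text is still valid Unicode, but it remains a sequence of two code points.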
First: I am surprised that you have had problems with representing the
characters in Igbo as pre-composed characters, but maybe I don't
understand what you're saying. There are certainly code points
(characters) in Unicode for pre-composed vowel + grave accent, vowel +
acute accent, and vowel + macron. Or are you needing to represent these
vowel + accent together with a dot under the vowels (as in Yoruba)?
Second: Unicode-aware software ought to be able to handle
non-precomposed characters just fine. More on this below.
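
As a rough illustration of what "handling" such sequences can mean in software, the following Python sketch treats a base character plus its combining marks as one unit and compares strings after normalization. The function names split_units and same_text are hypothetical, the sample string is arbitrary rather than a real word, and the grouping rule is a simplification of the full Unicode text segmentation rules:

```python
import unicodedata

def split_units(text):
    """Group each base character with the combining marks that follow it.

    Simplified assumption: any character whose general category starts
    with "M" (a mark) is attached to the preceding unit.
    """
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

def same_text(a, b):
    """Compare two strings regardless of composed vs. decomposed form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Arbitrary sample: dotted o + acute, k, dotted u + grave (7 code points).
sample = "o\u0323\u0301ku\u0323\u0300"
print(len(sample))               # 7
print(len(split_units(sample)))  # 3
print(same_text("\u00e8", "e\u0300"))  # True
```

Software that counts, sorts, or searches by such units rather than by raw code points does not need any additional pre-composed characters.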
Third: As for adding pre-composed characters to fonts where the
pre-composed characters are not already in Unicode (I believe that's what
you're suggesting in (3)), that is not likely to happen, and would in my
opinion be a bad idea if it did happen. The reason for the development
of Unicode is to define a universal (well, global) way to represent
characters in writing systems. Adding pre-composed characters to what
was otherwise a Unicode font would be creating a non-standard, which
would cause problems for anyone trying to read your data who didn't
happen to have your font. Likewise if you added pre-composed characters
to some sort of 8-bit encoding (analogous to all the ISO 8859-N
encodings running around, or even worse, to the proprietary and
undocumented Indic fonts, which are truly a mess).
There is one exception to this: Unicode sets aside an area for
user-defined characters (the Private Use Area, PUA). I believe this is
intended for characters that simply cannot be represented by combining
other Unicode characters--for Klingon, say. Again, while font designers
_could_ use this for characters that are not defined as pre-composed in
Unicode, the designers of such fonts are unlikely to do so, because they
don't want to get caught up in creating and supporting non-standards,
and because as far as they are concerned, any software worth its salt
ought to support un-composed characters.
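
One practical consequence is that data relying on Private Use Area code points can be detected mechanically. The sketch below is only an illustration, with private_use_chars as a hypothetical helper name; it relies on the standard unicodedata module, which reports the general category "Co" for private-use code points:

```python
import unicodedata

def private_use_chars(text):
    """List the private-use code points found in text as 'U+XXXX' strings."""
    return [
        f"U+{ord(ch):04X}"
        for ch in text
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]

# U+E000 is the first code point of the Private Use Area in the BMP.
print(private_use_chars("a\ue000b"))  # ['U+E000']
# A base letter plus a combining accent uses no private-use code points.
print(private_use_chars("e\u0300"))   # []
```

A check like this shows whether a file depends on a particular font's private assignments before it is shared with someone who lacks that font.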
So let me come back to the problems you say you're having using Unicode
with non-pre-composed characters. Most Unicode-capable software,
whether produced by Microsoft or not, should be capable of dealing with
non-composed characters. Is there some specific software that doesn't
deal with non-composed characters that you need for linguistic work?
(Besides Toolbox--I'll have to let someone else address how Toolbox
works with this.)
Mike Maxwell