[Lexicog] Tone languages in Toolbox/Shoebox

Mon Apr 10 09:53:47 UTC 2006

Maxwell,
You asked: "Or are you needing to represent these  vowel + accent 
together with a dot under the vowels (as in Yoruba)?"

Yes indeed, that is how it is needed for linguistic work; and (if I 
understand the initial poster correctly) that is also the problem the 
initial poster is having. I think the solutions already available 
within Unicode (i.e. Unicode code points) take care of only a small part 
of the problem, namely the vowels without sub-dots.
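
To make the gap concrete, here is a minimal Python sketch (standard library 
only) that checks which vowel + tone-mark combinations collapse to a single 
pre-composed code point under NFC normalization. The choice of the vowels 
i, o, u and the mapping of tone labels to combining marks are assumptions 
made for illustration, not an official Igbo orthography table.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
# Assumed mapping of tone labels to combining marks, for illustration only.
TONE_MARKS = {
    "acute": "\u0301",
    "grave": "\u0300",
    "macron": "\u0304",
}

def nfc_length(base, *marks):
    """Return how many code points remain after NFC composition."""
    return len(unicodedata.normalize("NFC", base + "".join(marks)))

for vowel in "iou":
    for label, tone in TONE_MARKS.items():
        plain = nfc_length(vowel, tone)
        dotted = nfc_length(vowel, DOT_BELOW, tone)
        print(f"{vowel} + {label}: {plain} code point(s); "
              f"with dot below: {dotted} code point(s)")
```

In this sketch every plain vowel + tone mark composes to one code point, 
while every sub-dotted vowel + tone mark stays at two (the dotted vowel 
plus a separate combining tone mark).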

However, I am actually looking at two issues that converge at the point 
of "font" programs.

The first issue concerns individual standalone programs that 
should handle Igbo texts. I have experimented with different programs: 
Concordance, Corpus Presenter, Wordsmith, ToolBox, Word Sketch 
Engine [WSE] (http://www.sketchengine.co.uk/), etc. Only ToolBox and 
WSE can handle the subdotted vowels without problems; but 
combining subdotted vowels with tone marks is an 
additional burden, and most programs that profess to be Unicode based 
simply handle them as TWO separate characters. In other words, the 
problem STILL persists. WSE is server-based and is not freely 
available. In some programs (that profess to be Unicode based) one 
component or the other gives you question marks for the sub-dotted 
vowels, while another component of the same program renders the 
text as it should. And that is before any tone marks have been added!
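
The "TWO separate characters" behaviour is what you get when a program 
counts code points instead of user-perceived characters. The sketch below 
groups each base letter with the combining marks that follow it. The helper 
name `clusters` and the sample string are made up for illustration, and the 
grouping rule is a simplification (full grapheme segmentation as defined by 
Unicode covers more cases than combining marks).

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        # combining() returns a non-zero class for marks such as U+0323, U+0301
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups

# Made-up sample string: dotted o + acute, k, dotted u + grave.
word = "o\u0323\u0301ku\u0323\u0300"

print(len(word))            # 7 code points
print(len(clusters(word)))  # 3 user-perceived units
```

A program that searches, sorts, or moves the cursor per code point sees 
seven items here; one that works per cluster sees three.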

The second issue is connected with the internet. Netscape, Opera, and 
Firefox can all display Unicode texts that contain no tone marks. But a 
subdotted vowel that is combined with a tone mark is displayed as TWO 
separate characters. That is why some Africans prefer to present texts 
of their language that involve such combined characters either as PDF 
or as images (screenshots).

Each time I ask around or look for a lasting solution, I always 
get the same answer: the solution lies with the font designer; he 
should produce pre-composed characters, for example: "vowel + accent + 
a dot under the vowel".

Is this answer wrong? Or has a Unicode based solution already been 
found?

Chinedu Uchechukwu

-----Original Message-----
From: Mike Maxwell <maxwell at ldc.upenn.edu>
To: lexicographylist at yahoogroups.com
Sent: Sat, 08 Apr 2006 12:15:56 -0400
Subject: Re: [Lexicog]  Tone languages in Toolbox/Shoebox

   neduchi at netscape.net wrote:
 > My first language, Igbo, is also a tone language. The identified tones
 > are High, Low, and Downstep; marked with < ´ >, < ` >, and < - >
 > respectively. I have not found ANY software that can handle a
 > combination of the tone marks with specific letters as single units.
 > I have explored some options and found out the following:
 >
 > (1) Unicode:
 > The Unicode angle has been closed. It is no longer possible to add such
 > combined symbols/characters as 'new' Unicode characters.
 >
 > (2) Fonts and Operating Systems:
 > You can use specific Unicode-based fonts, like Doulos, CODE2000,
 > Gentium etc. to achieve this character combination. It works fine on
 > Windows OS with Service Pack 2; Microsoft programs on the system also
 > handle the combined characters as single units. But that's not of much
 > help to the linguist: other software not produced by Microsoft does not
 > work that well.
 >
 > (3) Font Solution:
 > Since the Unicode angle has been blocked, the buck has actually been
 > passed to those involved in "Font Design". It is now the font
 > developers that are to design the identified characters as
 > "pre-composed" characters. A font like "Doulos" should then contain
 > these "pre-composed" characters. That's when Shoebox, Toolbox, or any
 > other software can handle them the way we linguists would like it.
 >
 > On the other hand, since Doulos has now been made available under the
 > Open Font License
 > (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL), maybe
 > some font designers might be in a position to add the identified
 > characters of African languages as "pre-composed" characters. That
 > should be the permanent solution to your problem and MY problem.

 I don't have any answer for the original poster's problem, but I am
 puzzled by the above discussion of composed vs. non-composed characters.

 You are correct that Unicode will probably not be adding pre-composed
 characters. That is, while some combinations of diacritic + base
 character already exist, combinations beyond those will probably need to
 be represented in Unicode as a sequence of two or more characters.

 First: I am surprised that you have had problems with representing the
 characters in Igbo as pre-composed characters, but maybe I don't
 understand what you're saying. There are certainly code points
 (characters) in Unicode for pre-composed vowel + grave accent, vowel +
 acute accent, and vowel + macron. Or are you needing to represent these
 vowel + accent together with a dot under the vowels (as in Yoruba)?

 Second: Unicode-aware software ought to be able to handle
 non-precomposed characters just fine. More on this below.

 Third: As for adding pre-composed characters to fonts where the
 pre-composed characters are not already in Unicode (I believe that's what
 you're suggesting in (3)), that is not likely to happen, and would in my
 opinion be a bad idea if it did happen. The reason for the development
 of Unicode is to define a universal (well, global) way to represent
 characters in writing systems. Adding pre-composed characters to what
 was otherwise a Unicode font would be creating a non-standard, which
 would cause problems for anyone trying to read your data who didn't
 happen to have your font. Likewise if you added pre-composed characters
 to some sort of 8-bit encoding (analogous to all the ISO 8859-N
 encodings running around, or even worse, to the proprietary and
 undocumented Indic fonts, which are truly a mess).

 There is one exception to this: Unicode sets aside an area for
 user-defined characters (the Private Use Area, PUA). I believe this is
 intended for characters that simply cannot be represented by combining
 other Unicode characters--for Klingon, say. Again, while font designers
 _could_ use this for characters that are not defined as pre-composed in
 Unicode, the designers of such fonts are unlikely to do so, because they
 don't want to get caught up in creating and supporting non-standards,
 and because as far as they are concerned, any software worth its salt
 ought to support un-composed characters.
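
As a rough illustration of the portability concern with the Private Use 
Area, the following Python sketch flags any private-use code points in a 
string, since their appearance depends entirely on which font the reader 
has installed. The function name `private_use_chars` and the sample string 
are hypothetical; the check relies on the Unicode general category "Co" 
(private use).

```python
import unicodedata

def private_use_chars(text):
    """Return (index, code point) pairs for private-use characters in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]

# Hypothetical data: a standard combining sequence, then one PUA code point.
sample = "o\u0323\u0301 \ue000"

print(private_use_chars(sample))  # [(4, 'U+E000')]
```

Text that passes this check with an empty list uses only standard code 
points, so it does not depend on a particular font's private assignments.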

 So let me come back to the problems you say you're having using Unicode
 with non-pre-composed characters. Most Unicode-capable software,
 whether produced by Microsoft or not, should be capable of dealing with
 non-composed characters. Is there some specific software that doesn't
 deal with non-composed characters that you need for linguistic work?
 (Besides Toolbox--I'll have to let someone else address how Toolbox
 works with this.)
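
One way software can deal with non-composed characters when comparing or 
searching text is to normalize both sides first. The Python sketch below 
assumes three different ways the same dotted, tone-marked vowel might be 
entered and shows that they compare equal only after NFC normalization. 
The helper name `same_text` and the variable names are illustrative only.

```python
import unicodedata

def same_text(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

dot_first = "o\u0323\u0301"        # o, dot below, acute
tone_first = "o\u0301\u0323"       # o, acute, dot below
partly_composed = "\u1ecd\u0301"   # pre-composed dotted o, then acute

print(dot_first == tone_first)                 # False: raw code points differ
print(same_text(dot_first, tone_first))        # True
print(same_text(tone_first, partly_composed))  # True
```

Normalization reorders the two marks into a canonical order and composes 
what it can, so the typing order of the dot and the tone mark stops 
mattering for comparison.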

 Mike Maxwell

  --------
 YAHOO! GROUPS LINKS

  *  Visit your group "lexicographylist" on the web.

 *  To unsubscribe from this group, send an email to:
 lexicographylist-unsubscribe at yahoogroups.com

 *  Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.

  --------
