[Lexicog] Tone languages in Toolbox/Shoebox

neduchi at NETSCAPE.NET neduchi at NETSCAPE.NET
Wed Apr 12 10:27:03 UTC 2006


Hi Mike,
I am happy you've noticed my point. I have actually been looking for a
more permanent solution for some time; in my view, the solution should
also address at least one of the points raised by the initial posters:
making wordlists and doing phono-morphological analysis. My
conclusions so far:

(1) Pre-Composed Characters: you have come back to my initial comment 
by saying that:

>>Some of these problems would be solved by using pre-composed chars.<<

(2) Sub-dotted Vowels: you explained that

>>The dot under problem is more difficult, because there are few
>>pre-composed dot-under characters (maybe none, I can't remember), and
>>certainly no pre-composed characters having both the dot under and an
>>acute or grave.

I could not find much in the Unicode char-set. But if you look at the
"Index For Unicode HTML References"
(http://home.att.net/~jameskass/UNI00001.HTM), especially the row that
starts with the number 471 [U+01D7], you can see that 471 is made up of
(1) a vowel, (2) an umlaut, and (3) a tone mark ON TOP of the umlaut.

The row in question involves "combined characters" (or "combining
diacritical marks") in Unicode, which is the kind of solution that
should solve the problem that African languages like Igbo and Yoruba
are presently having. But if you try to realize these characters WITHIN
Unicode in such a manner (for Igbo or Yoruba, for example), you would
be going "against the Standards", as David and Greg explained. Instead,
you have to combine different Unicode code points and be aware of ALL
the IF's that Greg mentions at the bottom of his message
[http://groups.yahoo.com/group/lexicographylist/message/3021].
I think that the pre-composed character solution is simpler and should
also be of immense help to a lot of people working on African
languages. Am I wrong?
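
For what it is worth, the make-up of row 471 can be checked
mechanically. The sketch below is only an illustration: it assumes
Python 3 and its standard unicodedata module, and the helper names
describe and compose are made up here. It lists the parts of U+01D7
and then tries to compose a sub-dotted vowel with a tone mark.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT


def describe(text):
    """Return (code point, Unicode name) pairs for the NFD form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in decomposed]


def compose(base, *marks):
    """Ask the normalizer (NFC) for the most pre-composed form available."""
    return unicodedata.normalize("NFC", base + "".join(marks))


# Row 471 is U+01D7: a vowel, a diaeresis (umlaut) and an acute accent.
parts = describe(chr(471))
for code_point, name in parts:
    print(code_point, name)

# A sub-dotted vowel on its own has a pre-composed form (U+1ECD) ...
dotted = compose("o", DOT_BELOW)
# ... but adding a tone mark leaves the tone mark as a separate code point.
dotted_high = compose("o", DOT_BELOW, ACUTE)
print(len(dotted), len(dotted_high))
```

If this runs as I expect, the dotted vowel alone comes back as one
code point, while the dotted vowel with a tone mark comes back as two,
which is the gap described above.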

(3) The Initial Poster:
The relevant portion of their mail:
<<2. How to make Toolbox disambiguate between for example an "a" with
<<high tone, an "a" with a mid and an "a" with a low tone. So far we
<<can make it "look" right by adding diacritics as secondary characters
<<but this does not help for searching, glossing, ... since Toolbox
<<seems to ignore secondary characters in for example searching.

I think that this also applies to the sub-dotted vowels when combined
with high, low, and downstepped tones on top. For linguistic work
(dictionary making, syntactic or phono-morphological analysis) it is
the tone-marked texts that are used (at least for Igbo, Yoruba, and
other African languages with similar scripts), although texts for
native speakers are not usually tone marked. Having these characters
as pre-composed characters would be an immense help to a lot of
researchers; it would make all sorts of searching (including regular
expressions) and the making of word lists easier. Exchange of data and
internet searching of tone-marked texts in these languages would also
be enhanced, instead of the present situation where people can only
circulate PDFs or screenshots of documents that others must read
through to find what they are looking for. I think the pre-composed
character solution would also lighten the burden for the initial
posters, unless they are of a different opinion.
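
To make the searching problem concrete, here is a minimal sketch of
what a tool would have to do with the characters as they stand today.
It assumes Python 3 and the standard unicodedata module. The function
names, the choice of acute, grave and macron as "the tone marks", and
the sample strings are my own assumptions for illustration; I am not
claiming Toolbox works this way.

```python
import unicodedata

# Assumed set of tone marks: acute, grave and macron (combining forms).
# The dot below (U+0323) is deliberately NOT in this set, because it
# distinguishes different vowels rather than different tones.
TONE_MARKS = {"\u0301", "\u0300", "\u0304"}


def normalize(text):
    """Bring text to one canonical form so equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", text)


def strip_tones(text):
    """Remove the tone marks but keep the sub-dots."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(c for c in decomposed if c not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def find_words(words, query, ignore_tone=False):
    """Return the words matching query, optionally ignoring tone marks."""
    key = strip_tones if ignore_tone else normalize
    target = key(query)
    return [w for w in words if key(w) == target]


# Illustrative strings only, typed with a mix of combining and
# pre-composed characters.
words = ["a\u0301kwa\u0300", "\u00e0kw\u00e1", "o\u0323\u0301ku\u0323\u0301"]

exact = find_words(words, "\u00e1kw\u00e0")
toneless = find_words(words, "akwa", ignore_tone=True)
dotted = find_words(words, "\u1ecdk\u1ee5", ignore_tone=True)
```

The point of the sketch is that every program handling the text has to
take these extra steps; with pre-composed characters a plain string
comparison would already do.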

Maybe the font experts here can raise this issue at the Unicode
conference [http://www.unicode.org/press/pr-iuc30.html], since it
belongs to the session on "Making scripts and languages accessible".

It is my frustration that I am letting out!! In the hope that someone 
might hear and do something.

But I am still interested in the solution found by the initial posters.

Best regards,

Chinedu Uchechukwu


-----Original Message-----
From: Mike Maxwell <maxwell at ldc.upenn.edu>
To: lexicographylist at yahoogroups.com
Sent: Tue, 11 Apr 2006 15:29:56 -0400
Subject: Re: [Lexicog]  Tone languages in Toolbox/Shoebox

 neduchi at netscape.net wrote:
 >> In what sense do these programs handle the dotted vowel + tone mark as
 >> two characters? Are they displaying the tone marks to the right of the
 >> vowel, or is the problem s.t. more subtle than this?
 >
 >
 > Please have a look at the link below. It is a sample of an arbitrarily
 > tone-marked Igbo text I put together with Andrew Cunningham. You can
 > switch between the three different fonts used: Arial Unicode MS,
 > CODE2000 and Doulos SIL. Try out the fonts and observe the location of
 > the sub-dots and the tone marks:
 > http://www.openroad.net.au/languages/african/igbo/sample.html
 >
 > I would like to see the sub-dotted and tone-marked characters
 > 'compactly' displayed with the tone marks as ONE composite whole and
 > not as two or three estranged neighbours.

 OK, now I'm beginning to understand the problem you're seeing!

 Yes, this appears to be a rendering problem, not a Unicode problem per
 se. That is to say, either there's a problem with the font, or with the
 technology that displays the font (I'm not sure which).

 Let me summarize the rendering issues I see, and let me know if I'm
 missing s.t.

 First, the accent is much too low over upper case vowels. It's also too
 far to the left over the lower and upper case 'i/I' (these appear in the
 sample paragraph, but not in the list of sample characters). Also, the
 dot under the upper case 'U' is too far to the right (both in the
 undotted U in the para, and the dotted U in the sample chars), and the
 dot under the lower case 'i' is much too far to the left (in fact,
 almost under the preceding letter).

 Also, the upper case N with grave (U+01F8) shows up as a box in many
 apps (it looks OK in Firefox).

 (I also see a dot _over_ n/N in the sample chars--is that correct?)

 Some of these problems would be solved by using pre-composed chars.
 (That is, many of the chars in the sample para appear to be in NFD
 normalization, rather than NFC.) For example, the grave vowels without
 dots would probably look just fine if they used the pre-composed
 equivalents. (If you are going to use a decomposed character, the grave
 accented 'i' should probably be produced with the dotless-i, U+0131.
 This unfortunately doesn't solve the problem of the grave accent being
 too far to the left.)
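
The NFD-to-NFC conversion mentioned above can be tried out with a
short sketch. It assumes Python 3 and its standard unicodedata module;
the helper names and the sample string are made up for illustration.
It converts a decomposed string to NFC, reports which combining marks
are still separate afterwards, and also checks what happens to the
dotless-i suggestion under normalization.

```python
import unicodedata


def to_nfc(text):
    """Compose base letters and combining marks wherever Unicode allows it."""
    return unicodedata.normalize("NFC", text)


def leftover_marks(text):
    """List the combining marks that remain separate code points after NFC."""
    return [f"U+{ord(c):04X}" for c in to_nfc(text) if unicodedata.combining(c)]


# Decomposed input: i + grave, N + grave, u + dot below + grave.
decomposed = "i\u0300 N\u0300 u\u0323\u0300"
composed = to_nfc(decomposed)

# The grave i and the grave N each become one code point; only the
# grave on the dotted u stays separate.
print([f"U+{ord(c):04X}" for c in composed])
print(leftover_marks(decomposed))

# Dotless i (U+0131) plus grave has no pre-composed form, so it does
# not normalize to the ordinary pre-composed grave i (U+00EC).
dotless = to_nfc("\u0131\u0300")
print(dotless == "\u00ec")
```

One consequence worth keeping in mind: text typed with the dotless-i
would not compare equal to text typed with the ordinary grave i, so
that workaround may help the display but could hurt searching.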

 The dot under problem is more difficult, because there are few
 pre-composed dot-under characters (maybe none, I can't remember), and
 certainly no pre-composed characters having both the dot under and an
 acute or grave. But the fact that the dots on these characters don't
 show up in the right position is a font/rendering issue, which hopefully
 will get fixed. FWIW, the problem is noted at the wikipedia page
 (http://en.wikipedia.org/wiki/UniCode#Ready-made_versus_composite_characters).
 Of course that's no help right now...

 In sum, this appears to me to be a rendering issue, not a Unicode issue
 per se. It also appears to be a somewhat different question than the
 original posters brought up, who I believe were asking for tools to do
 phonology and/or morphology.

 Mike Maxwell





