[Lexicog] Tone languages in Toolbox/Shoebox

neduchi at NETSCAPE.NET neduchi at NETSCAPE.NET
Tue Apr 11 09:58:15 UTC 2006


>In what sense do these programs handle the dotted vowel + tone mark as
>two characters? Are they displaying the tone marks to the right of the
>vowel, or is the problem s.t. more subtle than this?


Please have a look at the link below. It is a sample of an arbitrarily 
tone-marked Igbo text I put together with Andrew Cunningham. You can 
switch between the three different fonts used:  Arial Unicode MS, 
CODE2000 and Doulos SIL. Try out the fonts and observe the location of 
the sub-dots and the tone marks:
http://www.openroad.net.au/languages/african/igbo/sample.html

I would like to see the sub-dotted and tone-marked characters 
'compactly' displayed with the tone marks as ONE composite whole and 
not as two or three estranged neighbours.

As I have been made to understand (and as David has also further 
confirmed in his last mail), the problem lies with the font 
(developers):
"What they should do is design the fonts to 'know how to' correctly 
combine the special "combining characters" with the preceding 
characters."
[http://groups.yahoo.com/group/lexicographylist/message/3016]

I prefer such a solution, because it should then work well with any 
Unicode-aware software. It surely would take care of the 
sorting/searching issues also raised by the initial poster. But I do 
NOT know how to achieve it!

To Sophie and Stuart, I look forward to seeing your solutions to the 
problem.
Chinedu Uchechukwu


-----Original Message-----
From: Mike Maxwell <maxwell at ldc.upenn.edu>
To: lexicographylist at yahoogroups.com
Sent: Mon, 10 Apr 2006 20:04:30 -0400
Subject: Re: [Lexicog]  Tone languages in Toolbox/Shoebox

   neduchi at netscape.net wrote:
 > Maxwell,
 > You asked: "Or are you needing to represent these vowel + accent
 > together with a dot under the vowels (as in Yoruba)?"
 >
 > Yes indeed, that is how it is needed for linguistic works;

 In that case, I happen to know that these can be represented in Unicode
 in Toolbox, using the Arial Unicode font from Microsoft. We did this
 for Yoruba (although I think we only tested it with the mid vowels,
 since those are IIRC the only dotted vowels Yoruba has).

 The dotted accented vowels cannot be represented as single pre-composed
 characters, but as I said in my previous msg, that should not be
 necessary. It happens that the partially composed forms, in which the
 dot + vowel are composed and the accent is added afterwards--or was it
 vice versa?--don't look very nice, because the accent (or dot, if it's
 vice versa) doesn't center over the composite character for some reason,
 presumably because the composite character has an incorrect width in
 Arial Unicode. But the completely decomposed characters (plain vowel +
 dot + accent) appear just fine.
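
The difference between the fully decomposed and the partially composed forms can be made concrete with a minimal Python sketch. It uses the standard `unicodedata` module; the choice of the letter 'o' and of the acute accent is only an illustration, and the variable names are hypothetical. It shows that all three ways of typing the same dotted, accented vowel normalize to the same sequence, and that no single precomposed character exists for it.

```python
import unicodedata

# Three ways of typing "o with dot below and acute accent".
fully_decomposed = "o\u0323\u0301"   # o + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
dot_precomposed = "\u1ecd\u0301"     # precomposed o-with-dot-below + COMBINING ACUTE ACCENT
acute_precomposed = "\u00f3\u0323"   # precomposed o-with-acute + COMBINING DOT BELOW


def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


forms = [fully_decomposed, dot_precomposed, acute_precomposed]

# NFD splits everything into base letter + combining marks in canonical order.
nfd_forms = [unicodedata.normalize("NFD", s) for s in forms]
# NFC composes as far as Unicode allows; here the accent stays a separate mark.
nfc_forms = [unicodedata.normalize("NFC", s) for s in forms]

for original, nfd, nfc in zip(forms, nfd_forms, nfc_forms):
    print(code_points(original), "NFD:", code_points(nfd), "NFC:", code_points(nfc))
```

Normalizing text to one form before storing or comparing it removes the dependence on how the character was typed. How well the marks are stacked on screen is still up to the font and the rendering software.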

 > ...and (if I
 > understand the initial poster correctly) that is also the problem the
 > initial poster is having.

 I looked at the original msg, and I'm not sure whether their problems in
 fact had much to do with Unicode and tone. Rather, the posters wanted
 to define phonological rules that would attach tone to tone bearing
 units. I may be mistaken, but I don't think Toolbox has the capability
 of applying phonological rules, tonal or otherwise. For that, you need
 a more sophisticated program. Andy Black replied to their msg with
 information about his tone parsing program, which has some built-in
 smarts about tonal phonology. Another program that could be used is
 Xerox's finite state toolkit, available on a CD included with the book
 by Ken Beesley and Lauri Karttunen, published by U of Chicago Press.
 (The version on the CD doesn't handle Unicode, but Lauri can provide
 licensed users with a later version of the software that does handle
 Unicode.) The Xerox toolkit requires considerable sophistication to use
 "out of the box", but it will do almost any phonological or
 morphological task you ask of it, if you ask nicely :-).

 As for the more general problem of composed characters: any
 linguistically aware program needs to be able to deal with 'characters'
 (or phonemes represented as characters) that may contain more than one
 character, wherever that makes a difference. I'll take an example from
 English, realizing that English has such a horrible orthography that
 it's difficult to make any linguistic point. But we all read it, so:
 'ch' and 'sh' each represent (in most words) single phonemes (an
 affricate and a sibilant respectively).

 Now where would this make a difference? Well, suppose you wanted to
 treat 'ch' and 'sh' as distinct letters in the alphabet for purposes of
 sorting, i.e. the alphabet looked like
 a b c ch d e... s sh t...
 Then a linguistically aware program should be able to handle sorting
 like that. A linguistically aware program *should* also be able to
 handle this situation in searching. That is, if you search for all
 words with the letter 'c', the search should *not* return words with the
 letter 'ch' (unless they also contain the letter 'c', of course).
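
The sorting and searching behaviour described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the alphabet, the word list, and the helper names (`letters`, `sort_key`, `contains_letter`) are all hypothetical, input is assumed to be lower case, and every 'ch' or 'sh' sequence is treated as a digraph.

```python
# Hypothetical alphabet: a b c ch d e ... s sh t ...
ALPHABET = []
for ch in "abcdefghijklmnopqrstuvwxyz":
    ALPHABET.append(ch)
    if ch in ("c", "s"):
        ALPHABET.append(ch + "h")   # 'ch' follows 'c', 'sh' follows 's'

RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def letters(word):
    """Split a word into letters, preferring a digraph over a single letter."""
    result, i = [], 0
    while i < len(word):
        if word[i:i + 2] in RANK:        # only 'ch' and 'sh' are two-character keys
            result.append(word[i:i + 2])
            i += 2
        else:
            result.append(word[i])
            i += 1
    return result


def sort_key(word):
    """Rank each letter by its position in ALPHABET; unknown symbols sort last."""
    return [RANK.get(letter, len(ALPHABET)) for letter in letters(word)]


def contains_letter(word, letter):
    """True if the word contains the letter as a unit, so 'c' does not match 'ch'."""
    return letter in letters(word)


words = ["cut", "chat", "dog", "cat", "shop", "sun", "tick"]
sorted_words = sorted(words, key=sort_key)
words_with_c = [w for w in sorted_words if contains_letter(w, "c")]

print(sorted_words)   # ['cat', 'cut', 'chat', 'dog', 'sun', 'shop', 'tick']
print(words_with_c)   # ['cat', 'cut', 'tick']
```

A real tool would need a way to mark the exceptions, i.e. words in which 'c' + 'h' or 's' + 'h' are two separate letters. The same approach carries over to a vowel followed by combining marks: the base vowel plus its marks is treated as one unit for ranking and matching.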

 Any linguistic program has to be able to deal with the fact that these
 digraphs represent single characters in some sense. So I'm not sure why
 the decomposed representation of a vowel + accent should cause a problem
 for sorting; you just need to tell the program what to do (sort digraphs
 with monographs, and accented vowels together with unaccented, or not).

 At the same time, you should be able to ignore certain things for
 searching, although this is more likely to show up with tone than with
 digraphs like 'ch' and 'sh'. That is, you should be able to search for
 words containing 'a' regardless of whether there is an accent, if you
 want to do so. I'll have to leave it to someone more familiar with
 Toolbox than me to answer that, although I'd be very surprised if there
 weren't some way to do this.
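
Outside Toolbox, a tone-insensitive search of this kind could be sketched in Python as below. This is not a description of how Toolbox does it. The set of tone marks (grave, acute, macron) is an assumption, the sample strings are purely illustrative, and the function names `strip_tone` and `matches` are hypothetical. The dot below is deliberately kept, since it distinguishes different vowels rather than tones.

```python
import unicodedata

# Assumed set of tone marks: combining grave, acute and macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def strip_tone(text):
    """Remove tone marks but keep other diacritics such as the dot below."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def matches(word, query, ignore_tone=True):
    """Substring search that can optionally ignore tone marks on both sides."""
    if ignore_tone:
        word, query = strip_tone(word), strip_tone(query)
    else:
        word = unicodedata.normalize("NFC", word)
        query = unicodedata.normalize("NFC", query)
    return query in word


sample_a = "\u00e1kw\u00e0"                  # a-acute, k, w, a-grave
sample_b = "o\u0323\u0301ku\u0323\u0300"     # dotted o + acute, k, dotted u + grave

print(matches(sample_a, "a"))                     # True: tone ignored
print(matches(sample_a, "a", ignore_tone=False))  # False: only toned a's present
print(matches(sample_b, "o"))                     # False: dotted o is a different vowel
```

Normalizing both the text and the query first means the search gives the same answer whether the data was typed precomposed or decomposed.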

 > ..a combination of subdotted vowels with the tone marks is simply an
 > additional burden and most programs that profess to be Unicode based
 > simply handle them as TWO separate characters...
 >
 > The second issue is connected with the internet. Netscape, Opera,
 > Firefox can all display Unicode texts without tone marks. But a
 > subdotted vowel that is combined with a tone mark is displayed as TWO
 > separate characters.

 In what sense do these programs handle the dotted vowel + tone mark as
 two characters? Are they displaying the tone marks to the right of the
 vowel, or is the problem s.t. more subtle than this?

 Mike Maxwell

  --------
 YAHOO! GROUPS LINKS

  *  Visit your group "lexicographylist" on the web.

 *  To unsubscribe from this group, send an email to:
 lexicographylist-unsubscribe at yahoogroups.com

 *  Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.

  --------



