[Lexicog] Tone languages in Toolbox/Shoebox

Mike Maxwell maxwell at LDC.UPENN.EDU
Tue Apr 11 00:04:30 UTC 2006


neduchi at netscape.net wrote:
> Maxwell,
> You asked: "Or are you needing to represent these  vowel + accent 
> together with a dot under the vowels (as in Yoruba)?"
> 
> Yes indeed, that is how it is needed for linguistic works; 

In that case, I happen to know that these can be represented in Unicode 
in Toolbox, using the Arial Unicode MS font from Microsoft.  We did this 
for Yoruba (although I think we only tested it with the mid vowels, 
since those are IIRC the only dotted vowels Yoruba has).

The dotted accented vowels cannot be represented as single pre-composed 
characters, but as I said in my previous msg, that should not be 
necessary.  The partially composed forms, in which the dot + vowel are 
composed and the accent is added afterwards (or was it vice versa?), 
don't look very nice: the accent (or dot, if it's vice versa) doesn't 
center over the composite character, presumably because the composite 
character has an incorrect width in Arial Unicode.  But the completely 
decomposed characters (plain vowel + dot + accent) appear just fine.
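
To make the three representations concrete, here is a minimal sketch in 
Python using the standard unicodedata module.  It is only an 
illustration of the code point sequences involved, not anything Toolbox 
itself does; the choice of 'e' with a dot below and an acute accent is 
an assumed example.

```python
import unicodedata

# Completely decomposed form: plain vowel + dot below + acute accent.
decomposed = "e\u0323\u0301"

# NFC composes as far as Unicode allows. There is no single code point
# for e + dot below + acute, so the acute stays a separate combining mark.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", nfc)

def names(text):
    """Return the Unicode character name of every code point in text."""
    return [unicodedata.name(ch) for ch in text]

print(len(nfc), names(nfc))  # partially composed: dotted vowel + accent
print(len(nfd), names(nfd))  # completely decomposed: vowel + dot + accent

# The other partially composed order (accented vowel + dot below) is
# canonically equivalent: it decomposes to the same sequence.
other = "\u00e9\u0323"
print(unicodedata.normalize("NFD", other) == decomposed)
```

All three spellings are canonically equivalent, so a program that 
normalizes its input before comparing strings will treat them as the 
same text, whatever the font does with them.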

> ...and (if I 
> understand the initial poster correctly) that is also the problem the 
> initial poster is having. 

I looked at the original msg, and I'm not sure whether their problems in 
fact had much to do with Unicode and tone.  Rather, the posters wanted 
to define phonological rules that would attach tone to tone bearing 
units.  I may be mistaken, but I don't think Toolbox has the capability 
of applying phonological rules, tonal or otherwise.  For that, you need 
a more sophisticated program.  Andy Black replied to their msg with 
information about his tone parsing program, which has some built-in 
smarts about tonal phonology.  Another program that could be used is 
Xerox's finite state toolkit, available on a CD included with the book 
"Finite State Morphology" by Ken Beesley and Lauri Karttunen (CSLI 
Publications, distributed by U of Chicago Press). 
(The version on the CD doesn't handle Unicode, but Lauri can provide 
licensed users with a later version of the software that does handle 
Unicode.)  The Xerox toolkit requires considerable sophistication to use 
"out of the box", but it will do almost any phonological or 
morphological task you ask of it, if you ask nicely :-).

As for the more general problem of composed characters: any 
linguistically aware program needs to be able to deal with 'characters' 
(or phonemes represented as characters) that may contain more than one 
character, wherever that makes a difference.  I'll take an example from 
English, realizing that English has such a horrible orthography that 
it's difficult to make any linguistic point.  But we all read it, so: 
'ch' and 'sh' each represent (in most words) single phonemes (an 
affricate and a sibilant respectively).

Now where would this make a difference?  Well, suppose you wanted to 
treat 'ch' and 'sh' as distinct letters in the alphabet for purposes of 
sorting, i.e. the alphabet looked like
   a b c ch d e... s sh t...
Then a linguistically aware program should be able to handle sorting 
like that.  A linguistically aware program *should* also be able to 
handle this situation in searching.  That is, if you search for all 
words with the letter 'c', the search should *not* return words with the 
letter 'ch' (unless they also contain the letter 'c', of course).
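
The sorting and searching behavior described above can be sketched in a 
few lines of Python.  This is a toy illustration, not how Toolbox or any 
particular program implements it; the alphabet, the function names, and 
the word list are all made up for the example.

```python
# Hypothetical alphabet in which 'ch' and 'sh' are letters in their own right.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "m", "n", "o", "p", "q", "r", "s", "sh", "t", "u", "v",
            "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
# Try longer letters first so that 'ch' wins over 'c' followed by 'h'.
BY_LENGTH = sorted(ALPHABET, key=len, reverse=True)

def letters(word):
    """Split a word into letters of ALPHABET, longest match first."""
    out, i = [], 0
    while i < len(word):
        for letter in BY_LENGTH:
            if word.startswith(letter, i):
                out.append(letter)
                i += len(letter)
                break
        else:
            raise ValueError(f"no letter matches at {word[i:]!r}")
    return out

def sort_key(word):
    """Sort by position in ALPHABET, letter by letter."""
    return [RANK[letter] for letter in letters(word)]

def contains_letter(word, letter):
    """True if the word contains the given letter as a whole letter."""
    return letter in letters(word)

words = ["cot", "chat", "cut", "dot", "shed", "sun", "tin", "chic"]
print(sorted(words, key=sort_key))
print([w for w in words if contains_letter(w, "c")])
```

With this alphabet 'cut' sorts before 'chat' and 'sun' before 'shed', 
and a search for the letter 'c' finds 'cot', 'cut' and 'chic' but not 
'chat'.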

Any linguistic program has to be able to deal with the fact that these 
digraphs represent single characters in some sense.  So I'm not sure why 
the decomposed representation of a vowel + accent should cause a problem 
for sorting; you just need to tell the program what to do (sort digraphs 
with monographs, and accented vowels together with unaccented, or not).

At the same time, you should be able to ignore certain things for 
searching, although this is more likely to show up with tone than with 
digraphs like 'ch' and 'sh'.  That is, you should be able to search for 
words containing 'a' regardless of whether there is an accent, if you 
want to do so.  I'll have to leave it to someone more familiar with 
Toolbox than I am to say how it's done there, although I'd be very 
surprised if there weren't some way to do this.
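
As a rough illustration of a search that ignores tone, here is a 
minimal Python sketch.  It says nothing about how Toolbox does it.  The 
set of tone marks (combining grave, acute and macron) is an assumption 
for the example, and the function names are invented.  Note that the 
dot below is deliberately kept, since it distinguishes vowels rather 
than tones.

```python
import unicodedata

# Assumed tone diacritics: combining grave, acute, and macron.
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}

def strip_tones(text):
    """Decompose text and drop tone marks, keeping marks like the dot below."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if ch not in TONE_MARKS)

def clusters(text):
    """Group each base character with the non-tone marks that follow it."""
    out = []
    for ch in strip_tones(text):
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # attach the mark to the preceding base character
        else:
            out.append(ch)
    return out

def contains_ignoring_tone(word, letter):
    """True if the word contains the letter, whatever tone it carries."""
    return strip_tones(letter) in clusters(word)

# 'ba' with acute (precomposed), 'ba' with grave (decomposed),
# 'b' + dotted e with acute, and plain 'bo'.
words = ["b\u00e1", "ba\u0300", "b\u1eb9\u0301", "bo"]
print([contains_ignoring_tone(w, "a") for w in words])
print([contains_ignoring_tone(w, "e") for w in words])
print([contains_ignoring_tone(w, "\u1eb9") for w in words])
```

Comparing whole clusters rather than raw substrings is what keeps a 
search for plain 'e' from matching the dotted vowel.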

> ..a combination of subdotted vowels with the tone marks is simply an 
> additional burden and most programs that profess to be Unicode based 
> simply handle them as TWO separate characters...
> 
> The second issue is connected with the internet. Netscape, Opera, 
> Firefox can all display Unicode texts without tone marks. But a 
> subdotted vowel that is combined with a tone mark is displayed as TWO 
> separate characters. 

In what sense do these programs handle the dotted vowel + tone mark as 
two characters?  Are they displaying the tone marks to the right of the 
vowel, or is the problem something more subtle than this?

    Mike Maxwell


 