[Lexicog] Tone languages in Toolbox/Shoebox
Mike Maxwell
maxwell at LDC.UPENN.EDU
Tue Apr 11 00:04:30 UTC 2006
neduchi at netscape.net wrote:
> Maxwell,
> You asked: "Or are you needing to represent these vowel + accent
> together with a dot under the vowels (as in Yoruba)?"
>
> Yes indeed, that is how it is needed for linguistic works;
In that case, I happen to know that these can be represented in Unicode
in Toolbox, using the Arial Unicode font from Microsoft. We did this
for Yoruba (although I think we only tested it with the mid vowels,
since those are IIRC the only dotted vowels Yoruba has).
The dotted accented vowels cannot be represented as single pre-composed
characters, but as I said in my previous msg, that should not be
necessary. It happens that the partially composed forms, in which the
dot + vowel are composed and the accent is added afterwards (or was it
vice versa?), don't look very nice: the accent (or the dot, if it's
vice versa) doesn't center over the composite character, presumably
because the composite character has an incorrect width in Arial
Unicode. But the completely decomposed sequences (plain vowel + dot +
accent) display just fine.
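The three spellings discussed above can be compared with a short Python
sketch. This is only an illustration under stated assumptions: it uses
Python's standard unicodedata module (nothing from Toolbox), takes "e"
with a dot below and an acute accent as the example vowel, and all
variable and function names are made up for the purpose.

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW
ACUTE = "\u0301"       # COMBINING ACUTE ACCENT

# Three ways to spell "e with dot below and acute accent".
fully_decomposed = "e" + DOT_BELOW + ACUTE    # plain vowel + dot + accent
dot_precomposed = "\u1EB9" + ACUTE            # (e with dot below) + accent
accent_precomposed = "\u00E9" + DOT_BELOW     # (e with acute) + dot

def codepoints(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]

spellings = [fully_decomposed, dot_precomposed, accent_precomposed]

# NFD maps every spelling onto the same fully decomposed sequence.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# NFC composes as far as Unicode allows. No single code point exists for
# this combination, so one combining mark is always left over.
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}

for s in spellings:
    print(codepoints(s), "->", codepoints(unicodedata.normalize("NFD", s)))
```

All three spellings are canonically equivalent, so normalizing the data
to one form (NFD, if the fully decomposed sequence is the one the font
draws well) before display or comparison should remove the "or was it
vice versa" question.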
> ...and (if I
> understand the initial poster correctly) that is also the problem the
> initial poster is having.
I looked at the original msg, and I'm not sure whether their problems in
fact had much to do with Unicode and tone. Rather, the posters wanted
to define phonological rules that would attach tone to tone-bearing
units. I may be mistaken, but I don't think Toolbox has the capability
of applying phonological rules, tonal or otherwise. For that, you need
a more sophisticated program. Andy Black replied to their msg with
information about his tone parsing program, which has some built-in
smarts about tonal phonology. Another program that could be used is
Xerox's finite-state toolkit, available on a CD included with the book
by Ken Beesley and Lauri Karttunen, published by U of Chicago Press.
(The version on the CD doesn't handle Unicode, but Lauri can provide
licensed users with a later version of the software that does handle
Unicode.) The Xerox toolkit requires considerable sophistication to use
"out of the box", but it will do almost any phonological or
morphological task you ask of it, if you ask nicely :-).
As for the more general problem of composed characters: any
linguistically aware program needs to be able to deal with 'characters'
(or phonemes represented as characters) that may contain more than one
character, wherever that makes a difference. I'll take an example from
English, realizing that English has such a horrible orthography that
it's difficult to make any linguistic point. But we all read it, so:
'ch' and 'sh' each represent (in most words) single phonemes (an
affricate and a fricative respectively).
Now where would this make a difference? Well, suppose you wanted to
treat 'ch' and 'sh' as distinct letters in the alphabet for purposes of
sorting, i.e. the alphabet looked like
a b c ch d e... s sh t...
Then a linguistically aware program should be able to handle sorting
like that. A linguistically aware program *should* also be able to
handle this situation in searching. That is, if you search for all
words with the letter 'c', the search should *not* return words with the
letter 'ch' (unless they also contain the letter 'c', of course).
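As a rough sketch of what such digraph-aware sorting and searching could
look like, here is a small Python example. It is not how Toolbox or any
other program does it; the alphabet, the word list, and the function
names are all hypothetical, and the longest-match rule is an assumption
that would need exceptions in a real orthography.

```python
# Hypothetical alphabet in which 'ch' and 'sh' are letters of their own.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "m", "n", "o", "p", "q", "r", "s", "sh", "t", "u", "v",
            "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}
# Try longer letters first so that 'ch' wins over 'c' followed by 'h'.
BY_LENGTH = sorted(ALPHABET, key=len, reverse=True)

def letters(word):
    """Split a word into letters of ALPHABET, longest match first."""
    result = []
    i = 0
    while i < len(word):
        for letter in BY_LENGTH:
            if word.startswith(letter, i):
                result.append(letter)
                i += len(letter)
                break
        else:
            raise ValueError("no letter matches at %r" % word[i:])
    return result

def sort_key(word):
    """Sort by position in ALPHABET, letter by letter."""
    return [RANK[letter] for letter in letters(word)]

def contains_letter(word, letter):
    """True if the word contains the given letter as a unit."""
    return letter in letters(word)

words = ["cut", "chat", "dog", "cot", "shop", "sun", "top", "church",
         "scratch"]
print(sorted(words, key=sort_key))
print([w for w in words if contains_letter(w, "c")])
```

With this alphabet, "chat" and "church" sort after every word beginning
with plain 'c', and a search for 'c' finds "cut", "cot" and "scratch"
but not "chat" or "church". The greedy split would wrongly treat the
's' + 'h' in a word like "mishap" as the letter 'sh', which is the kind
of case a real program would have to let the user mark.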
Any linguistic program has to be able to deal with the fact that these
digraphs represent single characters in some sense. So I'm not sure why
the decomposed representation of a vowel + accent should cause a problem
for sorting; you just need to tell the program what to do (sort digraphs
with monographs, and accented vowels together with unaccented, or not).
At the same time, you should be able to ignore certain things for
searching, although this is more likely to show up with tone than with
digraphs like 'ch' and 'sh'. That is, you should be able to search for
words containing 'a' regardless of whether there is an accent, if you
want to do so. I'll have to leave it to someone more familiar with
Toolbox than I am to say how that is done there, although I'd be very
surprised if there weren't some way to do it.
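Outside Toolbox, the idea of a search that can either respect or ignore
accents might be sketched in Python as follows. This is an illustration
only: it relies on the standard unicodedata module, the sample words are
invented strings rather than real data, and the function names are
hypothetical.

```python
import unicodedata

def strip_marks(text):
    """Decompose text and drop characters with a nonzero combining class
    (this covers the accents and underdots discussed here)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clusters(text):
    """Split text into base characters with their combining marks attached."""
    result = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and result:
            result[-1] += ch
        else:
            result.append(ch)
    return result

def has_vowel(word, vowel, ignore_marks=False):
    """True if the word contains the vowel as a whole unit.

    With ignore_marks=True, accents and dots are disregarded on both
    sides; otherwise the marks must match exactly.
    """
    units = clusters(word)
    target = unicodedata.normalize("NFD", vowel)
    if ignore_marks:
        units = [strip_marks(u) for u in units]
        target = strip_marks(target)
    return target in units

# Invented sample words: plain a, a-acute, a-grave, plain o.
words = ["ba", "b\u00E1", "b\u00E0", "bo"]
any_a = [w for w in words if has_vowel(w, "a", ignore_marks=True)]
plain_a = [w for w in words if has_vowel(w, "a")]
acute_a = [w for w in words if has_vowel(w, "a\u0301")]
print(any_a, plain_a, acute_a)
```

Comparing whole clusters rather than raw substrings is what keeps a
search for plain 'a' from matching the 'a' inside an accented vowel once
the text has been decomposed.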
> ..a combination of subdotted vowels with the tone marks is simply an
> additional burden and most programs that profess to be Unicode based
> simply handle them as TWO separate characters...
>
> The second issue is connected with the internet. Netscape, Opera,
> Firefox can all display Unicode texts without tone marks. But a
> subdotted vowel that is combined with a tone mark is displayed as TWO
> separate characters.
In what sense do these programs handle the dotted vowel + tone mark as
two characters? Are they displaying the tone marks to the right of the
vowel, or is the problem something more subtle than this?
Mike Maxwell