[Lexicog] Tone languages in Toolbox/Shoebox
neduchi at NETSCAPE.NET
Tue Apr 11 09:58:15 UTC 2006
>In what sense do these programs handle the dotted vowel + tone mark as
>two characters? Are they displaying the tone marks to the right of the
>vowel, or is the problem s.t. more subtle than this?
Please have a look at the link below. It is a sample of an arbitrarily
tone-marked Igbo text I put together with Andrew Cunningham. You can
switch between the three different fonts used: Arial Unicode MS,
CODE2000 and Doulos SIL. Try out the fonts and observe the location of
the sub-dots and the tone marks:
http://www.openroad.net.au/languages/african/igbo/sample.html
I would like to see the sub-dotted and tone-marked characters
'compactly' displayed with the tone marks as ONE composite whole and
not as two or three estranged neighbours.
As I have been made to understand (and as David has also further
confirmed in his last mail), the problem lies with the font
(developers):
"What they should do is design the fonts to 'know how to' correctly
combine the special "combining characters" with the preceding
characters."
[http://groups.yahoo.com/group/lexicographylist/message/3016]
I prefer such a solution, because it should then work well with any
Unicode-aware software. It surely would take care of the
sorting/searching issues also raised by the initial poster. But I do
NOT know how to achieve it!
To Sophie and Stuart, I look forward to seeing your solutions to the
problem.
Chinedu Uchechukwu
-----Original Message-----
From: Mike Maxwell <maxwell at ldc.upenn.edu>
To: lexicographylist at yahoogroups.com
Sent: Mon, 10 Apr 2006 20:04:30 -0400
Subject: Re: [Lexicog] Tone languages in Toolbox/Shoebox
neduchi at netscape.net wrote:
> Maxwell,
> You asked: "Or are you needing to represent these vowel + accent
> together with a dot under the vowels (as in Yoruba)?"
>
> Yes indeed, that is how it is needed for linguistic works;
In that case, I happen to know that these can be represented in Unicode
in Toolbox, using the Arial Unicode font from Microsoft. We did this
for Yoruba (although I think we only tested it with the mid vowels,
since those are IIRC the only dotted vowels Yoruba has).
The dotted accented vowels cannot be represented as single pre-composed
characters, but as I said in my previous msg, that should not be
necessary. It happens that the partially composed forms, in which the
dot + vowel are composed and the accent is added afterwards--or was it
vice versa?--don't look very nice, because the accent (or dot, if it's
vice versa) doesn't center over the composite character for some reason,
presumably because the composite character has an incorrect width in
Arial Unicode. But the completely decomposed characters (plain vowel +
dot + accent) appear just fine.
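
To make the three spellings concrete, here is a minimal sketch using
Python's standard unicodedata module. Python is an assumption made for
illustration only; nothing here is part of Toolbox. It shows that Unicode
has a pre-composed dotted o (U+1ECD) but no pre-composed dotted o with
acute, so normalization leaves the accent as a separate combining
character whichever order the marks were entered in.

```python
import unicodedata

# Fully decomposed: plain vowel + combining dot below + combining acute.
decomposed = "o\u0323\u0301"
# Partially composed, dot first: pre-composed dotted o (U+1ECD) + acute.
dot_composed = "\u1ecd\u0301"
# Partially composed, accent first: pre-composed o-acute (U+00F3) + dot below.
accent_composed = "\u00f3\u0323"


def nfd(text):
    """Return the canonically decomposed form (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def nfc(text):
    """Return the canonically composed form (NFC) of text."""
    return unicodedata.normalize("NFC", text)


def code_points(text):
    """List the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


for label, s in [("decomposed", decomposed),
                 ("dot composed", dot_composed),
                 ("accent composed", accent_composed)]:
    print(label, code_points(s), "NFD:", code_points(nfd(s)),
          "NFC:", code_points(nfc(s)))
```

All three spellings normalize to the same sequence, so normalizing the
data to one form before sorting or searching makes them compare equal.
How a font then draws that sequence is a separate matter.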
> ...and (if I understand the initial poster correctly) that is also the
> problem the initial poster is having.
I looked at the original msg, and I'm not sure whether their problems in
fact had much to do with Unicode and tone. Rather, the posters wanted
to define phonological rules that would attach tone to tone bearing
units. I may be mistaken, but I don't think Toolbox has the capability
of applying phonological rules, tonal or otherwise. For that, you need
a more sophisticated program. Andy Black replied to their msg with
information about his tone parsing program, which has some built-in
smarts about tonal phonology. Another program that could be used is
Xerox's finite state toolkit, available on a CD included with the book
by Ken Beesley and Lauri Karttunen, published by U of Chicago Press.
(The version on the CD doesn't handle Unicode, but Lauri can provide
licensed users with a later version of the software that does handle
Unicode.) The Xerox toolkit requires considerable sophistication to use
"out of the box", but it will do almost any phonological or
morphological task you ask of it, if you ask nicely :-).
As for the more general problem of composed characters: any
linguistically aware program needs to be able to deal with 'characters'
(or phonemes represented as characters) that may contain more than one
character, wherever that makes a difference. I'll take an example from
English, realizing that English has such a horrible orthography that
it's difficult to make any linguistic point. But we all read it, so:
'ch' and 'sh' each represent (in most words) single phonemes (an
affricate and a sibilant respectively).
Now where would this make a difference? Well, suppose you wanted to
treat 'ch' and 'sh' as distinct letters in the alphabet for purposes of
sorting, i.e. the alphabet looked like
a b c ch d e... s sh t...
Then a linguistically aware program should be able to handle sorting
like that. A linguistically aware program *should* also be able to
handle this situation in searching. That is, if you search for all
words with the letter 'c', the search should *not* return words with the
letter 'ch' (unless they also contain the letter 'c', of course).
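
A rough sketch of that behaviour, in Python (an assumption for
illustration; this is not how Toolbox or any particular program does it).
The alphabet list, the word list, and the function names are all
hypothetical. Words are split into letters by longest match, so 'ch' is
tried before 'c', and both the sort key and the letter search work on
those letters rather than on raw characters.

```python
import re

# Hypothetical alphabet in which 'ch' and 'sh' are letters of their own.
ALPHABET = ["a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k",
            "l", "m", "n", "o", "p", "q", "r", "s", "sh", "t", "u", "v",
            "w", "x", "y", "z"]
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

# Longest letters first, so that 'ch' is matched before 'c'.
LETTER_RE = re.compile(
    "|".join(sorted(map(re.escape, ALPHABET), key=len, reverse=True)))


def letters(word):
    """Split a word into letters of ALPHABET, treating digraphs as units."""
    return LETTER_RE.findall(word.lower())


def sort_key(word):
    """Sort key that ranks each letter by its position in ALPHABET."""
    return [RANK[letter] for letter in letters(word)]


def contains_letter(word, letter):
    """True if the word contains the given letter as a unit of its own."""
    return letter in letters(word)


words = ["cup", "chip", "dog", "cat", "shoe", "sun", "tea"]
print(sorted(words, key=sort_key))
print([w for w in words if contains_letter(w, "c")])
```

With this alphabet 'chip' sorts after 'cup', 'shoe' sorts after 'sun',
and a search for the letter 'c' finds 'cup' and 'cat' but not 'chip'.
The sketch treats every 'ch' or 'sh' sequence as a digraph, which is only
right "in most words"; a real tool would need a way to mark exceptions.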
Any linguistic program has to be able to deal with the fact that these
digraphs represent single characters in some sense. So I'm not sure why
the decomposed representation of a vowel + accent should cause a problem
for sorting; you just need to tell the program what to do (sort digraphs
with monographs, and accented vowels together with unaccented, or not).
At the same time, you should be able to ignore certain things for
searching, although this is more likely to show up with tone than with
digraphs like 'ch' and 'sh'. That is, you should be able to search for
words containing 'a' regardless of whether there is an accent, if you
want to do so. I'll have to leave it to someone more familiar with
Toolbox than me to answer that, although I'd be very surprised if there
weren't some way to do this.
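
Outside Toolbox, the general idea can be sketched in a few lines of
Python (an assumption for illustration, not a description of any Toolbox
feature). The set of tone marks, the sample strings, and the function
names are hypothetical. The text is decomposed, the tone marks are
dropped, and the rest is recomposed, so the dot below survives and dotted
and undotted vowels stay distinct.

```python
import unicodedata

# Combining marks treated as tone marks here (an assumption):
# grave (U+0300), acute (U+0301), macron (U+0304).
TONE_MARKS = {"\u0300", "\u0301", "\u0304"}


def fold_tones(text):
    """Remove tone marks but keep other marks such as the dot below."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", stripped)


def find_words(words, pattern):
    """Return the words that contain pattern once tones are ignored."""
    target = fold_tones(pattern)
    return [w for w in words if target in fold_tones(w)]


# Illustrative strings only, written with escapes to fix the encoding:
# a-acute kw a-grave, a-grave kw a-acute, dotted o/u with acute, plain oke.
words = ["\u00e1kw\u00e0", "\u00e0kw\u00e1",
         "\u1ecd\u0301k\u1ee5\u0301", "oke"]

print(find_words(words, "a"))       # tone on 'a' is ignored
print(find_words(words, "\u1ecd"))  # dotted o, with any tone
print(find_words(words, "o"))       # plain o only
```

Because the folding goes through normalization, it does not matter
whether the stored text uses pre-composed or decomposed characters.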
> ..a combination of subdotted vowels with the tone marks is simply an
> additional burden and most programs that profess to be Unicode based
> simply handle them as TWO separate characters...
>
> The second issue is connected with the internet. Netscape, Opera,
> Firefox can all display Unicode texts without tone marks. But a
> subdotted vowel that is combined with a tone mark is displayed as TWO
> separate characters.
In what sense do these programs handle the dotted vowel + tone mark as
two characters? Are they displaying the tone marks to the right of the
vowel, or is the problem s.t. more subtle than this?
Mike Maxwell