[Lexicog] Tone languages in Toolbox/Shoebox
neduchi at NETSCAPE.NET
Wed Apr 12 10:27:03 UTC 2006
Hi Mike,
I am happy you've noticed my point. I myself have actually been looking
for a more permanent solution before now; and the solution (in my view)
should also address at least one of the points raised by the initial
posters: making wordlists and phono-morphological analysis. My
conclusions so far:
(1) Pre-Composed Characters: you have come back to my initial comment
by saying that:
>>Some of these problems would be solved by using pre-composed chars.<<
(2) Sub-dotted Vowels: you explained that
>>The dot under problem is more difficult, because there are few
>>pre-composed dot-under characters (maybe none, I can't remember), and
>>certainly no pre-composed characters having both the dot under and an
>>acute or grave.
I could not find much in the Unicode char-set. But if you look at the
"Index For Unicode HTML References"
(http://home.att.net/~jameskass/UNI00001.HTM), especially the row that
starts with the number 471 [01D7], you can see that 471 is made up of
(1) a vowel, (2) an umlaut, and (3) a tone mark ON TOP of the umlaut.
The row in question involves "combined characters" (or "combined
diacritical marks") in Unicode, which is the kind of solution that
should solve the problem that African languages like Igbo and Yoruba
are presently having. But if you try to realize these characters WITHIN
Unicode in such a manner (for Igbo or Yoruba, for example), you would
be going "against the Standards", as David and Greg explained. Instead,
you have to combine different Unicode points and be aware of ALL the
IFs that Greg mentions at the bottom of his message
[http://groups.yahoo.com/group/lexicographylist/message/3021].
I think that the pre-composed character solution is simpler and should
also be of immense help to a lot of people working on African
languages. Am I wrong?
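
To make the contrast in (2) concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module (neither is mentioned in this thread; the function and variable names are only illustrative). It compares row 471 (U+01D7), which is a single code point, with an Igbo-style dotted vowel carrying a high tone, which stays a sequence even after composition.

```python
import unicodedata

def describe(text):
    """Return the code points of `text` as a list of 'U+XXXX NAME' strings."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Row 471 of the chart: one pre-composed code point.
u_diaeresis_acute = "\u01D7"
# The same letter spelled out: base vowel + combining diaeresis + combining acute.
decomposed = unicodedata.normalize("NFD", u_diaeresis_acute)

# An Igbo-style vowel: o + combining dot below + combining acute (high tone).
o_dot_acute = "o\u0323\u0301"
# NFC composes as far as Unicode allows. The dotted vowel alone exists as a
# pre-composed code point, but nothing covers dot below plus acute together.
composed = unicodedata.normalize("NFC", o_dot_acute)

for label, text in [("471", u_diaeresis_acute),
                    ("471 in NFD", decomposed),
                    ("o + dot + acute in NFC", composed)]:
    print(label, len(text), describe(text))
```

The last line of output should list two code points, the dotted vowel plus a separate combining acute, which is the situation described above for Igbo and Yoruba.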
(3) The Initial Poster:
The relevant portion of their mail:
<<2. How to make Toolbox disambiguate between for example an "a" with
<<high tone, an "a" with a mid and an "a" with a low tone. So far we
<<can make it "look" right by adding diacritics as secondary characters
<<but this does not help for searching, glossing, ... since Toolbox
<<seems to ignore secondary characters in for example searching.
I think that this also applies to the sub-dotted vowels when combined
with high, low, and downstepped tones on top. For linguistic works
(dictionary making, syntactic or phono-morphological analysis) it is
the tone-marked texts that are used (at least for Igbo and Yoruba and
other African languages with similar scripts), although texts for
native speakers are not usually tone marked. Having the characters as
pre-composed characters would be an immense help to a lot of
researchers; it would make all sorts of searching (including regular
expressions) and the making of word lists easier. Exchange of data and
internet searching of tone-marked texts of the languages would also be
enhanced, instead of the present situation of producing only PDFs or
screenshots of documents that others must read through to find what
they are looking for. I think the pre-composed character solution would
also lighten the burden for the initial posters, unless they are of a
different opinion.
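
As a rough illustration of the searching and word-list problem in (3), the sketch below works outside Toolbox, again assuming Python 3 with its standard unicodedata and re modules. The set of tone marks (grave, acute, macron), the helper names, and the sample words are assumptions made for this example only. The idea is to decompose the text first, so that the base letter, the sub-dot, and the tone mark can each be addressed separately, whichever way the text was typed.

```python
import re
import unicodedata

# Assumed tone marks: combining grave (low), acute (high), macron (downstep).
TONE_MARKS = "\u0300\u0301\u0304"

def strip_tones(text):
    """Remove tone marks but keep other diacritics such as the dot below."""
    decomposed = unicodedata.normalize("NFD", text)
    toneless = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", toneless)

def find_words(words, toneless_form):
    """Return every word whose toneless spelling equals `toneless_form`."""
    target = strip_tones(toneless_form)
    return [w for w in words if strip_tones(w) == target]

def has_high_tone_a(word):
    """True if the word contains a plain 'a' carrying an acute (high tone)."""
    return re.search("a\u0301", unicodedata.normalize("NFD", word)) is not None

# Illustrative word list: the same segments with different tones, typed with
# a mix of pre-composed and combining characters.
words = ["\u00E1kw\u00E0", "a\u0300kwa\u0301", "akwa",
         "\u1ECD\u0301k\u1EE5\u0300"]

print(find_words(words, "akwa"))
print([w for w in words if has_high_tone_a(w)])
```

Because everything is normalized before comparison, the search does not depend on whether a given vowel was entered as one code point or as a sequence; the sub-dot survives tone stripping because it is not in the assumed tone-mark set.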
Maybe the font experts here can raise this issue at the Unicode
conference [http://www.unicode.org/press/pr-iuc30.html], since it
belongs to the session on "Making scripts and languages accessible".
It is my frustration that I am letting out, in the hope that someone
might hear and do something.
But I am still interested in the solution found by the initial posters.
Best regards,
Chinedu Uchechukwu
-----Original Message-----
From: Mike Maxwell <maxwell at ldc.upenn.edu>
To: lexicographylist at yahoogroups.com
Sent: Tue, 11 Apr 2006 15:29:56 -0400
Subject: Re: [Lexicog] Tone languages in Toolbox/Shoebox
neduchi at netscape.net wrote:
>> In what sense do these programs handle the dotted vowel + tone mark as
>> two characters? Are they displaying the tone marks to the right of the
>> vowel, or is the problem s.t. more subtle than this?
>
>
> Please have a look at the link below. It is a sample of an arbitrarily
> tone-marked Igbo text I put together with Andrew Cunningham. You can
> switch between the three different fonts used: Arial Unicode MS,
> CODE2000 and Doulos SIL. Try out the fonts and observe the location of
> the sub-dots and the tone marks:
> http://www.openroad.net.au/languages/african/igbo/sample.html
>
> I would like to see the sub-dotted and tone-marked characters
> 'compactly' displayed with the tone marks as ONE composite whole and
> not as two or three estranged neighbours.
OK, now I'm beginning to understand the problem you're seeing!
Yes, this appears to be a rendering problem, not a Unicode problem per
se. That is to say, either there's a problem with the font, or with the
technology that displays the font (I'm not sure which).
Let me summarize the rendering issues I see, and let me know if I'm
missing s.t.
First, the accent is much too low over upper case vowels. It's also too
far to the left over the lower and upper case 'i/I' (these appear in the
sample paragraph, but not in the list of sample characters). Also, the
dot under the upper case 'U' is too far to the right (both in the
undotted U in the para, and the dotted U in the sample chars), and the
dot under the lower case 'i' is much too far to the left (in fact,
almost under the preceding letter).
Also, the upper case N with grave (U+01F8) shows up as a box in many
apps (it looks OK in Firefox).
(I also see a dot _over_ n/N in the sample chars--is that correct?)
Some of these problems would be solved by using pre-composed chars.
(That is, many of the chars in the sample para appear to be in NFD
normalization, rather than NFC.) For example, the grave vowels without
dots would probably look just fine if they used the pre-composed
equivalents. (If you are going to use a decomposed character, the grave
accented 'i' should probably be produced with the dotless-i, U+0131.
This unfortunately doesn't solve the problem of the grave accent being
too far to the left.)
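
One way to check whether text is in NFD or NFC, and to convert it, is sketched below. This assumes Python 3 and its standard unicodedata module, which is not part of the toolchain discussed here; the helper name is_nfc and the sample string are made up for illustration. The last two lines also show a side effect worth knowing about: a dotless-i followed by a combining grave has no pre-composed equivalent, so normalization leaves it as two code points, while an ordinary 'i' plus grave composes to U+00EC.

```python
import unicodedata

def is_nfc(text):
    """True if `text` is already in NFC (composed) form."""
    return unicodedata.normalize("NFC", text) == text

# Grave-accented letters typed as base letter + combining grave (NFD style).
decomposed = "a\u0300 e\u0300 i\u0300 o\u0300 u\u0300 N\u0300"
composed = unicodedata.normalize("NFC", decomposed)

print(is_nfc(decomposed), is_nfc(composed))
print([f"U+{ord(ch):04X}" for ch in composed if ch != " "])

# Dotless i (U+0131) + combining grave stays a two-code-point sequence,
# whereas ordinary i + combining grave composes to a single code point.
print(len(unicodedata.normalize("NFC", "\u0131\u0300")))
print(len(unicodedata.normalize("NFC", "i\u0300")))
```

The composed list should include U+01F8, the upper case N with grave mentioned above, so whether it displays as a box depends on the font and application rather than on the encoding.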
The dot under problem is more difficult, because there are few
pre-composed dot-under characters (maybe none, I can't remember), and
certainly no pre-composed characters having both the dot under and an
acute or grave. But the fact that the dots on these characters don't
show up in the right position is a font/rendering issue, which hopefully
will get fixed. FWIW, the problem is noted at the wikipedia page
(http://en.wikipedia.org/wiki/UniCode#Ready-made_versus_composite_characters).
Of course that's no help right now...
In sum, this appears to me to be a rendering issue, not a Unicode issue
per se. It also appears to be a somewhat different question than the
original posters brought up, who I believe were asking for tools to do
phonology and/or morphology.
Mike Maxwell