[Lexicog] Tone languages in Toolbox/Shoebox

Mike Maxwell maxwell at LDC.UPENN.EDU
Thu Apr 13 01:27:59 UTC 2006


neduchi at netscape.net wrote:
> I am happy you've noticed my point. I myself have actually been looking 
> for a more permanent solution before now; and the solution (in my view) 
> should also address at least one of the points raised by the initial 
> posters: making wordlists and phono-morphological analysis. 

I believe--to the extent that I understand the issues being raised--that 
_display_ (what the characters look like, e.g. whether accents or dots 
appear centered under the base char), and _behavior_ with respect to 
programs (whether you can sort or search using composed chars) are two 
different issues.  I agree with you that there is a display issue, at 
least with some of the fonts (other responders noted that the SIL Doulos 
font seemed to display these just fine).  I am not convinced that there 
is a behavior issue, at least for Unicode-aware programs.
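
For what it's worth, here is the kind of behavior I have in mind, sketched 
in Python (standard library only; the word forms are invented for 
illustration): once you put strings into one normalization form, it no 
longer matters whether a vowel was typed pre-composed or as base plus 
combining accent.

    import unicodedata

    def norm(s):
        # pick one canonical form before comparing, sorting, or searching
        return unicodedata.normalize("NFD", s)

    words = ["a\u0301kwa", "\u00E1kwa", "akwa"]   # decomposed, pre-composed, unaccented
    print(norm(words[0]) == norm(words[1]))       # True: same word, two encodings
    print(sorted(words, key=norm))                # the two accented spellings sort together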

> My conclusions so far:
> 
> (1) Pre-Composed Characters: you have come back to my initial comment 
> by saying that:
> 
>>> Some of these problems would be solved by using pre-composed chars.<<

I was referring to the NFD (decomposed) forms that you were using, in 
places where an NFC (composed) form was available.  Namely, the accented 
vowels that do _not_ have underdots; you wrote them in decomposed forms, 
but there are pre-composed forms for these.  I am not advocating that 
more pre-composed characters be added to Unicode, and indeed it would be 
virtually impossible to add all the pre-composed characters that someone 
might want.  And it _should_ be unnecessary, although I grant that the 
current state of the art for display leaves something to be desired.
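
In other words, where a pre-composed character already exists, the 
normalization algorithm will hand it to you automatically; a small sketch 
of what I mean, in Python (using the standard unicodedata module):

    import unicodedata

    decomposed = "e\u0301"                      # e + COMBINING ACUTE ACCENT, as typed
    composed = unicodedata.normalize("NFC", decomposed)
    print(len(decomposed), len(composed))       # 2 1
    print(unicodedata.name(composed))           # LATIN SMALL LETTER E WITH ACUTE (U+00E9)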

> (2) Sub-dotted Vowels: you explained that
> 
>>> The dot under problem is more difficult, because there are few
>>> pre-composed dot-under characters (maybe none, I can't remember), and
>>> certainly no pre-composed characters having both the dot under and an
>>> acute or grave.
> 
> I could not find much in the Unicode char-set. But if you look at the 
> "Index For Unicode HTML References", 
> (http://home.att.net/~jameskass/UNI00001.HTM), especially the row that 
> starts with the number: 471 [01D7], you can see that 471 is made up of 
> (1) a vowel, (2) an Umlaut, and (3) a tone mark ON TOP of the umlaut.
>
> The row in question involves "combining characters" (or "combining 
> diacritical marks") in Unicode, which is the kind of solution that 
> should solve the problem that African languages like Igbo and Yoruba 
> are presently having.

I _believe_ that chars like U+01D7 were included in the initial versions 
of Unicode for consistency with pre-existing (probably ISO) encoding 
standards.  That option is now closed, as far as I understand, i.e. no 
new pre-composed chars are being created (all previously existing 
standards having been accounted for).

> I think that the pre-composed character solution is simpler and should 
> also be of immense help to a lot of people working on African 
> languages.  Am I wrong?

If Igbo (and other African languages that use the dot under vowels) were 
the only case, I suppose this would be simple.  But they are not the 
only case, and there is doubtless a nearly unlimited set of composite 
chars that people would require.  Instead, the solution is to make the 
use of non-pre-composed chars work right.
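
To make the dot-under case concrete: when I normalize such a vowel in 
Python (unicodedata again; I believe the dot-below vowel by itself does 
have a pre-composed code point, but the vowel-plus-dot-plus-tone 
combination does not), the tone mark necessarily stays a combining 
character, so software has to handle the decomposed form either way:

    import unicodedata

    seq = "e\u0323\u0301"            # e + COMBINING DOT BELOW + COMBINING ACUTE ACCENT
    nfc = unicodedata.normalize("NFC", seq)
    print([unicodedata.name(c) for c in nfc])
    # ['LATIN SMALL LETTER E WITH DOT BELOW', 'COMBINING ACUTE ACCENT']
    # NFC composes as far as existing code points allow; there is no single
    # character for "e with dot below and acute".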

> (3) The Initial Poster:
> The relevant portion of their mail:
> <<2. How to make Toolbox disambiguate between for example an "a" with 
> high tone, an "a" with a mid and an "a" with a low tone. So far we 
> can make it "look" right by adding diacritics as secondary characters 
> but this does not help for searching, glossing, ... since Toolbox 
> seems to ignore secondary characters in for example searching.>>

Someone who knows Toolbox needs to respond to this.  I find it very 
unlikely that TB ignores "secondary chars" (= diacritics) when searching 
(although it might have a way of ignoring them when you want to, e.g. if 
you _want_ to search for all instances of 'a' regardless of tone).  If 
it does always ignore diacritics, this is a bug with TB, which I'm sure 
the programmers of TB would be happy to fix.
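
The right behavior, I would think, is for diacritic-insensitive matching 
to be something the user asks for explicitly.  A rough sketch of the 
distinction in Python (word forms invented for illustration):

    import unicodedata

    def strip_marks(s):
        # deliberately tone-blind: decompose, then drop combining marks (category Mn)
        return "".join(c for c in unicodedata.normalize("NFD", s)
                       if unicodedata.category(c) != "Mn")

    text = unicodedata.normalize("NFD", "a\u0301kwa \u00E0kwa akwa")
    print("a\u0300" in text)             # True: tone-sensitive search finds the grave-marked vowel
    print("akwa" in strip_marks(text))   # True: tone-insensitive search, only when you want it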

> ...Having the characters as 
> pre-composed characters would be an immense help to a lot of 
> researchers; it would make all sorts of searching (+ regular 
> expression)...

Regular expressions with non-pre-composed chars already work in several 
programming languages, such as Perl and Python.  I believe there are 
standards (published algorithms) for this.  A Unicode-aware program 
which allows regular expression searches but does not treat 
non-pre-composed chars correctly is simply broken, and needs to be fixed.
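
For instance, in Python the standard re module handles decomposed text 
fine once everything is in one normalization form (the third-party regex 
module goes further and can match whole grapheme clusters, if I recall 
correctly); the word forms below are just illustrative:

    import re
    import unicodedata

    text = unicodedata.normalize("NFD", "\u00E1kwa a\u0301kwa akwa")   # mixed input, one form after NFD
    pattern = re.compile("a\u0301")      # high-tone (acute) 'a', written decomposed
    print(len(pattern.findall(text)))    # 2: both spellings of the accented word are found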

> ...and making word lists easier...

I'm not sure why having pre-composed chars would make the creation of 
word lists easier.  Again, there are solutions to editing with 
non-pre-composed chars in Unicode (yudit is one such editor, but these 
days many editors can edit and display Unicode).
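
A word list built from normalized text already treats pre-composed and 
decomposed spellings as the same word; something like this sketch (Python 
again, with a crude whitespace tokenizer and made-up forms):

    import unicodedata
    from collections import Counter

    def wordlist(text):
        # normalize so the same word is counted once, however it was typed
        return Counter(unicodedata.normalize("NFC", text).split())

    sample = "\u00E1kwa a\u0301kwa akwa"
    for word, count in sorted(wordlist(sample).items()):   # plain code-point order, not real collation
        print(word, count)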

> Exchange of data ...

There is an issue with normalization, particularly the case of 
decomposed and pre-composed representations which (are supposed to) look 
the same, but which of course have different internal constituencies. 
But adding more pre-composed chars makes this worse, not better, because 
there are more cases to convert to either pre-composed or decomposed 
form, depending on which way the receiving program wants to work. 
(Actually, this is not a big problem, so in itself it is not an argument 
against adding more pre-composed forms.)
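
In practice the receiving program can simply normalize everything at its 
boundary, after which it does not matter which form the sending program 
used; a sketch (the function name is mine, purely for illustration):

    import unicodedata

    def receive(raw_text, form="NFC"):
        # normalize on the way in; "NFC" or "NFD", whichever the program prefers internally
        return unicodedata.normalize(form, raw_text)

    from_program_a = "\u00E1kw\u1EB9"      # pre-composed spellings
    from_program_b = "a\u0301kwe\u0323"    # fully decomposed spellings
    print(receive(from_program_a) == receive(from_program_b))   # True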

> internet searching of tone marked texts of the languages would also be 
> enhanced; instead of the present situation of making only pdf or 
> screenshots of documents that others must read through to find what 
> they are looking for. 

There are many Unicode pages out there which do not use PDF, and I can 
search them with the appropriate tools.

> But I am still interested in the solution found by the initial posters.

I _believe_ that (most of) the problems the initial posters raised are 
of a phonological nature (how to write phonological rules to deal with 
tone), things which are not solved by Unicode (or any other character 
encoding system).  There are tools for that (like the Xerox Finite State 
Toolkit; and I wrote one ten years ago, Hermit Crab, but unfortunately 
that was in the pre-Unicode days).

But I should let the original posters say if anyone gave them ideas 
off-line.


 