[Lexicog] Tone languages in Toolbox/Shoebox
Mike Maxwell
maxwell at LDC.UPENN.EDU
Thu Apr 13 01:27:59 UTC 2006
neduchi at netscape.net wrote:
> I am happy you've noticed my point. I myself have actually been looking
> for a more permanent solution before now; and the solution (in my view)
> should also address at least one of the points raised by the initial
> posters: making wordlists and phono-morphological analysis.
I believe--to the extent that I understand the issues being raised--that
_display_ (what the characters look like, e.g. whether accents or dots
appear centered under the base char), and _behavior_ with respect to
programs (whether you can sort or search using composed chars) are two
different issues. I agree with you that there is a display issue, at
least with some of the fonts (other responders noted that the SIL Doulos
font seemed to display these just fine). I am not convinced that there
is a behavior issue, at least for Unicode-aware programs.
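To illustrate the behavior side, here is a minimal Python sketch
(nothing in it is specific to Toolbox):

    import unicodedata

    composed = "\u00E1"      # a-acute as a single pre-composed code point
    decomposed = "a\u0301"   # 'a' plus combining acute accent

    # The two strings should display identically, but as raw sequences
    # of code points they differ:
    print(composed == decomposed)   # False

    # A Unicode-aware program can still treat them as the same thing by
    # normalizing before comparing:
    print(unicodedata.normalize("NFD", composed) == decomposed)   # True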
> My conclusions so far:
>
> (1) Pre-Composed Characters: you have come back to my initial comment
> by saying that:
>
>>> Some of these problems would be solved by using pre-composed chars.<<
I was referring to the NFD (decomposed) forms that you were using in
places where an NFC (composed) form was available: namely, the accented
vowels that do _not_ have underdots. You wrote them in decomposed form,
but there are pre-composed forms for these. I am not advocating that
more pre-composed characters be added to Unicode, and indeed it would be
virtually impossible to add all the pre-composed characters that someone
might want. And it _should_ be unnecessary, although I grant that the
current state of the art for display leaves something to be desired.
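In Python, for instance, a single normalization call turns those
decomposed accented vowels into their pre-composed equivalents (a
sketch using the standard unicodedata module):

    import unicodedata

    # 'a' + combining acute: a pre-composed equivalent exists, and NFC
    # normalization produces it.
    s = unicodedata.normalize("NFC", "a\u0301")
    print(len(s), "U+%04X" % ord(s))   # 1 U+00E1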
> (2) Sub-dotted Vowels: you explained that
>
>>> The dot under problem is more difficult, because there are few
>>> pre-composed dot-under characters (maybe none, I can't remember), and
>>> certainly no pre-composed characters having both the dot under and an
>>> acute or grave.
>
> I could not find much in the Unicode char-set. But if you look at the
> "Index For Unicode HTML References",
> (http://home.att.net/~jameskass/UNI00001.HTM), especially the row that
> starts with the number: 471 [01D7], you can see that 471 is made up of
> (1) a vowel, (2) an Umlaut, (3) and a tone mark ON TOP of the umlaut.
>
> The row in question involves "combining characters" (or "combining
> diacritical marks") in Unicode, which is the kind of solution that
> should solve the problem that African languages like Igbo and Yoruba
> are presently having.
I _believe_ that chars like U+01D7 were included in the initial versions
of Unicode for consistency with pre-existing (probably ISO) encoding
standards. That option is now closed, as far as I understand, i.e. no
new pre-composed chars are being created (all previously existing
standards having been accounted for).
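The legacy of that decision is visible in the character properties:
each such char carries a canonical decomposition back to its parts, so
normalization can move freely between the two forms. A quick check
with Python's unicodedata module:

    import unicodedata

    print(unicodedata.name("\u01D7"))
    # LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE
    print(unicodedata.decomposition("\u01D7"))
    # 00DC 0301  (U-umlaut plus combining acute)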
> I think that the pre-composed character solution is simpler and should
> also be of immense help to a lot of people working on African
> languages. Am I wrong?
If Igbo (and other African languages that use the dot under vowels) were
the only case, I suppose this would be simple. But they are not the
only case, and there is doubtless a nearly unlimited set of composite
chars that people would require. Instead, the solution is to make the
use of non-pre-composed chars work right.
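The Igbo vowels themselves show why: even where a pre-composed dotted
vowel exists, there is no single code point for the dotted vowel
_plus_ a tone mark, so software has to handle the combining sequence
in any case. A Python sketch:

    import unicodedata

    # o + combining dot below + combining acute. NFC composes as far as
    # it can (o + dot below -> U+1ECD), but the tone mark necessarily
    # remains a separate combining char:
    s = unicodedata.normalize("NFC", "o\u0323\u0301")
    print(["U+%04X" % ord(c) for c in s])   # ['U+1ECD', 'U+0301']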
> (3) The Initial Poster:
> The relevant portion of their mail:
> <<2. How to make Toolbox disambiguate between for example an "a" with
> high tone, an "a" with a mid and an "a" with a low tone. So far we
> can make it "look" right by adding diacritics as secondary characters
> but this does not help for searching, glossing, ... since Toolbox
> seems to ignore secondary characters in for example searching.>>
Someone who knows Toolbox needs to respond to this. I find it very
unlikely that TB ignores "secondary chars" (= diacritics) when searching
(although it might have a way of ignoring them when you want to, e.g. if
you _want_ to search for all instances of 'a' regardless of tone). If
it does always ignore diacritics, this is a bug with TB, which I'm sure
the programmers of TB would be happy to fix.
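Both behaviors are straightforward to provide in a Unicode-aware
program; here is a sketch in Python (the two-word "lexicon" is
invented):

    import unicodedata

    def nfd(s):
        return unicodedata.normalize("NFD", s)

    def strip_marks(s):
        # Drop combining chars, for deliberately tone-blind searching.
        return "".join(c for c in nfd(s) if not unicodedata.combining(c))

    words = ["a\u0301kwa", "a\u0300kwa"]   # high-tone vs low-tone 'a'
    query = "a\u0301kwa"

    # Tone-sensitive search (normalize, then compare exactly):
    print([w for w in words if nfd(w) == nfd(query)])
    # Tone-blind search, when that is what the user wants:
    print([w for w in words if strip_marks(w) == strip_marks(query)])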
> ...Having the characters as
> pre-composed characters would be an immense help to a lot of
> researchers; it would make all sorts of searching (+ regular
> expression)...
Regular expressions with non-pre-composed chars already work in several
programming languages, such as Perl and Python. I believe there are
standards (published algorithms) for this. A Unicode-aware program
which allows regular expression searches but does not treat
non-pre-composed chars correctly is simply broken, and needs to be fixed.
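For example, in Python (a sketch; the text is invented, and is
normalized to NFD first so that each tone mark is a separate,
matchable code point):

    import re
    import unicodedata

    text = unicodedata.normalize("NFD", "\u00E1kwa \u00E0kwa akwa")

    # Find words whose initial vowel carries a high tone (combining
    # acute, U+0301):
    print(re.findall(r"a\u0301\w*", text))   # matches only the first word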
> ...and making word lists easier...
I'm not sure why having pre-composed chars would make the creation of
word lists easier. Again, there are solutions to editing with
non-pre-composed chars in Unicode (yudit is one such editor, but these
days many editors can edit and display Unicode).
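But to be concrete: normalizing each token before counting is enough
to make composed and decomposed spellings of the same word fall
together in a word list (a Python sketch; the "corpus" is invented):

    import unicodedata
    from collections import Counter

    corpus = "\u00E1kwa a\u0301kwa a\u0300kwa"   # one word twice, in two encodings
    counts = Counter(unicodedata.normalize("NFD", w) for w in corpus.split())
    for word, n in sorted(counts.items()):
        print(word, n)   # the two spellings of the acute-toned word count as one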
> Exchange of data ...
There is an issue with normalization, particularly the case of
decomposed and pre-composed representations which (are supposed to) look
the same, but which of course have different internal composition.
But adding more pre-composed chars makes this worse, not better, because
there are more cases to convert to either pre-composed or decomposed
form, depending on which way the receiving program wants to work.
(Actually, this is not a big problem, so in itself it is not an argument
against adding more pre-composed forms.)
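In practice the fix is mechanical: normalize at the point of exchange,
so the receiving program only ever sees the form it prefers. A sketch:

    import unicodedata

    def import_text(raw, form="NFC"):
        # Convert incoming data to this program's preferred form,
        # whatever mix of composed/decomposed chars the sender used.
        return unicodedata.normalize(form, raw)

    print(import_text("a\u0301") == import_text("\u00E1"))   # True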
> internet searching of tone marked texts of the languages would also be
> enhanced; instead of the present situation of making only pdf or
> screenshots of documents that others must read through to find what
> they are looking for.
There are many Unicode pages out there which do not use PDF, and I can
search them with the appropriate tools.
> But I am still interested in the solution found by the initial posters.
I _believe_ that (most of) the problems the initial posters raised are
of a phonological nature (how to write phonological rules to deal with
tone), things which are not solved by Unicode (or any other character
encoding system). There are tools for that (like the Xerox Finite State
Toolkit; and I wrote one ten years ago, Hermit Crab, but unfortunately
that was in the pre-Unicode days).
But I should let the original posters say if anyone gave them ideas
off-line.