[Lexicog] Tone languages in Toolbox/Shoebox
David Joffe
david.joffe at TSHWANEDJE.COM
Mon Apr 10 11:17:00 UTC 2006
Hi Chinedu,
> The first issue is in terms of individual standalone programs that
> should handle Igbo texts. I have experimented with different programs:
> Concordance, Corpus Presenter, Wordsmith, ToolBox, and Word Sketch
> Engine [WSE] (http://www.sketchengine.co.uk/) etc. Only ToolBox and
> WordSketch Engine can handle the subdotted vowels without problem; but
> a combination of subdotted vowels with the tone marks is simply an
> additional burden and most programs that profess to be Unicode based
> simply handle them as TWO separate characters. In other words, the
> problem STILL persists. WSE is server-based and is not freely
> available. In some programs (that profess to be Unicode based) one
> component or the other would give you question marks for the sub-dotted
> vowels, while another component of the same program would render the
> text as it should. Tone marks have not yet been added!
I assume you are referring to "combining characters" (or "combining
diacritical marks") in Unicode (not to be confused with "precomposed
characters" which are something else entirely) ... "combining
characters" are special, separate, characters consisting only of the
diacritic mark, that are supposed to be drawn by the font rendering
system over/under the *preceding* character, e.g. U+0323 ("Combining
Dot Below") after a vowel should cause the dot to be rendered under
the vowel. In practice though I know they often don't combine
correctly with the previous character (e.g. you may see "a^" instead
of "â"). This is seldom the fault of the application however, it's
usually a problem with the fonts themselves. Older fonts (and therein
lies the problem: most fonts supplied with Windows are basically
"older fonts" that haven't been properly updated in a long time)
don't really "know about" combining characters, so even if the
application does support Unicode, if an older font is being used to
display the text it may render incorrectly anyway. In many cases the
problem(s) can be solved by simply choosing a different font within
the application (to the extent that the application allows you to do
so). E.g. Arial Unicode MS, although not a terribly attractive font,
is generally speaking a good bet standards-wise. Also try e.g.
Gentium (SIL's free Unicode font).
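To make this concrete, here is a small sketch (mine, not from the original discussion) using only Python's standard library: an Igbo-style high-tone dotted vowel built from a base letter plus the standard combining characters, with each underlying code point listed. The rendering is the font's job; the data itself is just these three code points.

```python
import unicodedata

# "o" + COMBINING DOT BELOW (U+0323) + COMBINING ACUTE ACCENT (U+0301):
# three code points that a capable font should render as a single glyph.
o_high_tone = "o\u0323\u0301"

for ch in o_high_tone:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+0323  COMBINING DOT BELOW
# U+0301  COMBINING ACUTE ACCENT
```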
Very broadly speaking, a few rules of thumb:
- If Unicode characters come out as ordinary question marks (and stay
question marks afterwards, regardless of which font is used) that
typically means the application doesn't support Unicode and the
information about what character it was has been "lost".
- If you see hollow squares (or solid black squares/rectangles, or
black diamonds with question marks), that usually means the
application does support Unicode and the text is encoded correctly
(i.e. the information about which character it is is retained even
though you can't see it), but the font does not contain the desired
character (a far less serious problem).
- Likewise, if e.g. a combining "^" after an "a" simply comes out as
"a^" instead of "â", that most likely means the application *does*
correctly support Unicode and that the character is encoded
correctly, but that a 'smarter' font needs to be selected.
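You can apply these rules of thumb programmatically, too. The hypothetical helper below (a Python sketch of mine, standard library only) lists the code points actually present in a string: if the combining marks are still there, whatever "?" or hollow boxes you see on screen are a font problem, not data loss; a literal question mark (U+003F) in the data itself means the character really was lost.

```python
import unicodedata

def show_code_points(text):
    """List each code point in `text`, so you can tell data corruption
    (a literal '?' stored in the data) apart from mere font/rendering
    problems (the combining marks are still present)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

print(show_code_points("a\u0302"))   # base "a" + COMBINING CIRCUMFLEX ACCENT
```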
> The second issue is connected with the internet. Netscape, Opera,
> Firefox can all display Unicode texts without tone marks. But a
> subdotted vowel that is combined with a tone mark is displayed as TWO
> separate characters. That is why some Africans prefer to present texts
> of their language that involve such combined characters either as pdf
> or images (screenshots).
Likewise, this is a problem with the fonts themselves, not with
Firefox/Netscape/Opera etc. In fact, apart from giving you (and the
web page designers) options to display text using a different font,
there is basically nothing further that *could* be done by the
developers of those applications even if they wanted to, because an
application can't actually "tell" if a selected font doesn't support
combining characters properly; the application has "done all it can",
so to speak. (IMO Firefox actually already goes pretty far in trying
to get things to display correctly, e.g. checking if any given
character has a representation in the current font and choosing a
different font for that character if it does not, unlike Internet
Explorer which just displays hollow squares). (I do think it is a
problem that Microsoft 'in this day and age' doesn't try to
distribute a decent general Unicode font with Windows, but let's hope
they make an effort with the upcoming Vista.)
> Each time I ask around or try to find out a lasting solution, I always
> get the same answer: the solution lies with the font designer; he
> should produce pre-composed characters, for example: " vowel + accent
> + a dot under the vowels".
> Is this answer wrong? Or has a Unicode based solution already been
> found?
It's partially correct in that the solution lies with the font
designers, but it's not correct that they should produce pre-composed
characters (as these will be non-standard). What they should do is
design the fonts to 'know how to' correctly combine the special
"combining characters" with the preceding characters. (This is the
"correct" solution, i.e. to stick to using the standard "combining
characters" - it is Unicode-based and already supported by Unicode
and basically by all modern font rendering *systems*.)
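This interoperability can be demonstrated with Unicode normalization (again a sketch of mine, using Python's standard library): where a standard precomposed character does exist (e.g. U+1ECD, LATIN SMALL LETTER O WITH DOT BELOW), NFC normalization folds the combining sequence into it, and NFD decomposes it back, losslessly — which is exactly why sticking to the standard characters keeps your data portable.

```python
import unicodedata

# Fully decomposed form: base letter plus two combining marks.
decomposed = "o\u0323\u0301"          # o + dot below + acute

# NFC folds o + dot-below into the precomposed U+1ECD; no fully
# precomposed "o + dot below + acute" exists in Unicode, so the
# acute accent remains a combining character.
nfc = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+1ECD', 'U+0301']

# The round trip is lossless.
assert unicodedata.normalize("NFD", nfc) == decomposed
```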
In my view, it's very important to stick to using Unicode standard
characters / combining characters. If some combined characters don't
display properly in certain internal circumstances, then as long as
the output/products you produce look correct (depending on how you
produce and distribute the results) and editing isn't impeded, it's
really only a "cosmetic issue". Using the
Unicode private use area to create new (non-standard) precomposed
characters is almost like creating your own proprietary font (= not a
good idea). It's still 'early days' with Unicode, relatively
speaking, so there are still a lot of 'teething problems' especially
with decent font support and so on, but I think one should stick it
through; within the next several years we are going to be seeing more
and more (and better) fonts that correctly support e.g. combining
characters, and so long as you have stuck to the existing standards
then basically the most you'll have to do is 'select a different
font'. However, if you've encoded your data with custom 'private use
area' pre-combined characters, you'll be stuck with the proprietary
fonts you designed. You also won't be able to properly interchange
(e.g. copy and paste) data with other Unicode apps (e.g. Word,
OpenOffice etc.) without doing custom conversions.
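If you suspect existing data may already contain such custom characters, the ranges are easy to scan for. The hypothetical helper below (my sketch, standard library only) flags any Private Use Area code points — the BMP PUA plus the two supplementary PUA planes — which will only ever render with the one custom font that defines them:

```python
# Private Use Area ranges: the BMP PUA plus planes 15 and 16.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def find_private_use(text):
    """Return (index, code point) pairs for any Private Use Area
    characters: these are exactly the characters that won't survive
    a font change or interchange with other Unicode applications."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in PUA_RANGES)
    ]

print(find_private_use("o\u0323\u0301"))   # standard combining marks: safe
print(find_private_use("\ue001"))          # custom PUA character: fragile
```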
(If typesetting the Unicode stuff is problematic, I still think it's
generally better to stick to Unicode for working with and storing
your data, and only do a bit of custom non-standard munging of the
data and/or fonts for the typesetting process if you have to.)
- David
---
http://tshwanedje.com/
TshwaneDJe Human Language Technology