[Lexicog] Tone languages in Toolbox/Shoebox
David Joffe
david.joffe at TSHWANEDJE.COM
Mon Apr 10 11:17:00 UTC 2006
Hi Chinedu,
> The first issue is in terms of individual standalone programs that
> should handle Igbo texts. I have experimented with different programs:
> Concordance, Corpus Presenter, Wordsmith, ToolBox, and Word Sketch
> Engine [WSE] (http://www.sketchengine.co.uk/) etc. Only ToolBox and
> WordSketch Engine can handle the subdotted vowels without problem; but
> a combination of subdotted vowels with the tone marks is simply an
> additional burden and most programs that profess to be Unicode based
> simply handle them as TWO separate characters. In other words, the
> problem STILL persists. WSE is server-based and is not freely
> available. In some programs (that profess to be Unicode based) one
> component or the other would give you question marks for the sub-dotted
> vowels, while another component of the same program would render the
> text as it should. Tone marks have not yet been added!
I assume you are referring to "combining characters" (or "combining
diacritical marks") in Unicode (not to be confused with "precomposed
characters" which are something else entirely) ... "combining
characters" are special, separate, characters consisting only of the
diacritic mark, that are supposed to be drawn by the font rendering
system over/under the *preceding* character, e.g. U+0323 ("Combining
Dot Below") after a vowel should cause the dot to be rendered under
the vowel. In practice though I know they often don't combine
correctly with the previous character (e.g. you may see "a^" instead
of "â"). This is seldom the fault of the application however, it's
usually a problem with the fonts themselves. Older fonts (and therein
lies the problem: most fonts supplied with Windows are basically
"older fonts" that haven't been properly updated in a long time)
don't really "know about" combining characters, so even if the
application does support Unicode, if an older font is being used to
display the text it may render incorrectly anyway. In many cases the
problem(s) can be solved by simply choosing a different font within
the application (to the extent that the application allows you to do
so). E.g. Arial Unicode MS, although not a terribly attractive font,
is generally speaking a good bet standards-wise. Also try e.g.
Gentium (SIL's free Unicode font).
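To make this concrete, here is a small sketch (mine, not from the original discussion) using only Python's standard library: an Igbo-style high-tone dotted vowel built from a base letter plus the standard combining characters, with each underlying code point listed. The rendering is the font's job; the data itself is just these three code points.

```python
import unicodedata

# "o" + COMBINING DOT BELOW (U+0323) + COMBINING ACUTE ACCENT (U+0301):
# three code points that a capable font should render as a single glyph.
o_high_tone = "o\u0323\u0301"

for ch in o_high_tone:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+006F  LATIN SMALL LETTER O
# U+0323  COMBINING DOT BELOW
# U+0301  COMBINING ACUTE ACCENT
```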
Very broadly speaking, a few rules of thumb:
- If Unicode characters come out as ordinary question marks (and stay
question marks afterwards, regardless of which font is used) that
typically means the application doesn't support Unicode and the
information about what character it was has been "lost".
- If you see hollow squares (or solid black squares/rectangles, or
black diamonds with question marks), that usually means the
application does support Unicode and the text is encoded correctly
(i.e. the information about which character it is is retained even
though you can't see it), but the font does not contain the desired
character (a far less serious problem).
- Likewise, if e.g. a combining "^" after an "a" simply comes out as
"a^" instead of "â", that most likely means the application *does*
correctly support Unicode and that the character is encoded
correctly, but that a 'smarter' font needs to be selected.
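You can apply these rules of thumb programmatically, too. The hypothetical helper below (a Python sketch of mine, standard library only) lists the code points actually present in a string: if the combining marks are still there, whatever "?" or hollow boxes you see on screen are a font problem, not data loss; a literal question mark (U+003F) in the data itself means the character really was lost.

```python
import unicodedata

def show_code_points(text):
    """List each code point in `text`, so you can tell data corruption
    (a literal '?' stored in the data) apart from mere font/rendering
    problems (the combining marks are still present)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

print(show_code_points("a\u0302"))   # base "a" + COMBINING CIRCUMFLEX ACCENT
```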
> The second issue is connected with the internet. Netscape, Opera,
> Firefox can all display Unicode texts without tone marks. But a
> subdotted vowel that is combined with a tone mark is displayed as TWO
> separate characters. That is why some Africans prefer to present texts
> of their language that involve such combined characters either as pdf
> or images (screenshots).
Likewise, this is a problem with the fonts themselves, not with
Firefox/Netscape/Opera etc. In fact, apart from giving you (and the
web page designers) options to display text using a different font,
there is basically nothing further that *could* be done by the
developers of those applications even if they wanted to, because an
application can't actually "tell" if a selected font doesn't support
combining characters properly; the application has "done all it can",
so to speak. (IMO Firefox actually already goes pretty far in trying
to get things to display correctly, e.g. checking if any given
character has a representation in the current font and choosing a
different font for that character if it does not, unlike Internet
Explorer which just displays hollow squares). (I do think it is a
problem that Microsoft 'in this day and age' doesn't try to
distribute a decent general Unicode font with Windows, but let's hope
they make an effort with the upcoming Vista.)
> Each time I ask around or try to find out a lasting solution, I always
> get the same answer: the solution lies with the font designer; he
> should produce pre-composed characters, for example: " vowel + accent
> + a dot under the vowels".
> Is this answer wrong? Or has a Unicode based solution already been
> found?
It's partially correct in that the solution lies with the font
designers, but it's not correct that they should produce pre-composed
characters (as these will be non-standard). What they should do is
design the fonts to 'know how to' correctly combine the special
"combining characters" with the preceding characters. (This is the
"correct" solution, i.e. to stick to using the standard "combining
characters" - it is Unicode-based and already supported by Unicode
and basically by all modern font rendering *systems*.)
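This interoperability can be demonstrated with Unicode normalization (again a sketch of mine, using Python's standard library): where a standard precomposed character does exist (e.g. U+1ECD, LATIN SMALL LETTER O WITH DOT BELOW), NFC normalization folds the combining sequence into it, and NFD decomposes it back, losslessly — which is exactly why sticking to the standard characters keeps your data portable.

```python
import unicodedata

# Fully decomposed form: base letter plus two combining marks.
decomposed = "o\u0323\u0301"          # o + dot below + acute

# NFC folds o + dot-below into the precomposed U+1ECD; no fully
# precomposed "o + dot below + acute" exists in Unicode, so the
# acute accent remains a combining character.
nfc = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in nfc])   # ['U+1ECD', 'U+0301']

# The round trip is lossless.
assert unicodedata.normalize("NFD", nfc) == decomposed
```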
In my view, it's very important to stick to using Unicode standard
characters / combining characters. If some combined characters don't
display properly in certain internal circumstances, then as long as
the output/products you produce look correct (depending on how you
produce and distribute the results) and editing isn't impeded, it's
really only a "cosmetic issue". Using the
Unicode private use area to create new (non-standard) precomposed
characters is almost like creating your own proprietary font (= not a
good idea). It's still 'early days' with Unicode, relatively
speaking, so there are still a lot of 'teething problems' especially
with decent font support and so on, but I think one should stick it
through; within the next several years we are going to be seeing more
and more (and better) fonts that correctly support e.g. combining
characters, and so long as you have stuck to the existing standards
then basically the most you'll have to do is 'select a different
font'. However, if you've encoded your data with custom 'private use
area' pre-combined characters, you'll be stuck with the proprietary
fonts you designed. You also won't be able to properly interchange
(e.g. copy and paste) data with other Unicode apps (e.g. Word,
OpenOffice etc.) without doing custom conversions.
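If you suspect existing data may already contain such custom characters, the ranges are easy to scan for. The hypothetical helper below (my sketch, standard library only) flags any Private Use Area code points — the BMP PUA plus the two supplementary PUA planes — which will only ever render with the one custom font that defines them:

```python
# Private Use Area ranges: the BMP PUA plus planes 15 and 16.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def find_private_use(text):
    """Return (index, code point) pairs for any Private Use Area
    characters: these are exactly the characters that won't survive
    a font change or interchange with other Unicode applications."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in PUA_RANGES)
    ]

print(find_private_use("o\u0323\u0301"))   # standard combining marks: safe
print(find_private_use("\ue001"))          # custom PUA character: fragile
```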
(If typesetting the Unicode stuff is problematic, I still think it's
generally better to stick to Unicode for working with and storing
your data, and only do a bit of custom non-standard munging of the
data and/or fonts for the typesetting process if you have to.)
- David
---
http://tshwanedje.com/
TshwaneDJe Human Language Technology