[Lexicog] Tone languages in Toolbox/Shoebox
neduchi at NETSCAPE.NET
Mon Apr 10 12:26:04 UTC 2006
Hi David,
Nice to read from you.
Yes, I am referring to "combining characters" (or "combining
diacritical marks") in Unicode, but with the understanding that a
program should handle the combined characters as ONE whole (and not
process them as two or three disparate symbols). I think a program
should process combined characters as one whole in the same way it
handles a pre-composed character as ONE whole. It is only when the
characters are processed as such that one can easily do all the sorts
of text manipulation that the linguist usually enjoys.
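To illustrate what "one whole" means in practice (a minimal Python sketch, not tied to any of the programs mentioned): a program can group a base character together with its trailing combining marks by looking at the Unicode combining class.

```python
import unicodedata

def clusters(text):
    """Group each base character with its trailing combining marks,
    so that 'o' + dot below + acute is treated as ONE whole."""
    units = []
    for ch in text:
        # combining() is non-zero for combining diacritical marks
        if units and unicodedata.combining(ch):
            units[-1] += ch   # attach the mark to the preceding base
        else:
            units.append(ch)
    return units

# Igbo subdotted o with a high-tone mark: three code points, one character
s = "o\u0323\u0301"
print(len(s))            # 3 code points
print(len(clusters(s)))  # 1 user-perceived character
```

Once text is segmented this way, operations like sorting, searching, or counting can work on user-perceived characters rather than raw code points.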
You explained that font designers should:
" design the fonts to 'know how to' correctly combine the special
"combining characters" with the preceding characters. (This is the
"correct" solution, i.e. to stick to using the standard "combining
characters" - it is Unicode-based and already supported by Unicode and
basically by all modern font rendering *systems*.)"
I do agree with sticking to established standards, and with
contributing to the development and adoption of the standard. But as
things now stand, we would continue to have the problem mentioned by
the poster who started this discussion. The Unicode-based solution
still needs to be explored in the way you suggested; this has not yet
been done. Until it is, the lack of such support will surely continue
to influence (or even determine) the availability of resources for
working on the languages that need such enhancements.
Question: is it not possible for those engaged in the development of
Doulos or Gentium to do exactly what you mentioned? [I am not a font
designer!!]
Chinedu Uchechukwu
-----Original Message-----
From: David Joffe <david.joffe at tshwanedje.com>
To: lexicographylist at yahoogroups.com
Sent: Mon, 10 Apr 2006 13:17:00 +0200
Subject: Re: [Lexicog] Tone languages in Toolbox/Shoebox
Hi Chinedu,
> The first issue is in terms of individual standalone programs that
> should handle Igbo texts. I have experimented with different programs:
> Concordance, Corpus Presenter, Wordsmith, ToolBox, and Word Sketch
> Engine [WSE] (http://www.sketchengine.co.uk/) etc. Only ToolBox and
> WordSketch Engine can handle the subdotted vowels without problem; but
> a combination of subdotted vowels with the tone marks is simply an
> additional burden, and most programs that profess to be Unicode-based
> simply handle them as TWO separate characters. In other words, the
> problem STILL persists. WSE is server-based and is not freely
> available. In some programs (that profess to be Unicode-based) one
> component or the other would give you question marks for the
> sub-dotted vowels, while another component of the same program would
> render the text as it should. Tone marks have not yet been added!
I assume you are referring to "combining characters" (or "combining
diacritical marks") in Unicode (not to be confused with "precomposed
characters" which are something else entirely) ... "combining
characters" are special, separate, characters consisting only of the
diacritic mark, that are supposed to be drawn by the font rendering
system over/under the *preceding* character, e.g. U+0323 ("Combining
Dot Below") after a vowel should cause the dot to be rendered under
the vowel. In practice though I know they often don't combine
correctly with the previous character (e.g. you may see "a^" instead
of "â"). This is seldom the fault of the application however, it's
usually a problem with the fonts themselves. Older fonts (and therein
lies the problem: most fonts supplied with Windows are basically
"older fonts" that haven't been properly updated in a long time)
don't really "know about" combining characters, so even if the
application does support Unicode, if an older font is being used to
display the text it may render incorrectly anyway. In many cases the
problem(s) can be solved by simply choosing a different font within
the application (to the extent that the application allows you to do
so). E.g. Arial Unicode MS, although not a terribly attractive font,
is generally speaking a good bet standards-wise. Also try e.g.
Gentium.
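The distinction between combining and precomposed characters is defined by Unicode normalization, independently of any font. A small Python sketch (and note that the subdotted-vowel-plus-tone-mark combinations have no fully precomposed code point at all, which is exactly why font support for combining marks matters here):

```python
import unicodedata

# "a" + COMBINING CIRCUMFLEX (U+0302) vs the precomposed "â" (U+00E2)
decomposed = "a\u0302"
precomposed = "\u00e2"
print(decomposed == precomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Subdotted o + acute tone mark has no fully precomposed code point,
# so even NFC must leave a combining mark for the font to position:
nfc = unicodedata.normalize("NFC", "o\u0323\u0301")
print([unicodedata.name(c) for c in nfc])
# ['LATIN SMALL LETTER O WITH DOT BELOW', 'COMBINING ACUTE ACCENT']
```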
Very broadly speaking, a few rules of thumb:
- If Unicode characters come out as ordinary question marks (and stay
question marks afterwards, regardless of which font is used) that
typically means the application doesn't support Unicode and the
information about what character it was has been "lost".
- If you see hollow squares (or solid black squares/rectangles, or
black diamonds with question marks), that usually means the
application does support Unicode and the text is encoded correctly
(meaning, the information about what character it is is retained even
though you can't see the character), but the font does not contain
the desired character (a far more "minor" problem).
- Likewise, if e.g. a combining "^" after an "a" simply comes out as
"a^" instead of "â", that most likely means the application *does*
correctly support Unicode and that the character is encoded
correctly, but that a 'smarter' font needs to be selected.
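These rules of thumb come down to one question: did the underlying code points survive? A Python illustration of the "question mark" case, where the data really is lost (round-tripping through a legacy 8-bit encoding is one typical cause, used here as a hypothetical example):

```python
# Forcing Unicode text through a legacy 8-bit encoding is one typical
# way characters become literal question marks for good:
s = "\u1ecd\u0301"  # o with dot below + combining acute accent
lossy = s.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)         # '??' -- the original characters are unrecoverable
print(lossy == s)    # False: the data itself was lost
# By contrast, hollow squares or "a^" rendering leave s intact in
# memory; only the font fails, so selecting another font fixes it.
```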
> The second issue is connected with the internet. Netscape, Opera, and
> Firefox can all display Unicode texts without tone marks. But a
> subdotted vowel that is combined with a tone mark is displayed as TWO
> separate characters. That is why some Africans prefer to present
> texts of their language that involve such combined characters either
> as PDF or images (screenshots).
Likewise, this is a problem with the fonts themselves, not with
Firefox/Netscape/Opera etc. In fact, apart from giving you (and the
web page designers) options to display text using a different font,
there is basically nothing further that *could* be done by the
developers of those applications even if they wanted to, because an
application can't actually "tell" if a selected font doesn't support
combining characters properly; the application has "done all it can",
so to speak. (IMO Firefox actually already goes pretty far in trying
to get things to display correctly, e.g. checking if any given
character has a representation in the current font and choosing a
different font for that character if it does not, unlike Internet
Explorer which just displays hollow squares). (I do think it is a
problem that Microsoft 'in this day and age' doesn't try to
distribute a decent general Unicode font with Windows, but let's hope
they make an effort with the upcoming Vista.)
> Each time I ask around or try to find out a lasting solution, I
> always get the same answer: the solution lies with the font designer;
> he should produce pre-composed characters, for example: "vowel +
> accent + a dot under the vowels".
> Is this answer wrong? Or has a Unicode based solution already been
> found?
It's partially correct in that the solution lies with the font
designers, but it's not correct that they should produce pre-composed
characters (as these will be non-standard). What they should do is
design the fonts to 'know how to' correctly combine the special
"combining characters" with the preceding characters. (This is the
"correct" solution, i.e. to stick to using the standard "combining
characters" - it is Unicode-based and already supported by Unicode
and basically by all modern font rendering *systems*.)
In my view, it's very important to stick to using Unicode standard
characters / combining characters. If some combined characters don't
display properly in certain circumstances internally, then so long as
the output/products you produce look correct (depending on how you
produce and distribute the results), and it doesn't impede editing,
then it's really only a "cosmetic issue" at that point. Using the
Unicode private use area to create new (non-standard) precomposed
characters is almost like creating your own proprietary font (= not a
good idea). It's still 'early days' with Unicode, relatively
speaking, so there are still a lot of 'teething problems' especially
with decent font support and so on, but I think one should stick it
through; within the next several years we are going to be seeing more
and more (and better) fonts that correctly support e.g. combining
characters, and so long as you have stuck to the existing standards
then basically the most you'll have to do is 'select a different
font'. However, if you've encoded your data with custom 'private use
area' pre-combined characters, you'll be stuck with the proprietary
fonts you designed. You'll also not be able to properly interchange
(e.g. copy and paste) data with other Unicode apps (e.g. Word,
OpenOffice etc.) without doing custom conversions either.
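One way to see why private use area characters travel so badly (a Python sketch; U+E001 is a made-up custom code point standing in for a hypothetical PUA "precomposed" character):

```python
import unicodedata

standard = "o\u0323\u0301"  # o + combining dot below + combining acute
custom = "\ue001"           # hypothetical PUA "precomposed" character

# Standard characters carry their identity with the data itself:
print([unicodedata.name(c) for c in standard])
# ['LATIN SMALL LETTER O', 'COMBINING DOT BELOW', 'COMBINING ACUTE ACCENT']

# A PUA code point has no defined name or properties; its meaning
# lives only in your proprietary font, so any other Unicode app
# (Word, OpenOffice, ...) sees only an anonymous code point:
try:
    unicodedata.name(custom)
except ValueError:
    print("U+E001 means nothing outside your own font")
```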
(If typesetting the Unicode stuff is problematic, I still think it's
generally better to stick to Unicode for working with and storing
your data, and only do a bit of custom non-standard munging of the
data and/or fonts for the typesetting process if you have to.)
- David
---
http://tshwanedje.com/
TshwaneDJe Human Language Technology
YAHOO! GROUPS LINKS
* Visit your group "lexicographylist" on the web.
* To unsubscribe from this group, send an email to:
lexicographylist-unsubscribe at yahoogroups.com
* Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
--------