[Lexicog] Tone languages in Toolbox/Shoebox
neduchi at NETSCAPE.NET
Mon Apr 10 12:26:04 UTC 2006
Hi David,
Nice to read from you.
Yes, I am referring to "combining characters" (or "combining
diacritical marks") in Unicode, but with the understanding that a
program should handle the combined characters as ONE whole (and not
process them as two or three disparate symbols). I think a program
should process combined characters as one whole in the same way it
handles a pre-composed character as ONE whole. It is only when the
characters are processed as such that one can easily do all the sorts
of text manipulation that the linguist usually enjoys.
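To illustrate what "one whole" means in practice (a minimal Python sketch, not tied to any of the programs mentioned): a program can group a base character together with its trailing combining marks by looking at the Unicode combining class.

```python
import unicodedata

def clusters(text):
    """Group each base character with its trailing combining marks,
    so that 'o' + dot below + acute is treated as ONE whole."""
    units = []
    for ch in text:
        # combining() is non-zero for combining diacritical marks
        if units and unicodedata.combining(ch):
            units[-1] += ch   # attach the mark to the preceding base
        else:
            units.append(ch)
    return units

# Igbo subdotted o with a high-tone mark: three code points, one character
s = "o\u0323\u0301"
print(len(s))            # 3 code points
print(len(clusters(s)))  # 1 user-perceived character
```

Once text is segmented this way, operations like sorting, searching, or counting can work on user-perceived characters rather than raw code points.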
You explained that font designers should:
" design the fonts to 'know how to' correctly combine the special
"combining characters" with the preceding characters. (This is the
"correct" solution, i.e. to stick to using the standard "combining
characters" - it is Unicode-based and already supported by Unicode and
basically by all modern font rendering *systems*.)"
I do agree with sticking to established standards, and with
contributing to the development and adoption of the standard. But as
things now stand, we would continue to have the problem mentioned by
the poster who started this discussion. The Unicode-based solution
still needs to be explored in the way you suggested; this has not yet
been done. Until it is, the lack of such support will surely continue
to influence (or even determine) the availability of resources for
working on the languages that need such enhancements.
Question: is it not possible for those engaged in the development of
Doulos or Gentium to do exactly what you mentioned? [I am not a font
designer!!]
Chinedu Uchechukwu
-----Original Message-----
From: David Joffe <david.joffe at tshwanedje.com>
To: lexicographylist at yahoogroups.com
Sent: Mon, 10 Apr 2006 13:17:00 +0200
Subject: Re: [Lexicog] Tone languages in Toolbox/Shoebox
Hi Chinedu,
> The first issue is in terms of individual standalone programs that
> should handle Igbo texts. I have experimented with different programs:
> Concordance, Corpus Presenter, Wordsmith, ToolBox, and Word Sketch
> Engine [WSE] (http://www.sketchengine.co.uk/) etc. Only ToolBox and
> WordSketch Engine can handle the subdotted vowels without problem; but
> a combination of subdotted vowels with the tone marks is simply an
> additional burden, and most programs that profess to be Unicode-based
> simply handle them as TWO separate characters. In other words, the
> problem STILL persists. WSE is server-based and is not freely
> available. In some programs (that profess to be Unicode-based) one
> component or the other would give you question marks for the
> sub-dotted vowels, while another component of the same program would
> render the text as it should. Tone marks have not yet been added!
I assume you are referring to "combining characters" (or "combining
diacritical marks") in Unicode (not to be confused with "precomposed
characters" which are something else entirely) ... "combining
characters" are special, separate, characters consisting only of the
diacritic mark, that are supposed to be drawn by the font rendering
system over/under the *preceding* character, e.g. U+0323 ("Combining
Dot Below") after a vowel should cause the dot to be rendered under
the vowel. In practice though I know they often don't combine
correctly with the previous character (e.g. you may see "a^" instead
of "â"). This is seldom the fault of the application however, it's
usually a problem with the fonts themselves. Older fonts (and therein
lies the problem: most fonts supplied with Windows are basically
"older fonts" that haven't been properly updated in a long time)
don't really "know about" combining characters, so even if the
application does support Unicode, if an older font is being used to
display the text it may render incorrectly anyway. In many cases the
problem(s) can be solved by simply choosing a different font within
the application (to the extent that the application allows you to do
so). E.g. Arial Unicode MS, although not a terribly attractive font,
is generally speaking a good bet standards-wise. Also try e.g.
Gentium.
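The distinction between combining and precomposed characters is defined by Unicode normalization, independently of any font. A small Python sketch (and note that the subdotted-vowel-plus-tone-mark combinations have no fully precomposed code point at all, which is exactly why font support for combining marks matters here):

```python
import unicodedata

# "a" + COMBINING CIRCUMFLEX (U+0302) vs the precomposed "â" (U+00E2)
decomposed = "a\u0302"
precomposed = "\u00e2"
print(decomposed == precomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Subdotted o + acute tone mark has no fully precomposed code point,
# so even NFC must leave a combining mark for the font to position:
nfc = unicodedata.normalize("NFC", "o\u0323\u0301")
print([unicodedata.name(c) for c in nfc])
# ['LATIN SMALL LETTER O WITH DOT BELOW', 'COMBINING ACUTE ACCENT']
```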
Very broadly speaking, a few rules of thumb:
- If Unicode characters come out as ordinary question marks (and stay
question marks afterwards, regardless of which font is used) that
typically means the application doesn't support Unicode and the
information about what character it was has been "lost".
- If you see hollow squares (or solid black squares/rectangles, or
black diamonds with question marks), that usually means the
application does support Unicode and the text is encoded correctly
(meaning, the information about what character it is is retained even
though you can't see the character), but the font does not contain
the desired character (a far more "minor" problem).
- Likewise, if e.g. a combining "^" after an "a" simply comes out as
"a^" instead of "â", that most likely means the application *does*
correctly support Unicode and that the character is encoded
correctly, but that a 'smarter' font needs to be selected.
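These rules of thumb come down to one question: did the underlying code points survive? A Python illustration of the "question mark" case, where the data really is lost (round-tripping through a legacy 8-bit encoding is one typical cause, used here as a hypothetical example):

```python
# Forcing Unicode text through a legacy 8-bit encoding is one typical
# way characters become literal question marks for good:
s = "\u1ecd\u0301"  # o with dot below + combining acute accent
lossy = s.encode("latin-1", errors="replace").decode("latin-1")
print(lossy)         # '??' -- the original characters are unrecoverable
print(lossy == s)    # False: the data itself was lost
# By contrast, hollow squares or "a^" rendering leave s intact in
# memory; only the font fails, so selecting another font fixes it.
```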
> The second issue is connected with the internet. Netscape, Opera, and
> Firefox can all display Unicode texts without tone marks. But a
> subdotted vowel that is combined with a tone mark is displayed as TWO
> separate characters. That is why some Africans prefer to present
> texts of their language that involve such combined characters either
> as PDF or images (screenshots).
Likewise, this is a problem with the fonts themselves, not with
Firefox/Netscape/Opera etc. In fact, apart from giving you (and the
web page designers) options to display text using a different font,
there is basically nothing further that *could* be done by the
developers of those applications even if they wanted to, because an
application can't actually "tell" if a selected font doesn't support
combining characters properly; the application has "done all it can",
so to speak. (IMO Firefox actually already goes pretty far in trying
to get things to display correctly, e.g. checking if any given
character has a representation in the current font and choosing a
different font for that character if it does not, unlike Internet
Explorer which just displays hollow squares). (I do think it is a
problem that Microsoft 'in this day and age' doesn't try to
distribute a decent general Unicode font with Windows, but let's hope
they make an effort with the upcoming Vista.)
> Each time I ask around or try to find out a lasting solution, I
> always get the same answer: the solution lies with the font designer;
> he should produce pre-composed characters, for example: "vowel +
> accent + a dot under the vowels".
> Is this answer wrong? Or has a Unicode based solution already been
> found?
It's partially correct in that the solution lies with the font
designers, but it's not correct that they should produce pre-composed
characters (as these will be non-standard). What they should do is
design the fonts to 'know how to' correctly combine the special
"combining characters" with the preceding characters. (This is the
"correct" solution, i.e. to stick to using the standard "combining
characters" - it is Unicode-based and already supported by Unicode
and basically by all modern font rendering *systems*.)
In my view, it's very important to stick to using Unicode standard
characters / combining characters. If some combined characters don't
display properly in certain circumstances internally, then so long as
the output/products you produce look correct (depending on how you
produce and distribute the results), and it doesn't impede editing,
then it's really only a "cosmetic issue" at that point. Using the
Unicode private use area to create new (non-standard) precomposed
characters is almost like creating your own proprietary font (= not a
good idea). It's still 'early days' with Unicode, relatively
speaking, so there are still a lot of 'teething problems' especially
with decent font support and so on, but I think one should stick it
through; within the next several years we are going to be seeing more
and more (and better) fonts that correctly support e.g. combining
characters, and so long as you have stuck to the existing standards
then basically the most you'll have to do is 'select a different
font'. However, if you've encoded your data with custom 'private use
area' pre-combined characters, you'll be stuck with the proprietary
fonts you designed. You'll also not be able to properly interchange
(e.g. copy and paste) data with other Unicode apps (e.g. Word,
OpenOffice etc.) without doing custom conversions either.
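One way to see why private use area characters travel so badly (a Python sketch; U+E001 is a made-up custom code point standing in for a hypothetical PUA "precomposed" character):

```python
import unicodedata

standard = "o\u0323\u0301"  # o + combining dot below + combining acute
custom = "\ue001"           # hypothetical PUA "precomposed" character

# Standard characters carry their identity with the data itself:
print([unicodedata.name(c) for c in standard])
# ['LATIN SMALL LETTER O', 'COMBINING DOT BELOW', 'COMBINING ACUTE ACCENT']

# A PUA code point has no defined name or properties; its meaning
# lives only in your proprietary font, so any other Unicode app
# (Word, OpenOffice, ...) sees only an anonymous code point:
try:
    unicodedata.name(custom)
except ValueError:
    print("U+E001 means nothing outside your own font")
```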
(If typesetting the Unicode stuff is problematic, I still think it's
generally better to stick to Unicode for working with and storing
your data, and only do a bit of custom non-standard munging of the
data and/or fonts for the typesetting process if you have to.)
- David
---
http://tshwanedje.com/
TshwaneDJe Human Language Technology
YAHOO! GROUPS LINKS
* Visit your group "lexicographylist" on the web.
* To unsubscribe from this group, send an email to:
lexicographylist-unsubscribe at yahoogroups.com
* Your use of Yahoo! Groups is subject to the Yahoo! Terms of Service.
--------