[Lexicog] Tone languages in Toolbox/Shoebox
neduchi at NETSCAPE.NET
Mon Apr 10 09:53:47 UTC 2006
Maxwell,
You asked: "Or are you needing to represent these vowel + accent
together with a dot under the vowels (as in Yoruba)?"
Yes indeed, that is how it is needed for linguistic work; and (if I
understand the initial poster correctly) that is also the problem the
initial poster is having. I think the solutions already available
within Unicode (i.e. Unicode code points) take care of only a small part
of the problem, namely the vowels without sub-dots.
However, I am actually looking at two issues that converge at the point
of "font" programs.
The first issue is in terms of individual standalone programs that
should handle Igbo texts. I have experimented with different programs:
Concordance, Corpus Presenter, Wordsmith, ToolBox, and Word Sketch
Engine [WSE] (http://www.sketchengine.co.uk/) etc. Only ToolBox and
Word Sketch Engine can handle the subdotted vowels without problem; but
a combination of subdotted vowels with the tone marks is simply an
additional burden and most programs that profess to be Unicode based
simply handle them as TWO separate characters. In other words, the
problem STILL persists. WSE is server-based and is not freely
available. In some programs (that profess to be Unicode based) one
component or the other would give you question marks for the sub-dotted
vowels, while another component of the same program would render the
text as it should. Tone marks have not yet been added!
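The "TWO separate characters" behaviour can be seen at the level of code points. The following is only a minimal Python sketch, assuming the standard `unicodedata` module; the sample strings are illustrations chosen here, not data from any of the programs named above. It suggests that a subdotted vowel with a tone mark stays a two-code-point sequence even after Unicode composition, so a program has to treat the sequence as one unit itself.

```python
import unicodedata

# Three ways of typing "o with dot below and acute accent" (a high-tone ọ).
spellings = [
    "o\u0323\u0301",  # base letter, combining dot below, combining acute
    "o\u0301\u0323",  # the same marks typed in the other order
    "\u00f3\u0323",   # precomposed ó followed by combining dot below
]

# NFC composes as far as Unicode allows and puts the marks in canonical order.
normalized = [unicodedata.normalize("NFC", s) for s in spellings]

for s in normalized:
    print([f"U+{ord(ch):04X}" for ch in s])
# Each line prints ['U+1ECD', 'U+0301']: ọ plus a separate combining acute.
```

All three spellings end up identical after normalization, which is why normalizing text before searching or sorting is usually worthwhile even when no single precomposed character exists.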
The second issue is connected with the internet. Netscape, Opera,
Firefox can all display Unicode texts without tone marks. But a
subdotted vowel that is combined with a tone mark is displayed as TWO
separate characters. That is why some Africans prefer to present texts
of their language that involve such combined characters either as PDF
or as images (screenshots).
Each time I ask around or try to find a lasting solution, I always
get the same answer: the solution lies with the font designer; he
should produce pre-composed characters, for example: "vowel + accent +
a dot under the vowel".
Is this answer wrong? Or has a Unicode based solution already been
found?
Chinedu Uchechukwu
-----Original Message-----
From: Mike Maxwell <maxwell at ldc.upenn.edu>
To: lexicographylist at yahoogroups.com
Sent: Sat, 08 Apr 2006 12:15:56 -0400
Subject: Re: [Lexicog] Tone languages in Toolbox/Shoebox
neduchi at netscape.net wrote:
> My first language, Igbo, is also a tone language. The identified tones
> are High, Low, and Downstep; marked with < ´ >, < `>, and < - >
> respectively. I have not found ANY software that can handle a
> combination of the tone marks with specific letters as single units.
> I have explored some options and found out the following:
>
> (1) Unicode:
> The Unicode angle has been closed. It is no longer possible to add such
> combined symbols/characters as 'new' Unicode characters.
>
> (2) Fonts and Operating Systems:
> You can use specific Unicode-based fonts, like Doulos, CODE2000,
> Gentium etc. to achieve this character combination. It works fine on
> Windows OS with Service Pack 2; Microsoft programs on the system also
> handle the combined characters as single units. But that's not of much
> help to the linguist: other software not produced by Microsoft does not
> work as well.
>
> (3) Font Solution:
> Since the Unicode angle has been blocked, the buck has actually been
> passed to those involved in "Font Design". It is now the font
> developers that are to design the identified characters as
> "pre-composed" characters. A font like "Doulos" should then contain
> these "pre-composed" characters. That's when Shoebox, Toolbox, or any
> other software can handle them the way we linguists would like it.
>
> On the other hand, since Doulos has now been made available under the
> Open Font License
> (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=OFL),
> maybe some font designers might be in the position to add the identified
> characters of African languages as "pre-composed" characters. That
> should be the permanent solution to your problem and MY problem.
I don't have any answer for the original poster's problem, but I am
puzzled by the above discussion of composed vs. non-composed characters.
You are correct that Unicode will probably not be adding pre-composed
characters. That is, while some combinations of diacritic + base
character already exist, combinations beyond those will probably need to
be represented in Unicode as a sequence of two or more characters.
First: I am surprised that you have had problems with representing the
characters in Igbo as pre-composed characters, but maybe I don't
understand what you're saying. There are certainly code points
(characters) in Unicode for pre-composed vowel + grave accent, vowel +
acute accent, and vowel + macron. Or are you needing to represent these
vowel + accent together with a dot under the vowels (as in Yoruba)?
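To make the distinction in the previous paragraph concrete, here is a minimal Python sketch, assuming the standard `unicodedata` module. The helper name `nfc_length` and the choice of the dotted vowels i, o, u are assumptions made for illustration. It checks which combinations compose to a single code point and which do not.

```python
import unicodedata

TONE_MARKS = {"grave": "\u0300", "acute": "\u0301", "macron": "\u0304"}
DOT_BELOW = "\u0323"

def nfc_length(base, *marks):
    """Number of code points left after NFC composition (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", base + "".join(marks)))

# Plain vowel + one tone mark: a precomposed code point exists for each.
plain = {(v, name): nfc_length(v, mark)
         for v in "aeiou" for name, mark in TONE_MARKS.items()}

# Vowel + dot below + tone mark: only the dotted vowel composes,
# and the tone mark stays behind as a separate combining character.
dotted = {(v, name): nfc_length(v, DOT_BELOW, mark)
          for v in "iou" for name, mark in TONE_MARKS.items()}

print(set(plain.values()))   # {1}
print(set(dotted.values()))  # {2}
```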
Second: Unicode-aware software ought to be able to handle
non-precomposed characters just fine. More on this below.
Third: As for adding pre-composed characters to fonts where the
pre-composed characters are not already in Unicode (I believe that's
what you're suggesting in (3)), that is not likely to happen, and would
in my opinion be a bad idea if it did happen. The reason for developing
Unicode is to define a universal (well, global) way to represent
characters in writing systems. Adding pre-composed characters to what
was otherwise a Unicode font would be creating a non-standard, which
would cause problems for anyone trying to read your data who didn't
happen to have your font. Likewise if you added pre-composed characters
to some sort of 8-bit encoding (analogous to all the ISO 8859-N
encodings running around, or even worse, to the proprietary and
undocumented Indic fonts, which are truly a mess).
There is one exception to this: Unicode sets aside an area for
user-defined characters (the Private Use Area, PUA). I believe this is
intended for characters that simply cannot be represented by combining
other Unicode characters--for Klingon, say. Again, while font designers
_could_ use this for characters that are not defined as pre-composed in
Unicode, the designers of such fonts are unlikely to do so, because they
don't want to get caught up in creating and supporting non-standards,
and because as far as they are concerned, any software worth its salt
ought to support un-composed characters.
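If you do receive data and want to know whether it relies on Private Use Area characters, a quick check is possible. This is only a sketch, assuming Python's standard `unicodedata` module; the function names and the sample string are hypothetical. PUA code points carry the general category "Co".

```python
import unicodedata

def is_private_use(ch):
    """True if the code point lies in a Private Use Area (general category Co)."""
    return unicodedata.category(ch) == "Co"

def private_use_chars(text):
    """List the Private Use code points found in text, as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text if is_private_use(ch)]

# Sample: a standard sequence (ọ + combining acute) followed by one PUA character.
sample = "\u1ecd\u0301 \ue000"
print(private_use_chars(sample))  # ['U+E000']
```

Text that passes this check without hits does not depend on font-specific character assignments, so it should remain readable with any font that covers the standard code points.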
So let me come back to the problems you say you're having using Unicode
with non-pre-composed characters. Most Unicode-capable software,
whether produced by Microsoft or not, should be capable of dealing with
non-composed characters. Is there some specific software that doesn't
deal with non-composed characters that you need for linguistic work?
(Besides Toolbox--I'll have to let someone else address how Toolbox
works with this.)
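What "dealing with non-composed characters" means for a program can be sketched roughly as follows. This is a simplified Python illustration, assuming the standard `unicodedata` module; the function name `clusters` and the sample string are made up for the example. Full segmentation rules are given in Unicode Standard Annex #29, which this sketch does not implement.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplified: only marks with a non-zero canonical combining class are
    attached to the preceding character.
    """
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base
        else:
            out.append(ch)     # start a new unit
    return out

sample = "o\u0323\u0301ma"     # made-up string: dotted o with acute, then "ma"
units = clusters(sample)
print(len(unicodedata.normalize("NFC", sample)))  # 4 code points
print(len(units))                                 # 3 units
```

A concordancer or word-list tool that counts, sorts, or matches on such units rather than on raw code points would treat the dotted, tone-marked vowel as a single letter.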
Mike Maxwell