8.390, Disc: Email and accented characters

linguist at linguistlist.org
Wed Mar 19 00:59:36 UTC 1997


LINGUIST List:  Vol-8-390. Tue Mar 18 1997. ISSN: 1068-4875.

Subject: 8.390, Disc: Email and accented characters

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            T. Daniel Seely: Eastern Michigan U. <seely at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Associate Editors: Ljuba Veselinova <ljuba at linguistlist.org>
                   Ann Dizdar <ann at linguistlist.org>
Assistant Editor:  Sue Robinson <sue at linguistlist.org>
Technical Editor:  Ron Reck <ron at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/

Editor for this issue: Ann Dizdar <ann at linguistlist.org>

=================================Directory=================================

1)
Date:  Thu, 13 Mar 1997 01:41:43 -0500
From:  "David F. Stermole" <stermole at chass.utoronto.ca>
Subject:  Re: 8.343, Qs: Email and accented characters

-------------------------------- Message 1 -------------------------------

Date:  Thu, 13 Mar 1997 01:41:43 -0500
From:  "David F. Stermole" <stermole at chass.utoronto.ca>
Subject:  Re: 8.343, Qs: Email and accented characters

Ted Harding has raised a number of interesting questions regarding
email and accented characters. We, H. Allan Gleason, Jr. (Professor
Emeritus, Univ of Toronto), Henry Gleason, and David F. Stermole, as
linguists and programmers, have struggled with character encoding for
almost twenty years. We would like to contribute something to both the
discussion of and the solution to the problem.

Writing email in English has been possible from the very beginning,
because the vast majority of programmers spoke English and they were
the ones who used it. 7-bit ASCII, derived from the typewriter
keyboard with the addition of symbols programmers needed, was used as
the character set. This imposed severe limitations on even the proper
rendering of English.

As computing spread, different standards for encoding other languages
developed. By replacing some of the programmers' symbols, US ASCII was
transformed into German ASCII, French ASCII, etc. While German was
handled almost satisfactorily, French was missing accented capital
vowels and some accented vowels altogether, and neither had proper
European quotation marks (chevrons).

IBM introduced its 8-bit extension of ASCII in the 1980s, but it was a
conglomeration of letters from Western Europe, some Greek letters and
logic symbols for mathematics, and a full set of graphics pieces to
create forms on the screen. Although the European quotation marks were
introduced, even doubling the number of codes from 128 to 256 was not
sufficient to handle French properly. Compromises had been made
again.

When Russian came into the mix, one solution was to keep the old 7-bit
ASCII for English, so that programming could still be done, and to use
the eighth bit to signal that a character was Russian. This meant that
Russian could be mixed with English but not with other languages.
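The eighth-bit scheme can be sketched in a few lines of Python. The
message does not name a particular encoding; KOI8-R is assumed here
only as one well-known example of this kind of scheme, using Python's
built-in `koi8_r` codec. The helper names are hypothetical.

```python
def is_high_bit_set(byte: int) -> bool:
    """True if the eighth (most significant) bit of an 8-bit code is set."""
    return (byte & 0x80) != 0


def can_encode(text: str, codec: str) -> bool:
    """True if every character of text exists in the given character set."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: KOI8-R stands in for the unnamed English/Russian encoding.
data = "mail почта".encode("koi8_r")

# Codes below 128 are plain ASCII; codes with the eighth bit set are Russian.
ascii_part = bytes(b for b in data if not is_high_bit_set(b))
russian_part = bytes(b for b in data if is_high_bit_set(b))

print(ascii_part.decode("ascii"))
print(russian_part.decode("koi8_r"))

# English and Russian mix freely, but a French accented letter has no code.
print(can_encode("mail почта", "koi8_r"))
print(can_encode("é", "koi8_r"))
```

Every byte still carries one character, so the 128 upper codes are
all the room there is for the second alphabet.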

Other efforts have included creating the family of ISO 8859 character
sets (see http://wwwwbs.cs.tu-berlin.de/~czyborra/charsets for
information) to handle various language combinations. However, mixing
English with Greek, Russian, Polish, Slovak, and Serbian in a single
document using these sets remains cumbersome or impossible, primarily
because word processors typically use just a single 8-bit byte to
represent each character. This means that only one of the sets is
normally used for a whole document. (A side issue is the absence of
the Ukrainian G character from the ISO-8859-5 set.)
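The single-byte limitation can be illustrated with Python's standard
codecs for three of the ISO 8859 parts. This is only a sketch: the
byte value 0xE1 and the helper `can_encode` are chosen here for
illustration and do not come from the message.

```python
# One 8-bit code, three different letters depending on the ISO 8859 part.
raw = bytes([0xE1])

for codec in ("iso8859_1", "iso8859_7", "iso8859_5"):
    # Latin-1 gives a-acute, the Greek part gives alpha,
    # the Cyrillic part gives the Cyrillic letter es.
    print(codec, raw.decode(codec))


def can_encode(text: str, codec: str) -> bool:
    """True if every character of text exists in the given character set."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The ordinary Cyrillic G (U+0413) is in ISO-8859-5;
# the Ukrainian G with upturn (U+0490) is not.
print(can_encode("\u0413", "iso8859_5"))
print(can_encode("\u0490", "iso8859_5"))
```

Nothing in the byte itself says which part was meant, which is why a
document normally has to commit to one of them.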

To encode characters from multiple languages, there are two practical
possibilities: use non-printable 8-bit codes to signal a switch from
one character set to another, or use more bits to encode each
character. The former option has not been widely adopted. To handle
the many ISO 8859 sets would require four more bits to distinguish one
from another. Since computers are currently designed to use bits
efficiently in multiples of eight, a 12-bit code would be stored in 16
bits, leaving four bits unused/wasted. Also, no allowance was made for
using accents other than the ones that were included with the unitary
characters.
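The bit arithmetic behind this can be checked with a short sketch. The
message does not say how many ISO 8859 sets it has in mind; the figure
of 16 below is an assumption, being simply the most that four selector
bits can distinguish. The function names are hypothetical.

```python
def selector_bits(n_sets: int) -> int:
    """Bits needed to give each of n_sets character sets its own number."""
    return (n_sets - 1).bit_length()


def padded_to_bytes(bits: int) -> int:
    """Round a bit count up to the next multiple of eight."""
    return -(-bits // 8) * 8


char_bits = 8                       # one ISO 8859 code
extra_bits = selector_bits(16)      # assumption: up to 16 sets to tell apart
total_bits = char_bits + extra_bits
stored_bits = padded_to_bytes(total_bits)
wasted_bits = stored_bits - total_bits

print(extra_bits, total_bits, stored_bits, wasted_bits)
```

With these assumptions the script prints `4 12 16 4`: four selector
bits, a 12-bit code, 16 bits of storage, and four bits left over.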

Sixteen bits allows for 65,536 unique codes. Proponents of Unicode
(see its home page: http://www.stonehand.com/unicode.html) declare
that this number will suffice to handle all of the characters of all
the languages in the world. And this encoding includes a vast array of
extra floating accents. The advent of Unicode encoding now makes
multilingual email possible with the proper software. However, email
even now is still pretty much a 7-bit affair, because many email
gateways on the Internet handle only seven bits at a time. This
requires a conversion of 8-bit text to 7-bit for transmission. This
has been the job of uuencode/uudecode.
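The path from Unicode text to a 7-bit-safe message can be sketched with
Python's standard library. This is an illustration under assumptions,
not a description of any particular mail program: the sample word and
the choice of big-endian 16-bit serialization (`utf-16-be`) are made
here for the example, and `binascii.b2a_uu` stands in for the
uuencode utility.

```python
import binascii
import unicodedata

# "e" followed by a floating (combining) acute accent, U+0301.
word = "cafe\u0301"

# Where a single precomposed character exists, NFC normalization uses it.
composed = unicodedata.normalize("NFC", word)
print(len(word), len(composed))

# Sixteen bits per character: each of these characters takes two bytes.
utf16 = composed.encode("utf-16-be")
print(len(utf16), 2 ** 16)

# uuencode turns arbitrary 8-bit bytes into printable 7-bit ASCII.
# b2a_uu handles at most 45 input bytes per call (one uuencoded line).
line = binascii.b2a_uu(utf16)
print(all(b < 128 for b in line))

# The receiving side reverses both steps.
restored = binascii.a2b_uu(line).decode("utf-16-be")
print(restored == composed)
```

The encoded line is longer than the original bytes, which is the price
of passing through gateways that handle only seven bits at a time.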

This is the rationale for our using Unicode and providing access to a
wide variety of accents. That leaves but one problem -- how to access
this vast array of characters easily. We decided on an individualized
keyboard map for each language or alphabet that the user wishes to
use, up to seven in all. Switching from one keyboard to another is a
simple matter of tapping a function key.
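The idea of switchable keyboard maps can be sketched as follows. This
is not the authors' implementation: the class, the map contents, and
the three sample languages are all hypothetical, and only the limit of
seven maps and the switch-on-a-keypress behaviour come from the
description above.

```python
# Hypothetical keyboard maps: physical key -> character produced.
KEYBOARD_MAPS = {
    "English": {},  # keys produce themselves
    "Russian": {"f": "\u0430", "d": "\u0432", "s": "\u044b"},
    "Greek": {"a": "\u03b1", "b": "\u03b2", "g": "\u03b3"},
}

MAX_MAPS = 7  # the limit mentioned in the message


class Keyboard:
    """Cycles through a small set of keyboard maps (illustrative only)."""

    def __init__(self, maps):
        if not 1 <= len(maps) <= MAX_MAPS:
            raise ValueError("between one and seven keyboard maps expected")
        self.maps = maps
        self.names = list(maps)
        self.index = 0

    @property
    def current(self) -> str:
        return self.names[self.index]

    def switch(self) -> str:
        """Simulate tapping the function key: advance to the next map."""
        self.index = (self.index + 1) % len(self.names)
        return self.current

    def type(self, keys: str) -> str:
        """Translate physical keys through the active map."""
        active = self.maps[self.current]
        return "".join(active.get(k, k) for k in keys)


kb = Keyboard(KEYBOARD_MAPS)
english = kb.type("abg")
kb.switch()
russian = kb.type("fds")
kb.switch()
greek = kb.type("abg")
print(english, russian, greek)
```

Because every map produces Unicode characters, text typed under
different maps can sit side by side in the same document.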

To see how our proposed solution works, visit our Internet site at
http://www.panglot.com.

-

David F. Stermole
e-mail: stermole at chass.utoronto.ca

---------------------------------------------------------------------------
LINGUIST List: Vol-8-390


