Charsets in email - technical explanation

Wed Mar 8 20:35:28 UTC 2000

All,

Here is how characters work:

I won't get into how keystrokes reach the software.  Suffice it to say that,
based on the combination of keys pressed, software turns the keyboard data into
byte values.  Those values are what is actually transmitted in an email
message.  What is displayed on your screen is a _local_ interpretation of those
byte values.  That is, your own machine interprets the byte values and hunts
for a font glyph to display, on your machine only.

For example, Mike gave us a table of "extended ASCII", which is often used
loosely as another name for ISO-8859-1.  What the table actually shows is DOS
codepage 437.  When you hold down the <ALT> key and type 3 digits on the number
pad of a Windows machine, you get DOS codepage values.  When you hold down the
<ALT> key and type 4 digits (that is, with a leading zero), you get Windows
codepage values.  Most likely the people on this list will produce Windows
codepage 1252.  Windows codepage 1252 is a proprietary extension of ISO-8859-1
that adds printable characters in the 128-159 range.
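
To make the 128-159 difference concrete, here is a minimal Python sketch.  The
sample bytes and variable names are my own choices for illustration; the codec
names "cp1252" and "iso-8859-1" are the ones Python's standard library provides.

```python
# Bytes in the 128-159 range: printable in Windows-1252, but C1 control
# codes in ISO-8859-1.  The three sample values are arbitrary picks.
sample = bytes([0x80, 0x93, 0x94])

as_cp1252 = sample.decode("cp1252")      # euro sign, left and right curly quotes
as_latin1 = sample.decode("iso-8859-1")  # three non-displayable control codes

# From 160 upward the two charsets assign the same characters.
upper = bytes(range(160, 256))
same_above_159 = upper.decode("cp1252") == upper.decode("iso-8859-1")

print(repr(as_cp1252), repr(as_latin1), same_above_159)
```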

So, while Mike will see i-acute in DOS, with an underlying value of 161, someone
viewing the same value on a machine with an ISO-8859-1 display interpretation
will see an upside-down exclamation point.  But the reason many of us can see
Mike's table correctly (including me on a Unix machine) is that Windows
translated the DOS codepage values, that is, the byte values, into the
corresponding Windows 1252 values.  It also displays correctly because of the
following mail header:

Content-type: text/plain; charset=iso-8859-1

This enables my email client software (Netscape Messenger) to properly interpret
the byte values into ISO-8859-1 display values (remember that 1252 is an
extension of 8859-1, that is, it has the same characters at the same byte
values, but has a few additional characters not found in 8859-1 in an area
8859-1 has reserved for non-displayable controls).
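
The byte value 161 mentioned earlier can be checked directly.  This is only a
sketch in Python using its built-in "cp437" and "iso-8859-1" codecs; the
variable names are made up for the illustration.

```python
# One byte value, two local interpretations.
raw = bytes([161])

dos_view = raw.decode("cp437")          # i-acute (U+00ED)
latin1_view = raw.decode("iso-8859-1")  # inverted exclamation mark (U+00A1)

# The byte on the wire never changed; only the interpretation did.
print(hex(ord(dos_view)), hex(ord(latin1_view)))
```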

Does this make any sense?  Mike's keyboard data was actual DOS codepage values,
which his Windows system translated into 1252 codepage values, which is how the
data were stored.  Outside the 128-159 range, the 1252 values are identical to
ISO-8859-1 values.  The email client software sent the email with the above
header, labeling the values as ISO-8859-1 values, to tell other email clients
how to interpret the data for display.
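
The chain just described (DOS value, translated and stored value, labeled
message) can be sketched in Python.  This is an illustration under my own
assumptions, not what Windows or any particular mail client actually runs: the
single-character body is hypothetical, and the message is built with the
standard library's email package.

```python
from email.message import EmailMessage

dos_bytes = bytes([161])             # i-acute as typed, in DOS codepage 437
text = dos_bytes.decode("cp437")     # interpret the keyboard data
stored = text.encode("cp1252")       # translate to the Windows 1252 value

# For this character the 1252 byte equals the ISO-8859-1 byte (237, hex ED).
same_as_latin1 = stored == text.encode("iso-8859-1")

# Label the body so that receiving clients know how to interpret the bytes.
msg = EmailMessage()
msg.set_content(text, charset="iso-8859-1")

print(stored, same_as_latin1, msg.get_content_type(), msg.get_content_charset())
```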

Now, as for quoted-printable: it has nothing to do with the charset.  Text in
any charset can be transformed into quoted-printable, which essentially
replaces each 8-bit byte value (one with the high-order bit set, that is, a
value > 127) with an equals sign followed by its two-digit hexadecimal value.
So the i-acute character, value 237 in ISO-8859-1, becomes =ED.  (A literal
equals sign is itself encoded, as =3D.)
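
A short sketch with Python's standard quopri module shows the transformation
and its independence from the charset.  The UTF-8 line is my own addition, there
only to show that the same mechanism applies to bytes from a different charset.

```python
import quopri

# i-acute as a single ISO-8859-1 byte (value 237, hex ED).
latin1_qp = quopri.encodestring("\u00ed".encode("iso-8859-1"))

# The same character in another charset is just different bytes, encoded
# by exactly the same rule.
utf8_qp = quopri.encodestring("\u00ed".encode("utf-8"))

# A literal equals sign has to be escaped too.
equals_qp = quopri.encodestring(b"a=b")

# Decoding gives back the original byte values, whatever they meant.
round_trip = quopri.decodestring(latin1_qp)

print(latin1_qp, utf8_qp, equals_qp, round_trip)
```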

As for text in the Subject line, that goes through another process which you
probably don't want to know about.
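
For the curious only: header text uses the "encoded-word" form from RFC 2047,
which carries the charset name inside the header value itself.  A minimal
sketch with Python's email.header module, using a made-up subject word:

```python
from email.header import Header, decode_header

# Hypothetical subject containing e-acute, labeled as ISO-8859-1.
subject = Header("caf\u00e9", charset="iso-8859-1").encode()

# Decoding yields the raw bytes plus the charset needed to interpret them.
raw_bytes, charset = decode_header(subject)[0]
recovered = raw_bytes.decode(charset)

print(subject, charset, recovered)
```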

But, there are some glitches.  When the email passes through various mail
servers, headers can get added or modified, and data can be transformed.  Even
though you may set your mail to send 8-bit, my server will transform it into
quoted-printable (or possibly base64) as evidenced by the following header:

X-MIME-Autoconverted: from 8bit to quoted-printable by mailserver.Sun.COM id
12345
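
Such a conversion changes only the transfer encoding, not the underlying byte
values.  A sketch in Python, assuming the same one-byte i-acute body used
above; it says nothing about how any particular mail server implements the
conversion.

```python
import base64
import quopri

original = "\u00ed".encode("iso-8859-1")   # the 8-bit body as sent

as_qp = quopri.encodestring(original)      # quoted-printable form
as_b64 = base64.b64encode(original)        # base64 form

# Either form decodes back to the same byte value.
qp_back = quopri.decodestring(as_qp)
b64_back = base64.b64decode(as_b64)

print(as_qp, as_b64, qp_back == b64_back == original)
```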

Some clients are old and can't handle certain types of headers and formatting
associated with the MIME standard.  These clients can mangle information, and
will often show quoted-printable text undecoded, as is.  Sometimes people quote
a message, and their email client ignores the headers of the quoted message,
and so does not pass on the information necessary to interpret the quoted
content.  And some listservs mangle data.
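
Two of these failure modes can be imitated in a few lines of Python.  The
sample body and the choice of codepage 437 as the "wrong" charset are
assumptions made for the illustration.

```python
import quopri

wire = b"caf=E9"   # hypothetical body: "cafe" with e-acute, quoted-printable

# An old client that ignores MIME shows the encoded text as is.
shown_undecoded = wire.decode("ascii")

# A MIME-aware client decodes, then applies the labeled charset.
decoded = quopri.decodestring(wire)
shown_correctly = decoded.decode("iso-8859-1")

# If the charset label is lost, the client guesses, possibly wrongly.
shown_wrong = decoded.decode("cp437")      # ends in Greek capital theta

print(shown_undecoded, shown_correctly, shown_wrong)
```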

Had enough? Just remember, what you see is not necessarily what others get.
--
Andrea Vine, avine at eng.sun.com, iPlanet i18n architect
Guilty feet have got no rhythm.
-- George Michael