TECH: Formats

A. Vine avine at ENG.SUN.COM
Mon May 10 23:11:39 UTC 1999


James,
Can I call you Pollyanna? ;-)  just kidding.

James E. Clapp wrote:
>
> In the hope that Mark and Andrea, at least, are still interested, I'd like
> to adduce a number of comments and questions inspired by their exchange.  I
> hope that list members who think this is way off-topic (or just way
> tedious) will delete this and not hold it against me.

Interested?  It's what they pay me for!  ;-] (See what a week's vacation will do
to a brain?)

>
> 1.  I see that Andrea is right in saying that certain coding decisions are
> out of our control.  I typed some test messages to myself with a few upper
> ASCII characters (e acute, c cedilla, and the like), and sent them *not*
> using the "quoted printable" MIME option.  When I looked at the source code
> behind the messages upon receipt, they sometimes said
> "X-MIME-Autoconverted: from 8bit to quoted-printable by ..." [identifying a
> server at my ISP], and sometimes "X-MIME-Autoconverted: from 8bit to base64
> by ..." [likewise].  I haven't a clue how my ISP's computers decide what
> coding to use when, but I know they don't ask for my advice.

Their server likely has configuration settings that govern this.  And even if
they haven't set things to be encoded into base64, a forwarding agent (a relay
program) along the way will sometimes change the encoding.

(FYI, minor tech nitpick: there is no "upper ASCII".  ASCII is 7-bit, covering
0-127.  Latin1 and ISO-8859-1 (and, loosely, "ANSI") refer to the 8-bit
character set that most of us are using, which includes the upper area you are
referring to.)
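
To make the 7-bit point concrete, here is a minimal sketch in Python (assuming
Python 3; the helper name is_ascii is made up for this illustration).  It checks
whether each character fits in the 0-127 range and shows the single byte that
ISO-8859-1 uses for e acute and c cedilla:

```python
# Illustration only: ASCII stops at 127; e acute and c cedilla sit in the
# upper half (128-255) of ISO-8859-1 (Latin-1).

def is_ascii(text):
    """Return True if every character fits in 7 bits (0-127)."""
    return all(ord(c) < 128 for c in text)

for ch in ("e", "\u00e9", "\u00e7"):
    code = ord(ch)
    latin1_byte = ch.encode("iso-8859-1")  # one byte per character
    label = "ASCII" if is_ascii(ch) else "not ASCII"
    print(repr(ch), code, label, latin1_byte.hex())
```

Anything that is_ascii() rejects is a character that a strictly 7-bit mail
system cannot carry without some form of encoding.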

>
> 2.  My e-mail program offers me the following choice of settings:  "Send
> messages that use 8-bit characters as is (does not work well with some mail
> servers)" and "Send messages that use 8-bit characters using the 'quoted
> printable' MIME encoding (does not work well with some mail or discussion
> groups readers)."  Which should I choose?  (In light of item 1 above, it may
> not make much difference.)

Use quoted-printable.  It is safe for 7-bit systems.
Only choose 8-bit if you have already experienced difficulty with transmitting
quoted-printable.  Incidentally, Asian languages should not be encoded using
quoted-printable, so I'm not sure how Netscape Messenger (which is the email
client James is referring to here) interprets that selection in that case.  I
suppose I can ask them.
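
For the curious, this is roughly what the quoted-printable option does to an
8-bit character.  The sketch below uses Python's standard quopri module, and it
assumes the message is in ISO-8859-1; it is an illustration of the encoding, not
a description of what Netscape Messenger does internally:

```python
import quopri

# "Cinema" with an e acute, as ISO-8859-1 bytes: the accented letter is the
# single 8-bit byte 0xE9.
raw = "Cin\u00e9ma".encode("iso-8859-1")

# Quoted-printable rewrites each 8-bit byte as '=' plus two hex digits.
encoded = quopri.encodestring(raw)

# Every byte of the encoded form now fits in 7 bits.
seven_bit_safe = all(b < 128 for b in encoded)

# Decoding restores the original bytes.
round_trip = quopri.decodestring(encoded)

print(encoded, seven_bit_safe, round_trip == raw)
```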

>
> 3.  I guess we can avoid the issue by simply not using such characters
> (which is the choice I made in a recent letter to the list referring to
> "Cahiers du Cinema" without an acute accent in "Cinema"), but in
> discussions of language this is a somewhat unhappy compromise.  Mark, how
> would you like to see such characters dealt with?

Even though I'm not Mark, I have a suggestion.  You can follow the letter with
the diacritic, e.g. Cine'ma or re'ussi or e^tre or franc,ais or man~ana or
u"ber.  For German, folks usually use a following 'e' in place of an umlaut,
e.g. ueber.  The French emails I receive just omit the diacritics entirely.
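
If you would rather not do this by hand, the convention is easy to automate.
The following is only a sketch, assuming Python 3: the MARKS table and the
function name ascii_fallback are invented for this example, and the table covers
just the handful of marks mentioned above.

```python
import unicodedata

# Hypothetical mapping from Unicode combining marks to the ASCII stand-ins
# used in the convention described above.
MARKS = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis / umlaut
    "\u0327": ",",   # cedilla
}

def ascii_fallback(text):
    """Write each accented letter as base letter + ASCII mark."""
    # NFD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(MARKS.get(ch, ch) for ch in decomposed)

for word in ("r\u00e9ussi", "\u00eatre", "fran\u00e7ais", "ma\u00f1ana", "\u00fcber"):
    print(ascii_fallback(word))
```

Marks that are not in the table are passed through unchanged, so the output is
only guaranteed to be plain ASCII for the diacritics listed.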

>
> 4.  Given the ubiquity of diacritical marks in English writing and of HTML
> on the Web, shouldn't all employers be pressured to get e-mail software
> that, at the very least, can handle these things?  Surely we're past the
> time when every bit was so precious that institutional software could be
> expected only to handle the 128 basic ASCII characters that can be
> represented by seven bits.  (And a little HTML formatting--in
> moderation!--would go a long way toward making a long message like this
> less visually daunting.)

Probably most "employers" of the large corporate variety have modern email
systems.  But small firms (which are not outsourcing to ISPs), educational
institutions, and the like don't have the denaro (Italian).  If they are in an
English-speaking environment (or Hawaiian or Swahili or some other language
covered by ASCII), they may not feel the pressure.

>
> 5.  Personally, I don't understand why all communications software doesn't
> just use Unicode, which as I understand it handles everything from Arabic
> to Thai and beyond--including IPA.  My e-mail program includes two versions
> of Unicode as coding options (UTF-7 and 8); I have no idea what would
> happen if I selected them.  But wouldn't you think that institutions in the
> language business (including all institutions of higher learning, for a
> start)--not to mention all corporations wishing to do business
> internationally--would flat out refuse to buy any e-mail software that
> fails to support this multilingual coding?

Many email clients support this encoding.  I can't imagine a server could get
away with not supporting it.  But as for why we all can't use it, well, how much
time have you got?  It is not a trivial change from the established character
sets in wide use today to Unicode.

Try sending out your email in UTF-8, let folks know that's what you're doing,
and see if anyone responds that they couldn't read your email.  (FYI, for ASCII
chars, UTF-8 and ASCII are identical.  In other words, ASCII is a bit-for-bit
subset of UTF-8.)
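
A quick way to convince yourself of the subset claim, sketched in Python 3 (the
variable names are just for this illustration).  Plain ASCII text produces the
same bytes either way; only the accented character differs between UTF-8 and
ISO-8859-1:

```python
# Pure ASCII text: the UTF-8 bytes and the ASCII bytes are identical.
text = "Cahiers du Cinema"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# With an accent, the encodings diverge: UTF-8 uses two bytes (C3 A9) for
# e acute, while ISO-8859-1 uses the single byte E9.
accented = "Cin\u00e9ma"
utf8_accented = accented.encode("utf-8")
latin1_accented = accented.encode("iso-8859-1")

print(same, utf8_accented.hex(), latin1_accented.hex())
```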

>
> 6.  In the meantime, for those who get strange codes whenever somebody
> tries to send a character with a diacritic, there must be tables that
> correlate the codes used in various frequently used systems
> (quoted-printable MIME, etc.) with the characters they represent.  Perhaps
> someone has access to such tables in a form that could easily be sent out
> to the list and printed out.  (Obviously, the characters represented by the
> codes would have to be *described* in such a table, not simply reproduced
> as they actually look.)

Well, they sort of exist.  In quoted-printable, an encoded character consists of
an equals sign followed by the two hexadecimal digits of its byte value.  The
hex value corresponds to the 8-bit (or less) character from the character set in
question.  In other words, if an email comes in with a MIME header stating that
it is in ISO-8859-2 and encoded in quoted-printable, and you then see the
character sequence '=BA', it refers to 's' with a cedilla (or comma below).  For
charts and explanations of the ISO-8859 character sets, the best Web resource is
Roman Czyborra's Web site:

http://czyborra.com/charsets/iso8859.html
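
If you have Python handy, you can also build the "described" table yourself.
This is a minimal sketch using the standard quopri and unicodedata modules; the
function name describe_qp is made up here, and you have to supply the character
set named in the message's MIME header:

```python
import quopri
import unicodedata

def describe_qp(encoded, charset):
    """Decode quoted-printable bytes and describe each character by name."""
    raw = quopri.decodestring(encoded)   # '=BA' becomes the byte 0xBA
    text = raw.decode(charset)           # interpret bytes in that charset
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text]

# The '=BA' example, read as ISO-8859-2:
for ch, name in describe_qp(b"=BA", "iso-8859-2"):
    print(repr(ch), name)
```

The same bytes decoded with a different charset give a different character,
which is why the MIME header matters as much as the hex code.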

Table lookup won't work for base64.  Here's a description of it in case you want
to know why:

Base64 encodes the entire sequence of bytes as 6-bit values, with fixed line
lengths (every 3-byte sequence is transformed into a 4-character sequence, each
character standing for one 6-bit value).  Each 6-bit value is an index into a
specific table of 64 characters, hence the name (values 0-25 are A-Z, 26-51 are
a-z, 52-61 are 0-9, 62 is + and 63 is / ).  Lines are <=78 characters, including
carriage-return/linefeed (hex 0D and 0A), and the end is padded with 0-2 '='
characters as needed.
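
The 3-bytes-to-4-characters step can be sketched in a few lines of Python.  This
is an illustration only: the names ALPHABET and b64_encode are invented for the
example, line wrapping is left out, and in practice you would simply call the
standard base64 module (used below to check the result).

```python
import base64

# The 64-character table: A-Z, a-z, 0-9, then + and /.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def b64_encode(data):
    """Encode bytes as base64 text (no line breaks)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)                  # 0, 1 or 2 missing bytes
        n = int.from_bytes(chunk + b"\x00" * pad, "big")  # 24-bit number
        # Slice the 24 bits into four 6-bit table indexes.
        quad = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            quad[4 - pad:] = "=" * pad        # mark the padding with '='
        out.append("".join(quad))
    return "".join(out)

sample = "Cin\u00e9ma".encode("iso-8859-1")
print(b64_encode(sample))
print(b64_encode(sample) == base64.b64encode(sample).decode("ascii"))
```

Because each output character carries bits from two neighbouring input bytes, no
character-by-character table can map base64 text back to letters.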

>
> 7.  Now that Andrea has taught us how to recognize uuencoded text when we
> see it, does anyone know where a Windows utility to decode it can be
> obtained?  (Especially for free; otherwise I'm not that interested.)

WinZip used to allow you to download a demo version, and I believe it handles
uuencoded text.  It's not very expensive if you decide to buy.  Otherwise, try
doing a search on the Web.

>
> 8.  Another big annoyance with e-mail is text wrapping.  Is there a way to
> minimize the occurrence of that awful alternating long-line/short-line text
> in mail that one receives and mail that one inflicts on others?  My e-mail
> program has a setting that says "Wrap long lines at ___ characters"; I must
> pick a number from 0 to 99999.  What choice makes the most sense?  (For the
> purpose of sending this letter, I have it set at 75.)
>
> 9.  Finally, I note that this is not the only list where concern and
> annoyance about e-mail incompatibility have been discussed.  I just joined
> something called the TechnoLawyer Discussion List, and one of the first
> things I received was a posting by the manager of the list that included
> this:

I hope you explained to this person that there are email standards which have
been around much longer than the Web or its standards?

Feel free to forward my other note about email standards to this person if
necessary.

>
> Quote:
>
> Given the proliferation of e-mail, it's only a matter of time until the
> private sector realizes that the world needs an e-mail standards
> organization with teeth similar to the one that exists for the Web.  Today,
> dozens of companies develop e-mail software (I include HotMail and other
> Web-based e-mail services in this group).  I'm all for competition -- but
> not without standards.  I hate the fact that some e-mail clients can
> accommodate HTML e-mail and others cannot.  The same goes for MIME and John
> Lederer's beloved LDAP (a very cool technology).  We must make sure that
> everyone's e-mail client can accommodate new features and technologies.  A
> standards organization may cause some consolidation in the e-mail market,
> but just think of the benefit -- seamless communication regardless of
> origin or destination.  And no hard returns! ;-)
>
> End Quote.
>
> James E. Clapp

Best,
Andrea

--
Andrea Vine
Sun Internet Mail Server i18n architect
avine at eng.sun.com
Remember: stressed is desserts spelled backwards.



More information about the Ads-l mailing list