Why no Cyrillic?

Thu Feb 5 12:11:55 UTC 2009

Bill Leidy's =D0=BD=D0=B0 =D0 text raises an interesting issue. First things
first: I looked at the header (the hidden encoding information) in Bill's
e-mail. He is sending out Cyrillic-capable mail (MIME-version: 1.0;
Content-type: text/plain; **charset=UTF-8**; format=flowed;
Content-transfer-encoding: **8BIT**; my emphasis). This is proven by the
fact that most of us can read his line Как жаль (Kak zhal').
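
For anyone who would rather check those header fields with a script than by
eye, here is a minimal sketch using Python's standard email module. The
sample message is made up for illustration (it only mimics the fields quoted
above), and the helper name describe_encoding is my own.

```python
import email

# Hypothetical sample message; only the header fields mirror those quoted above.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8; format=flowed\n"
    "Content-Transfer-Encoding: 8BIT\n"
    "\n"
    "Body text goes here.\n"
)


def describe_encoding(raw: str) -> tuple:
    """Return (charset, transfer encoding) as declared in the headers."""
    msg = email.message_from_string(raw)
    charset = msg.get_content_charset()  # lower-cased charset parameter
    # Fall back to 7bit, the default when the header is absent.
    cte = (msg.get("Content-Transfer-Encoding") or "7bit").lower()
    return charset, cte


charset, cte = describe_encoding(RAW_MESSAGE)
print(charset, cte)
```

A message that declares charset=UTF-8 with an 8BIT transfer encoding is
sending raw UTF-8 bytes, which is why it only survives on 8-bit-clean
systems.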

The strange encoding that he quoted (e.g., =D0=B0 represents Cyrillic "а")
is failed UTF-8: UTF-8 bytes left in their quoted-printable wrapping,
usually after the message has hit an e-mail server that cannot handle UTF-8
properly.

From Wikipedia: UTF-8 requires the transmission system to be 8-bit
clean <http://en.wikipedia.org/wiki/8-bit_clean>.
In the case of e-mail this means it has to be further encoded using
*quoted-printable <http://en.wikipedia.org/wiki/Quoted-printable> or
base64 <http://en.wikipedia.org/wiki/Base64>* [my emphasis] in some cases.
This extra stage of encoding *carries a significant size penalty* [by the
standards of yesteryear -RR].
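
To put a rough number on that size penalty, the sketch below measures one
short Cyrillic phrase as raw UTF-8, as quoted-printable, and as base64,
using only Python's standard library. The function name encoded_sizes is my
own, and the figures apply to this phrase only; mostly-ASCII text grows far
less under quoted-printable.

```python
import base64
import quopri


def encoded_sizes(text: str) -> dict:
    """Byte counts for text as raw UTF-8, quoted-printable, and base64."""
    raw = text.encode("utf-8")
    return {
        "utf-8": len(raw),
        "quoted-printable": len(quopri.encodestring(raw)),
        "base64": len(base64.b64encode(raw)),
    }


sizes = encoded_sizes("Как жаль")
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For all-Cyrillic text, quoted-printable roughly triples the size, because
every non-ASCII byte becomes three characters, while base64 adds about a
third.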

For example, the Cyrillic small "a" is Unicode 0430 hexadecimal (Base 16),
which translates to 1072 in Base 10. But e-mail systems can't always send
that kind of data directly. UTF-8 makes the number longer but more
digestible by turning it into two bytes, D0 and B0 in hexadecimal.
Quoted-printable then spells out each byte as an equals sign plus its two
hexadecimal digits, hence =D0=B0 for "a". Read as a single number and
translated back into Base 10, that's a whopping 53424!
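
The arithmetic can be checked in a few lines of Python. This is only an
illustrative sketch with the standard library; the variable names are mine.

```python
import quopri

letter = "\u0430"  # CYRILLIC SMALL LETTER A

code_point = ord(letter)                  # the Unicode number of the letter
utf8_bytes = letter.encode("utf-8")       # the two bytes UTF-8 produces
qp_text = quopri.encodestring(utf8_bytes).decode("ascii")  # e-mail-safe form
as_one_number = int.from_bytes(utf8_bytes, "big")  # both bytes as one integer

print(hex(code_point), code_point)  # 0x430 1072
print(utf8_bytes.hex())             # d0b0
print(qp_text)                      # =D0=B0
print(as_one_number)                # 53424
```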

An e-mail system may fail to interpret that long sequence correctly because
it got sent out by a system that was "8-bit dirty" (see above) or because
the recipient's (a) computer or (b) mail server is not reading the header
correctly. Users themselves can cure conditions (a) and (b). (Many sites,
including GWU's Russianization
site<http://www.gwu.edu/gw-cyrillic/cyrilize.htm>,
explain how.) But =D0=B0-type text came from a server not meant to handle
internationalized mail. So the text gets quoted "as is." По-настоящему
жаль! (A real pity!)

Of course, for "=D0=B0" text important enough (Putin's nuclear codes?) one
could pop the entire text into Word and then write a long macro of find &
replace (it wouldn't take long to reconstruct the entire alphabet sequence
for those familiar with hexadecimal Unicode notation). But I guess you
would have to be pretty desperate to read the mail.
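
A short script is an alternative to a find-and-replace macro. The sketch
below assumes the garbled text really is UTF-8 in quoted-printable form, as
in the =D0=B0 example; the function name decode_qp_utf8 is my own, and other
charsets (KOI8-R, Windows-1251) would need a different final decoding step.

```python
import quopri


def decode_qp_utf8(garbled: str) -> str:
    """Turn '=D0=B0'-style text back into readable characters.

    Assumes the bytes behind the =XX codes are UTF-8. Bytes that do not
    form valid UTF-8 are replaced with U+FFFD instead of raising an error.
    """
    raw = quopri.decodestring(garbled.encode("utf-8"))
    return raw.decode("utf-8", errors="replace")


print(decode_qp_utf8("=D0=9A=D0=B0=D0=BA =D0=B6=D0=B0=D0=BB=D1=8C"))
```

Pasting a garbled passage in place of the sample string should recover the
Cyrillic, provided the sequence of =XX codes was not truncated in transit.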

Rich Robin

On Wed, Feb 4, 2009 at 6:23 PM, Bill Leidy <leidy at stanford.edu> wrote:

> Hello, I'd like to add a few words about problems with Cyrillic in e-mails.
> I get the SEELANGS in digest form, and very often the Cyrillic comes out in
> equal signs and hexadecimal numbers as you see below. I think this has
> something to do with the variety of default encodings people use or perhaps
> how the SEELANGS compiles the digest and chooses an encoding for the entire
> e-mail. Anyway, no matter how I change the character encoding in Mozilla
> Thunderbird, I can't fix the row of hexadecimal into something readable. Как
> жаль!
>
> So, unless I'm doing something wrong on my end, you can see how Cyrillic
> has a tendency to not come out correctly, even on Slavic mailing lists when
> delivered in digest form.
>

-- 
Richard M. Robin, Ph.D.
Director Russian Language Program
The George Washington University
Washington, DC 20052
202-994-7081
~~~~~~~~~~~~~~~~~~~~~~~~~~
Russkiy tekst v UTF-8

-------------------------------------------------------------------------
 Use your web browser to search the archives, control your subscription
  options, and more.  Visit and bookmark the SEELANGS Web Interface at:
                    http://seelangs.home.comcast.net/
-------------------------------------------------------------------------
