Different encodings of Cyrillic - problems

Alexandre Bougakov bougakov at MAIL.RU
Wed Jul 5 16:28:40 UTC 2000


Hello, List,

I have received lots of questions about the different encodings of
Cyrillic characters, which make our lives harder than they should be.
I see that the problem is more serious than I thought, so I will try
to summarize everything I know about it. Maybe it will help someone
understand and solve their problems with mail agents.

The problems began more than 30 years ago, when the first standard
for representing characters on computers was developed. It is called
ASCII (and is often loosely referred to as "ANSI").

It allowed only 128 characters in the character set: upper- and
lowercase Latin letters, digits and punctuation marks. The first
computer networks were developed for American military purposes, and
nobody thought that 20 years later computers would be used to write
email messages, web pages, newsgroup articles etc. in other languages.
Just imagine - was it possible in the 60s, at the height of the cold
war, that some Pentagon general would suggest enhancing the American
military system so that it could display Russian or Chinese
characters?

The Russian army also developed its own 7-bit (128 characters)
standard, called KOI-7, which described the positions of punctuation
marks, digits and upper- and lowercase RUSSIAN characters - and no
Latin ones. As you can see, the military think the same way everywhere
on this planet.

After some pressure from European governments, companies and users,
an 8-bit extension of this standard was developed. Its character set
contained 256 characters, including special Central European, Nordic,
French and German characters, and also the symbols used to draw
"tables" on the screen in old MS-DOS programs. And again there was no
place for Arabic, Russian, Hebrew, Chinese, Korean and other
characters.

Russian programmers found a trick which made it possible to use
Cyrillic characters on computers. They threw the Central European
characters out of the 8-bit character set and replaced them with
Russian ones. The problem was that the users of different platforms
did it in different ways. MS-DOS programmers threw out one group of
symbols from the range 128-255, UNIX users a second group, and Mac
users a third. Since that time MS-DOS programs use CP-866, UNIX (and
Linux) uses KOI8-R (and KOI8-U in Ukraine), and the Macintosh uses
CP-10007. And ISO 8859-5, accepted by the International Organization
for Standardization, was hardly used at all (that is really funny,
isn't it?).
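
To make the difference concrete, here is a minimal sketch in Python
that prints the bytes of one Russian word in each of these encodings.
It assumes that Python's bundled codecs cp866, koi8_r, koi8_u,
mac_cyrillic and iso8859_5 correspond to the encodings named above;
the name byte_table is made up for this illustration.

```python
# Assumption: these Python codec names match the encodings in the text
# (mac_cyrillic standing in for CP-10007).
CODECS = ["cp866", "koi8_r", "koi8_u", "mac_cyrillic", "iso8859_5"]


def byte_table(word, codecs=CODECS):
    """Return {codec name: space-separated hex bytes of the encoded word}."""
    table = {}
    for name in codecs:
        raw = word.encode(name)
        table[name] = " ".join(f"{b:02x}" for b in raw)
    return table


if __name__ == "__main__":
    # The same six letters, a different byte sequence on each platform.
    for name, hex_bytes in byte_table("Привет").items():
        print(f"{name:13} {hex_bytes}")
```

Every value above 7f in the output is a position that each platform
assigned in its own way. KOI8-R and KOI8-U agree on the Russian
letters, so those two rows come out the same.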

When Microsoft developed Windows for Workgroups, it introduced a new
encoding called Windows-1251 (as well as Windows-1250 for Central
Europe), because Windows required many new symbols not found in the
existing encodings (for example, the paragraph sign or
<<double quotes>>).

Finally, in Windows 95 OSR and Windows NT Microsoft started to use the
Unicode encoding (http://www.unicode.org). It is a new 16-bit encoding
which has enough room for all possible characters on this planet
(about 65 thousand positions are available in this character set) and
makes it possible to write documents in many different languages at
the same time. But the problem is that there are not enough programs
which support it.
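
As a small illustration, the Python sketch below checks whether a
string that mixes Russian, Chinese and English fits into a given
encoding. The sample string and the helper name fits are assumptions
made for this example, and the 16-bit form is taken to be Python's
utf_16_le codec.

```python
# Hypothetical sample: three scripts in one string.
MIXED = "Привет, 你好, hello"


def fits(text, codec):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for codec in ("koi8_r", "cp1251", "utf_16_le"):
        print(codec, fits(MIXED, codec))
    # Each character of this sample takes one 16-bit unit (two bytes).
    print(len(MIXED.encode("utf_16_le")), "bytes for", len(MIXED), "characters")
```

The 8-bit Cyrillic encodings reject the string because they have no
positions for the Chinese characters, while the 16-bit encoding stores
all of it.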

The situation with this new electronic Tower of Babel was partially
resolved by Microsoft's marketing policy and the success of the
Windows + Intel platform. Now about 95% of users in Russia and the
ex-USSR use different versions of Windows and have no problems
exchanging information. Some problems occur only when exchanging
information with publishing houses, most of which use Macs.

But VERY VERY VERY big problems occur with e-mail systems.

Before 1995-1996 there was actually no good support for the Internet
and email in Microsoft Windows (and no cheap Internet access). The
Internet in Russia was used mostly by users of different UNIX clones.
They were proud that the Internet was a "lamer-free zone". And even
now, when about 90% of all Internet users use Windows, most Internet
servers and gateways run on Unix.

Because of this tradition many ISPs and network administrators force
users to use "The One and Only Right Encoding - KOI8-R". Some of them
configure their mail gateways to convert messages from other encodings
to KOI8-R. If the server thinks that a message is in Windows-1251 and
converts it to KOI8-R when it was already in KOI8-R, the recipient
will see only trash instead of Russian characters. Latin characters
will still be readable, because they have the same positions in all
of these encodings.
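
The effect of such a gateway can be reproduced with a short Python
sketch. The function careless_gateway is hypothetical: it only models
the behaviour described above (assume Windows-1251, convert to KOI8-R)
and is not the code of any real mail server.

```python
def careless_gateway(raw):
    """Model of a gateway that assumes Windows-1251 input and emits KOI8-R."""
    text = raw.decode("cp1251", errors="replace")
    return text.encode("koi8_r", errors="replace")


def view_as_koi8(raw):
    """What a recipient sees when the mailer displays the bytes as KOI8-R."""
    return raw.decode("koi8_r", errors="replace")


if __name__ == "__main__":
    original = "Привет, SEELANGS!"
    already_koi8 = original.encode("koi8_r")    # sender already used KOI8-R
    damaged = careless_gateway(already_koi8)    # gateway converts it again
    print(view_as_koi8(already_koi8))           # untouched message
    print(view_as_koi8(damaged))                # Cyrillic garbled, Latin intact
```

The Cyrillic word comes out as a different run of Russian letters,
while the comma and the Latin part pass through unchanged.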

Another problem is that many ISPs and mail gateways do not understand
8-bit (256-character) encodings. They throw away the 8th bit - and the
Russian text becomes unreadable. For example, CompuServe used to do
this. Many old servers in American universities do so too.
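
A minimal Python sketch of what losing the 8th bit does to a message;
the helper name strip_high_bit is an assumption made for this example.

```python
def strip_high_bit(raw):
    """Clear the 8th bit of every byte, as a 7-bit-only gateway would."""
    return bytes(b & 0x7F for b in raw)


if __name__ == "__main__":
    word = "Привет"
    for codec in ("koi8_r", "cp1251"):
        stripped = strip_high_bit(word.encode(codec))
        # Only 7-bit values are left, so the result is plain ASCII.
        print(codec, stripped.decode("ascii"))
```

With KOI8-R the result is a rough Latin transliteration with the case
swapped (the letters of KOI8 were arranged with this in mind); with
Windows-1251 it is just unrelated Latin letters. Either way the
original bytes cannot be restored without guessing.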

The solution was to use a special standard for transferring email
messages, which helps any message to survive when passing through any
mail gateway. This standard is called MIME (Multipurpose Internet
Mail Extensions, RFC 2045-2049).

And  the  header  of  your letter, if your mailer uses MIME, will look
like:

MIME-Version: 1.0
Content-Type: text/plain; charset=KOI8-R
Content-Transfer-Encoding: 8bit

if you are sure that there will be no problems with old servers and
gateways ("Content-Transfer-Encoding: 8bit" shows that no additional
encoding was applied). But if problems can occur, your mailer can
"hide" Russian or East European characters from those old servers and
prevent possible damage. There are two possible ways of doing this:

Your mailer can use the "Base64" method, which converts the WHOLE text
into safe but absolutely unreadable 7-bit text, or "Quoted-Printable",
where ONLY NON-LATIN characters are converted to numbers. In both
cases only your mailer, if it is properly configured, can restore the
original text. If it is old, or does not conform to common standards,
you will see (in the case of Base64) something like:

7sXS18nby8kg28HM0dQ

and, in the case of Quoted-Printable:

Hello,_dear_SEELANGERS_=3A_A_=E2=EE=F2_=FD=F2=EE_=FF_=EF=E8=F8=F3_=EF

So the second one is preferable - even in the worst case the English
text will be readable.
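
Both methods can be tried with the base64 and quopri modules of the
Python standard library. This is only a sketch of the two transfer
encodings applied to a KOI8-R body, not of a complete mailer; the
sample text and the helper names are assumptions for illustration.

```python
import base64
import quopri

# Hypothetical sample message body, encoded as KOI8-R bytes.
TEXT = "Hello, Привет!"
RAW = TEXT.encode("koi8_r")


def to_base64(raw):
    """Encode the whole body; nothing stays human-readable."""
    return base64.b64encode(raw).decode("ascii")


def to_quoted_printable(raw):
    """Encode only the bytes that are not safe 7-bit characters."""
    return quopri.encodestring(raw).decode("ascii")


def from_base64(text):
    return base64.b64decode(text)


def from_quoted_printable(text):
    return quopri.decodestring(text.encode("ascii"))


if __name__ == "__main__":
    print(to_base64(RAW))
    print(to_quoted_printable(RAW))
    # A conforming mailer reverses the step and then applies the charset.
    print(from_quoted_printable(to_quoted_printable(RAW)).decode("koi8_r"))
```

In the Quoted-Printable output the word "Hello," is still visible,
which is the advantage mentioned above.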



Now  let's  talk  about the ways you can force your mailer to send and
receive  messages with Russian characters. If you already can send and
receive  such  messages,  forget  about  this  problem  and close this
message.

If you have trouble reading messages with Russian characters, look at
the message introducing The Bat! mailer that I posted to this list
yesterday. Its header should contain the following lines:

Mime-Version: 1.0
Content-Type: text/plain; charset=KOI8-R
Content-Transfer-Encoding: 8bit

This shows that I sent a plain text message (not HTML mail), which was
created in KOI8-R, without conversion to 7-bit. If you see "Latin-1",
"ISO-????" or something else instead, and/or 7-bit instead of 8-bit,
contact your ISP - it should fix this, because it is its fault. When I
send a message, it passes through my ISP's mail gateway (mail.ru),
then goes directly to the server at CUNY (listserv.cuny.edu), which
delivers it to the members of the list. Neither of them converts mail
in any way - that means that only your ISP is responsible.
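
If your mailer lets you save the raw message, the same check can be
done with Python's standard email package. This is a minimal sketch:
the function name describe and the sample text are assumptions, and
the sample only repeats the three header lines quoted above.

```python
from email import message_from_string

# Hypothetical raw message carrying the three header lines shown above.
SAMPLE = (
    "Mime-Version: 1.0\n"
    "Content-Type: text/plain; charset=KOI8-R\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "message body\n"
)


def describe(raw_message):
    """Return (charset, transfer encoding) as declared in the headers."""
    msg = message_from_string(raw_message)
    charset = msg.get_content_charset()      # lower-cased, None if absent
    # MIME treats a missing Content-Transfer-Encoding header as 7bit.
    cte = (msg["Content-Transfer-Encoding"] or "7bit").strip().lower()
    return charset, cte


if __name__ == "__main__":
    print(describe(SAMPLE))
```

Anything other than koi8-r and 8bit for that message would suggest
that something on the way to you rewrote it.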

If you see that many Russian characters in some emails are replaced
with other Russian characters, mostly in uppercase, the message is
probably being displayed in the wrong encoding: for example, it was
created in Windows-1251 and your mailer shows it as KOI8-R, or it was
created in KOI8-R (or automatically converted to KOI8-R on the way)
and your mailer still thinks that it is in Windows-1251. In the first
case you will see "оПХБЕР!" instead of "Привет!" ("Privet!"); in the
second, "рТЙЧЕФ!". Change the encoding in your mailer, and everything
will be OK. If you cannot, or do not know how to do it, try installing
The Bat! and opening the corrupted message in that mailer. It supports
SEVEN different variants of Cyrillic, and it can help in most cases
when a message was corrupted by conversion.
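
The idea behind such recovery can be sketched in a few lines of
Python: take the garbled text, turn it back into bytes using the
encoding it was wrongly displayed in, and decode those bytes with each
of the other encodings. This is only an illustration under stated
assumptions, not how The Bat! works internally; the list of five
codecs and the function name undo_wrong_charset are made up for this
example.

```python
# Assumption: five Python codecs standing in for common Cyrillic variants.
CODECS = ["koi8_r", "cp1251", "cp866", "iso8859_5", "mac_cyrillic"]


def undo_wrong_charset(garbled):
    """List (displayed_as, actually, text) for every pair of codecs."""
    results = []
    for displayed_as in CODECS:
        for actually in CODECS:
            if displayed_as == actually:
                continue
            try:
                text = garbled.encode(displayed_as).decode(actually)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # this pair cannot explain the garbled text
            results.append((displayed_as, actually, text))
    return results


if __name__ == "__main__":
    for displayed_as, actually, text in undo_wrong_charset("оПХБЕР!"):
        print(f"shown as {displayed_as:12} really {actually:12} -> {text}")
```

A human still has to pick the readable line from the candidates. The
sketch only handles a message that was misread once; text that was
converted several times needs the steps to be repeated.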

There are several programs which can recover corrupted messages that
were converted many times or have lost the 8th bit. Some of them are
shareware (for example MailReader, which is sold in Europe as CP
Tuner 2000) and some are freeware (such as a small and nice plugin for
MS Outlook which can be downloaded from
ftp://ftp.freeware.ru/pub/internet/mail/cyrtr12.zip ).
But if the message was converted only once, it can be recovered
automatically in The Bat!.

And what if you can read Russian characters without any problem, but
the people you are writing to cannot?

If they can usually read Russian messages but cannot read YOUR
emails, it means that they use a properly working mailer and the
problems are caused by you or your ISP. First of all, make sure that
you use KOI8-R to send your messages, not ISO or Mac encoding. If that
does not help, make sure that your messages are sent as
"Quoted-Printable".

Also try changing the SMTP server name in your mailer's preferences.
Try smtp.comail.ru, smtp.mail.ru or smtp.web.de (secure connections
are not allowed, so please do not enable SSL) and look at the result.
It should help.

If your friends or colleagues cannot read Russian messages at all,
ask them to check whether they have a good mailer and Russian fonts
installed. The best solution would be to install MS Outlook Express,
which comes with MS Internet Explorer 5 (the package also includes
several Unicode fonts). It is free.

If you need more information about this subject, look at the
following links:

http://www.computerra.ru/1997/12/2.html,

http://eagle.glasnet.ru/~kazarn/rus/encperv.htm,

http://www.florin.ru/win/articles/mail.html,

http://www.smartlinkcorp.com/w95/utils/cptun/index.html and

http://russia.agama.com/mailreader.

If my advice does not work, email me (and please include a couple of
lines in Russian in your message) and I will try to help you.




Cordially, Alexandre Bougakov <mailto:bougakov at mail.ru>

Student of the sociological faculty of the Higher School of Economics
(http://www.hse.ru/fakultet/sociology/default.html), Moscow, Russian
Federation

My website is http://SocioLink.narod.ru/ (thousands of sociology
related links on the Web - in Russian, Microsoft Internet Explorer 4
or higher is required)

My PGP key ID is 0x97F20C99, Key Fingerprint is C83C 5998 F43A BEB7
70DF B8FC CC5E 960E 97F2 0C99 (PGP version is 6.0.2i)

-------------------------------------

"The Bat!" v.1.44 - лучшая в мире почтовая программа для Windows.

АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЫЬЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщыьъэюя

-------------------------------------------------------------------------
 Use your web browser to search the archives, control your subscription
  options, and more.  Visit and bookmark the SEELANGS Web Interface at:
                http://members.home.net/lists/seelangs/
-------------------------------------------------------------------------


