Different encodings of Cyrillic - problems
Alexandre Bougakov
bougakov at MAIL.RU
Wed Jul 5 16:28:40 UTC 2000
Hello, List,
I have received lots of questions about the different encodings of
Cyrillic characters, which make our life less easy than it should be.
I see that the problem is more serious than I thought, so I will try
to summarize everything I know about it. Maybe it will help someone to
understand and solve the problem with mail agents.
All the problems began about 30 years ago, when the first standard
for representing characters on computers, called ASCII, was developed.
It allowed only 128 characters in the character set: upper- and
lowercase Latin letters, digits and punctuation marks. The first
computer networks were developed for American military purposes, and
nobody thought that 20 years later computers would be used to write
email messages, web pages, newsgroup articles etc. in other languages.
Just imagine: was it possible in the 60s, during the
cold war, that some Pentagon general would suggest
enhancing the American military system to display Russian or
Chinese characters?
The Russian army also developed its own 7-bit (128-character) standard,
called KOI-7, which described the positions of punctuation marks, digits and
upper- and lowercase RUSSIAN characters - and no Latin ones. Don't you
see that the military think the same way everywhere on this planet?
After some pressure from European governments, companies and users,
an 8-bit standard (usually called ANSI) was developed. It allowed an 8-bit
character set containing 256 characters - including special
Central European, Nordic, French and German characters, and also the
symbols used to draw "tables" on the screen in old MS-DOS programs.
And again there was no place for Arabic, Russian, Hebrew, Chinese,
Korean and other characters.
Russian programmers found a trick which allowed Cyrillic
characters to be used on computers. They threw the Central European characters
out of the ANSI charset and replaced them with Russian ones. But the problem
was that the users of different platforms did it in different
ways. MS-DOS programmers threw out one group of ANSI symbols
from the range 128-255, UNIX users a second one, Mac users a third one.
Since that time MS-DOS programs use CP-866, UNIX (and Linux) uses KOI8-R
(and KOI8-U in Ukraine), and Macintosh uses CP-10007. And ISO 8859-5,
accepted by the International Organization for Standardization, was almost
never used (that is really funny, isn't it?).
When Microsoft developed its Windows for Workgroups, it introduced a new
encoding called Windows-1251 (as well as Windows-1250 for Central
Europe), because Windows required many new symbols not found in the
existing encodings (for example, the paragraph sign or <<double quotes>>).
Finally, in Windows 95 OSR and Windows NT Microsoft started to use the
Unicode encoding (http://www.unicode.org). It is a new 16-bit encoding,
which has enough room for all possible characters on this planet (about 65
thousand positions are available in this charset) and makes it possible to
write documents in many different languages at the same time. But the
problem is that there are not enough programs which support it.
The situation with the new electronic Tower of Babel was partially
resolved by the marketing policy of Microsoft and the success of the
Windows + Intel platform. Now about 95% of users in Russia and the ex-USSR
use different versions of Windows and have no problems exchanging
information. Some problems occur only when exchanging files with publishing
houses, most of which use Macs.
But VERY VERY VERY big problems occur with e-mail systems.
Before 1995-1996 there was actually no good support for the Internet and
email in Microsoft Windows (and no cheap Internet access), so the
Internet in Russia was used mostly by the users of different
UNIX clones. They were proud that the Internet was a "lamer-free zone".
And even now, when about 90% of all Internet users use Windows, most
Internet servers and gateways run on Unix.
Because of this tradition many ISPs and network administrators force
users to use "The One and Only Right Encoding - KOI8-R". And some of them
configure their mail gateways to convert messages from other encodings
to KOI8-R. If the server thinks that a message is in Windows-1251 and
converts it to KOI8-R when it was already in KOI8-R, the recipient
will see only trash instead of Russian characters. Latin
characters will still be readable, because they have the same positions
in all encodings.
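A minimal Python sketch of this double conversion may make it concrete. It only simulates the gateway with the standard codecs `koi8_r` and `cp1251`; the sample text and variable names are made up for the illustration.

```python
# Simulate a gateway that wrongly assumes its input is Windows-1251.
original = "Hello, Привет!"
wire_bytes = original.encode("koi8_r")  # the sender really used KOI8-R

# The gateway "converts" Windows-1251 -> KOI8-R, although the bytes
# were KOI8-R already.
converted = wire_bytes.decode("cp1251").encode("koi8_r")

# The recipient's mailer trusts the charset=KOI8-R header.
seen = converted.decode("koi8_r")
print(seen)  # Hello, рТЙЧЕФ!
```

The Latin part survives because bytes below 128 mean the same thing in both encodings; only the Cyrillic letters are scrambled.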
Another problem is that many ISPs and mail gateways do not understand
8-bit (256-character) encodings. They throw away the 8th bit, and
the Russian text becomes unreadable. For example, CompuServe used
to do this. Many old servers in American universities do so too.
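What such a gateway does to the bytes can be sketched in a few lines of Python. The helper name `strip_high_bit` is hypothetical; it simply clears the 8th bit of every byte.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the 8th bit of every byte, as a 7-bit-only gateway would."""
    return bytes(b & 0x7F for b in data)

word = "Привет"
from_koi8 = strip_high_bit(word.encode("koi8_r")).decode("ascii")
from_cp1251 = strip_high_bit(word.encode("cp1251")).decode("ascii")

print(from_koi8)    # pRIWET
print(from_cp1251)  # Ophber
```

With KOI8-R the leftover happens to be a rough Latin transliteration with the case inverted, because of the way that table is laid out; with Windows-1251 the leftover is just noise. In both cases the original Russian letters are gone.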
The solution was to use a special standard for transferring email
messages, which helps a message to survive when passing through any
mail gateway. This standard is called MIME (Multipurpose Internet
Mail Extensions, RFC 2045-2049).
If your mailer uses MIME, the header of your letter will look
like:
MIME-Version: 1.0
Content-Type: text/plain; charset=KOI8-R
Content-Transfer-Encoding: 8bit
if you are sure that there will be no problems with old servers and
gateways ("Content-Transfer-Encoding: 8bit" shows that no encoding is
necessary). But if problems can occur, your mailer can "hide"
Russian or East European characters from those old servers and prevent
possible damage. There are two possible methods of doing this.
Your mailer can use the "Base64" method, which converts the WHOLE text into safe
but absolutely unreadable 7-bit text, or "Quoted-Printable", in which
ONLY NON-LATIN characters are converted to numbers. In both cases
only your mailer, if it is properly configured, can restore the original
text. If it is old, or does not conform to the common standards, you will
see (in the case of Base64) something like:
7sXS18nby8kg28HM0dQ
and, in the case of Quoted-Printable:
Hello,_dear_SEELANGERS_=3A_A_=E2=EE=F2_=FD=F2=EE_=FF_=EF=E8=F8=F3_=EF
So the second one is preferable - even in the worst case the English
text will be readable.
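For the curious, the two samples above can be decoded by hand with Python's standard `base64` and `quopri` modules. This is only a sketch under assumptions: the letter does not say which charsets the samples use, so KOI8-R for the Base64 line and Windows-1251 for the Quoted-Printable line are guesses that happen to give readable Russian, the Base64 line is assumed to have lost its trailing "=" padding, and the underscores are treated as spaces (the convention used in encoded headers).

```python
import base64
import quopri

# Base64 sample from the text; restore the padding that seems to be missing.
b64_sample = "7sXS18nby8kg28HM0dQ"
padded = b64_sample + "=" * (-len(b64_sample) % 4)
text_b64 = base64.b64decode(padded).decode("koi8_r")  # charset is a guess

# Cyrillic part of the Quoted-Printable sample; "_" stands for a space.
qp_sample = b"=E2=EE=F2_=FD=F2=EE_=FF_=EF=E8=F8=F3"
text_qp = quopri.decodestring(qp_sample, header=True).decode("cp1251")

print(text_b64)  # Нервишки шалят
print(text_qp)   # вот это я пишу
```

Note that the Latin words in the Quoted-Printable sample need no decoding at all, which is exactly why that method degrades more gracefully.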
Now let's talk about the ways you can force your mailer to send and
receive messages with Russian characters. If you can already send and
receive such messages, forget about this problem and close this
message.
If you cannot read some messages with Russian characters, look at the
message introducing The Bat! mailer that I posted to this list
yesterday. Its header should contain the following lines:
Mime-Version: 1.0
Content-Type: text/plain; charset=KOI8-R
Content-Transfer-Encoding: 8bit
It shows that I sent a plain text message (not HTML mail), which
was created in KOI8-R, without conversion to 7-bit. But if you see
"Latin-1", "ISO-????" or something else instead of this, and/or 7bit
instead of 8bit, contact your ISP - it should fix it, because it is
its fault. When I send my message, it passes through my ISP's mail
gateway (mail.ru), then it goes directly to the server at CUNY
(listserv.cuny.edu), which delivers it to the members of the list.
Neither of them converts mail in any way - that means that only
your ISP is responsible.
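If you prefer to check these header lines with a script rather than by eye, a rough sketch with Python's standard `email` package could look like this. The raw message below is a made-up stand-in for a saved copy of the letter, not the real one.

```python
from email import message_from_bytes

# Hypothetical raw message, as it might be saved from the mailbox.
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=KOI8-R\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
) + "Привет!\r\n".encode("koi8_r")

msg = message_from_bytes(raw)
charset = msg.get_content_charset()                      # lowercased charset
cte = msg.get("Content-Transfer-Encoding", "7bit").lower()
body = msg.get_payload(decode=True).decode(charset)

print(charset, cte, body.strip())
```

If `charset` is not `koi8-r` or `cte` is not `8bit` for that letter, something between the list server and your mailbox has rewritten it.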
If you see that many Russian characters in some emails are replaced
with other Russian characters, mostly in uppercase, this probably happens
because the message was created in Windows-1251 and your mailer displays
it as KOI8-R. In this case you will see "оПХБЕР!" instead of
"Привет!" ("Privet!"). In the opposite case (the message was created in
KOI8-R, or was created in Windows and automatically converted to KOI8-R,
but your mailer still thinks that it is in Windows-1251) you will see
"рТЙЧЕФ!". Change the encoding, and everything will be OK.
If you cannot, or do not know how to do it, try installing The Bat!
and opening the corrupt message in this mailer. It supports SEVEN
different variants of Cyrillic, and it can help in most cases when a
message was corrupted by conversion.
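The idea behind such a repair can be sketched in Python: take the garbled text, turn it back into bytes with the charset the mailer wrongly used, and decode those bytes with another charset. This is not how The Bat! is implemented (I do not know its internals); the function names and the list of five codecs are my own choice for the illustration.

```python
from itertools import permutations

# Python codec names for five common Cyrillic encodings (illustrative list).
CHARSETS = ["koi8_r", "cp1251", "cp866", "iso8859_5", "mac_cyrillic"]

def reinterpret(text: str, shown_as: str, really: str) -> str:
    """Undo a wrong display: re-encode with `shown_as`, decode with `really`."""
    return text.encode(shown_as).decode(really)

def candidates(garbled: str) -> dict:
    """Try every ordered pair of charsets and keep the ones that decode."""
    found = {}
    for shown_as, really in permutations(CHARSETS, 2):
        try:
            found[(shown_as, really)] = reinterpret(garbled, shown_as, really)
        except UnicodeError:
            continue  # this pair cannot represent the text at all
    return found

results = candidates("оПХБЕР!")
print(results[("koi8_r", "cp1251")])  # Привет!
```

The sketch only produces the candidates; choosing the readable one is still left to a human, or to some word-frequency heuristic that is not shown here.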
There are several programs which can recover corrupt messages that
were converted many times or have lost the 8th bit. Some of them are
shareware (for example, MailReader, which is sold in Europe as CP
Tuner 2000), others are freeware (such as a small and nice plugin for MS Outlook
which can be downloaded from ftp://ftp.freeware.ru/pub/internet/mail/cyrtr12.zip ).
But if the message was converted only once, it can be recovered
automatically in "The Bat!".
And what if you can read Russian characters without any problem, but
those you are writing to cannot?
If they can usually read Russian messages, but cannot read YOUR
emails, it means that they use a properly working mailer and the
problems occur because of you or your ISP. First of all, make sure
that you use KOI8-R to send the messages, not an ISO or Mac encoding.
If that does not help, make sure that your messages are sent as
"Quoted-Printable".
Also try changing your SMTP server's name in the mailer's
preferences. Try smtp.comail.ru, smtp.mail.ru or smtp.web.de
(secure connections are not supported, so please do not enable SSL) and look
at the result. It should help.
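For those who send mail from scripts rather than from a mailer, the sending-side advice above (KOI8-R charset, Quoted-Printable transfer encoding) can be sketched with Python's standard `email` package. The subject and body are placeholders, and actually handing the message to an SMTP server is left out.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Test"  # placeholder subject
# Ask for KOI8-R text, wrapped as Quoted-Printable so it is 7-bit safe.
msg.set_content("Hello, Привет!", charset="koi8-r", cte="quoted-printable")

wire = msg.as_bytes()  # what would travel through the gateways
print(wire.decode("ascii"))
```

Every byte of `wire` is below 128, so a gateway that drops the 8th bit has nothing to damage, and the word "Hello" stays readable even in a mailer that knows nothing about MIME.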
If your friends or colleagues cannot read Russian messages at all,
ask them to check whether they have a good mailer and Russian fonts
installed. The best solution is to install MS Outlook Express,
which comes with MS Internet Explorer 5 (the package also includes several
Unicode fonts). It is free.
If you need more information about this subject, look at the following
links:
http://www.computerra.ru/1997/12/2.html,
http://eagle.glasnet.ru/~kazarn/rus/encperv.htm,
http://www.florin.ru/win/articles/mail.html,
http://www.smartlinkcorp.com/w95/utils/cptun/index.html and
http://russia.agama.com/mailreader.
If my advice does not work, email me (and please include a couple of
lines in Russian in your letter) and I will try to help you.
Cordially, Alexandre Bougakov <mailto:bougakov at mail.ru>
Student of the sociological faculty of the Higher School of Economics
(http://www.hse.ru/fakultet/sociology/default.html), Moscow, Russian
Federation
My website is http://SocioLink.narod.ru/ (thousands of sociology
related links on the Web - in Russian, Microsoft Internet Explorer 4
or higher is required)
My PGP key ID is 0x97F20C99, Key Fingerprint is C83C 5998 F43A BEB7
70DF B8FC CC5E 960E 97F2 0C99 (PGP version is 6.0.2i)
-------------------------------------
"The Bat!" v.1.44 - лучшая в мире почтовая программа для Windows.
АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЫЬЪЭЮЯабвгдеёжзийклмнопрстуфхцчшщыьъэюя