build a font for your endangered language...

William J Poser wjposer at LDC.UPENN.EDU
Sat May 17 23:07:34 UTC 2008


Rudy,

I'm not sufficiently familiar with the specific applications you
use to be able to give detailed advice. I'm a Unix person and don't
know MS Windows very well (except for the Xerox finite-state transducer
tools, I have had no unfree software on my machines in nearly four years).
However, with regard to email, the problem may have to do with the nature
of the email system itself.

The email system, that is, the system by which machines move mail
around, predates the web, fancy wordprocessors, and so on. It goes
WAY back, and so at its core supports only 7-bit ASCII. To this day,
if you put anything in which the high bit of a byte might be set
into an email message, it may not make it to its destination.
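
To make "the high bit of a byte" concrete, here is a minimal Python
sketch. The function name has_high_bit and the sample strings are
illustrative assumptions, not part of any mail software.

```python
def has_high_bit(data: bytes) -> bool:
    """Return True if any byte has its top (eighth) bit set."""
    return any(b & 0x80 for b in data)

# Plain ASCII only uses the low seven bits of each byte.
ascii_bytes = "hello".encode("ascii")

# In Latin-1, the accented letter is stored as the single byte 0xE9.
latin1_bytes = "café".encode("latin-1")

ascii_is_risky = has_high_bit(ascii_bytes)
latin1_is_risky = has_high_bit(latin1_bytes)
```

Text for which this check comes back true is the kind that may not
survive a strictly 7-bit mail path unless it is re-encoded first.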

The way around this is to re-encode the text using only safe
characters and decode it back to its real encoding at the other end.
The most common method for doing this nowadays is to use base64
encoding. In this method, each group of three bytes is treated as
a 24-bit string and divided up into four 6-bit chunks. Each of these
chunks is used as an index into the 64-character string:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

Since 2^6 = 64, each 6-bit chunk corresponds uniquely to a 
character.

So, three bytes that may take on any value are transformed into
four nice safe ASCII characters. 
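
The arithmetic above can be sketched in Python roughly as follows.
The helper encode_three_bytes and the sample bytes are assumptions
made for illustration; in practice you would just call the standard
base64 module, which is used here only to check the hand-rolled result.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_three_bytes(chunk: bytes) -> str:
    """Turn exactly three bytes into four base64 characters."""
    if len(chunk) != 3:
        raise ValueError("this sketch only handles groups of exactly three bytes")
    # Glue the three bytes together into one 24-bit number.
    bits = (chunk[0] << 16) | (chunk[1] << 8) | chunk[2]
    # Cut it into four 6-bit pieces, most significant first.
    indices = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Each piece is an index into the 64-character string.
    return "".join(ALPHABET[i] for i in indices)

# Three bytes that all have the high bit set.
sample = bytes([0xE0, 0xB8, 0x81])

manual = encode_three_bytes(sample)
library = base64.b64encode(sample).decode("ascii")
```

Input whose length is not a multiple of three is padded out with "="
characters in real base64; the sketch leaves that case to the library.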

When you send an image, audio file, video, or MS Word document
as an attachment, it is base64-encoded. If you enter non-ASCII text
into an email interface that knows how to handle it, the interface
will notice this and base64-encode the text as well.

So, one class of problems arises when people get non-ASCII text
into an email message. It may be garbled in passing through the
mail system, or it may not be handled correctly by your email
reader, which may assume that it is getting plain ASCII unless
it is notified that it is getting base64-encoded text.

Like any piece of text, email may be in a variety of encodings.
For everything to work properly, the encoding should be specified
in the mail header. That way, at the other end, the reader's mail
reading software can display it or convert to something it knows
how to display, or failing that, a human being can convert it
manually. Note that encoding in this sense is distinct from encodings
like base64 that are used to get everything into ASCII. If, for
example, your original text is in TIS-620, the Thai national
standard, it isn't safe to send it as is, since this encoding
uses all eight bits. So your Thai text will have to be base64-encoded.
The base64-encoded text is then decoded back to TIS-620 at the other end.
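
The two layers can be seen in a short Python sketch. The sample Thai
string and the variable names are assumptions chosen for illustration;
"tis_620" is the name Python's standard codecs use for this encoding.

```python
import base64

thai_text = "ภาษาไทย"

# Layer 1: the character encoding. Every Thai letter becomes one byte,
# and all of those bytes have the high bit set.
tis620_bytes = thai_text.encode("tis_620")

# Layer 2: the transfer encoding. The bytes become plain ASCII characters.
wire_text = base64.b64encode(tis620_bytes).decode("ascii")

# At the other end the two layers are undone in reverse order.
received_bytes = base64.b64decode(wire_text)
received_text = received_bytes.decode("tis_620")
```

Note that the receiver has to know both facts: that the body is
base64, and that the bytes underneath are in the Thai encoding.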

So, another type of problem arises if your email system doesn't
correctly identify the character encoding in the header. (This
also happens with web pages. HTML has a meta tag whose charset setting
identifies the encoding, but it is often missing or incorrect.)
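
For email, the relevant headers are Content-Type, which carries the
charset, and Content-Transfer-Encoding. As a rough sketch, Python's
standard email package will fill in both if told what to use. The
subject line and the body text here are made up for illustration.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Test message"

# Ask for the body to be stored in the Thai encoding and sent as base64.
msg.set_content("ภาษาไทย", charset="tis-620", cte="base64")

# These are the headers a mail reader relies on at the other end.
content_type = str(msg["Content-Type"])
transfer_encoding = str(msg["Content-Transfer-Encoding"])
declared_charset = msg.get_content_charset()

# The message as it would travel, and the body as the reader recovers it.
wire_form = msg.as_string()
decoded_body = msg.get_content()
```

If either header were missing or wrong, the reader would have to
guess, which is where the garbled text tends to come from.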

Mixing encodings is another way to create problems. When you
insert text into a buffer, you are just inserting some bytes.
How they will be interpreted down the line depends on the encoding
that the downstream software thinks the text is in, and only one
encoding is associated with a piece of plain text. If you have
bits of text in different encodings and want to combine them into
a single piece of text, you need to convert them all to a single
encoding before combining them. In general the only way to do
that will be to convert to Unicode. (If you use the import text
function in a wordprocessor, the way it imports text in various
encodings is that it converts it in passing from whatever encoding
it is in to the word processor's internal encoding, which is
usually Unicode these days.)
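
Here is a small Python sketch of the difference between gluing bytes
together and converting first. The two sample fragments and the
choice of Latin-1 for the first one are assumptions for illustration.

```python
# Two fragments that arrived in different encodings.
french_bytes = "café ".encode("latin-1")
thai_bytes = "ไทย".encode("tis_620")

# Wrong: concatenate the raw bytes and read the lot as a single encoding.
# Nothing fails, but the Thai bytes are displayed as Latin-1 letters.
mixed = french_bytes + thai_bytes
misread = mixed.decode("latin-1")

# Right: convert each fragment to Unicode using its own encoding,
# then combine, then store the result in one encoding (UTF-8 here).
combined = french_bytes.decode("latin-1") + thai_bytes.decode("tis_620")
utf8_bytes = combined.encode("utf-8")
```

The wrong version raises no error at all, which is part of why this
kind of problem is easy to create and hard to notice until later.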

If your printer does not print some characters, that is probably
because it doesn't have the right fonts. If you can see them in MS
Word, you must have the right fonts on your system, but for some
reason the printer doesn't know about them. You need to consult
an MS Windows expert to find out what to do about this. With
regard to sending MS Word files to others, my understanding is that
MS Word does not by default include the necessary fonts when it
saves a file, so if you are using anything non-standard, the
recipient may see gaps in place of the "exotic" characters. If
the recipient only needs to read the document, not to edit it,
exporting as PDF will generally solve this problem. I think that
I've heard that there is a way to force MS Word to include the
fonts in the document, but I don't know how.

The SIL fonts work just fine for me on my GNU/Linux systems, and
I hear that they work fine on MS Windows (which is, after all,
SIL's main target), so I don't know why they don't work for you.

Bill


