build a font for your endangered language...

William J Poser wjposer at LDC.UPENN.EDU
Fri May 16 22:04:13 UTC 2008


Keola Donaghy 'uk'uneisguz:

>Aloha. We created and used our custom fonts back in 1994 and are
>still slowly trying to wean ourselves from them and switch
>completely to Unicode.

Actually, I think this confirms what I have been saying: using
custom fonts is NOT a problem, except in cases like cell phones
and work machines over which users have no control, where you
can't install them.

The problem is not custom fonts, it is custom ENCODINGS.

Since I think some people may not be clear on the distinction,
let me explain. Text in a computer really consists of a sequence
of character codes, which are non-negative integers. The computer
doesn't really store an "a" - it stores a number which by convention
is associated with the character "a". Once upon a time, in the days
of "dumb terminals" and fixed-encoding keyboards, this was all hard-wired.
When you pressed the "a" key on your keyboard it sent a certain small
integer to your computer, and when the computer sent that same small
integer to the terminal, the terminal displayed the corresponding
glyph. Nowadays it is possible to program what codes are generated by
particular keyboard events and what glyphs are displayed, but
the basic principle is the same: text consists of a sequence of
numbers.

Until recently, by far the most common encoding was ASCII,
in which "a" has the character code 97. (Character codes are normally
given in hexadecimal but I'll translate into decimal here.) "b" is
98, "c" is 99. "A" is 65, "B" is 66, "C" is 67, etc. So, if you
have an ASCII-encoded font containing glyphs for the roman alphabet,
sending the code 98 to the display will select the glyph for "b"
and display it.
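
As a small illustration, here is a sketch in Python (chosen only for
convenience; nothing above depends on it) that prints the codes behind a
short piece of text. The variable names are made up for the example.

```python
# Text as a sequence of character codes (decimal, as in the paragraph above).
text = "abc ABC"
codes = [ord(ch) for ch in text]      # character -> code
print(codes)                          # [97, 98, 99, 32, 65, 66, 67]

# Going the other way: a code selects a character.
print(chr(98))                        # b

# Stored as ASCII, the bytes are exactly these numbers.
ascii_bytes = text.encode("ascii")
print(list(ascii_bytes) == codes)     # True
```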

For other languages there are other encodings. If, for example,
you use the ARMSCII-7 encoding (as you might if you were Armenian),
sending the code 98 to the display gets you not the letter "b"
but the Armenian capital letter cha.
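
The point that one number can mean two different letters can be sketched
in Python. Python ships no ARMSCII-7 codec as far as I know, so the table
below is hand-written and holds only the single entry mentioned above; the
names ARMSCII7_SAMPLE and show are hypothetical.

```python
# Hand-written illustration, not a full codec: only the one entry from the text.
ARMSCII7_SAMPLE = {98: "\u0549"}   # U+0549 ARMENIAN CAPITAL LETTER CHA

def show(code, table=None):
    """Return the character a code stands for (ASCII when no table is given)."""
    if table is None:
        return bytes([code]).decode("ascii")
    return table[code]

print(show(98))                    # b
print(show(98, ARMSCII7_SAMPLE))   # Armenian capital cha
```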

Until recently, at best there was a single standard for each language
and writing system, so that everybody would be on the same wavelength
within that language and writing system. Fonts for Armenian or
Russian or Hebrew or whatever would be encoded according to the
standard for that language. Then things would be simple so long
as you were using that language, but would get messy if, say,
you needed to use Armenian and English in the same document, or
wanted to write in Russian on a machine set up for Hebrew.
Furthermore, in many cases there were multiple encodings for the
same writing system. Sometimes, every font had its own idiosyncratic
encoding. (The champions seem to be the Ethiopians, who had over
40 known encodings for Amharic.)
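
To see what multiple encodings for one writing system look like in
practice, here is a sketch using two Cyrillic encodings that Python
happens to ship (KOI8-R and Windows-1251). The choice of these two and of
the sample word is mine, purely for illustration.

```python
word = "Привет"                        # Russian for "hello"

koi8 = word.encode("koi8_r")           # one Cyrillic encoding
cp1251 = word.encode("cp1251")         # another Cyrillic encoding

print(list(koi8))
print(list(cp1251))
print(koi8 == cp1251)                  # False: same text, different numbers

# Reading KOI8-R bytes as if they were Windows-1251 yields the wrong letters.
misread = koi8.decode("cp1251")
print(misread == word)                 # False
print(koi8.decode("koi8_r") == word)   # True
```

Software that guesses the wrong encoding makes exactly this kind of
misreading.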

In this situation, where every font potentially uses its own
encoding, for other people to use your font it isn't sufficient
for them to install it - their software has to understand its
encoding.

With much current software, so long as your font uses a well-known
encoding, the software can use it because it contains or knows how
to look up information about the encoding. Your browser, for example,
almost certainly (a) attempts to detect the encoding of the web page
it displays and (b) allows you to tell it what encoding to use (in case
it fails to guess correctly - this happens with some frequency, in part
because many web pages lie about their encoding and the browser accepts
the lie). But if you have a truly idiosyncratic encoding in your font,
software may not know what to do with it.

What Unicode does is unify all writing systems into a single encoding.
In Unicode "b" and Armenian capital cha do not compete for the
same codepoint. Instead, "b" is 98 as in ASCII and Armenian capital
cha is 1353. With everything included in a single encoding, you can
mix writing systems easily within a single document and use one writing
system on a system set up for another.
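
A short Python sketch of the same point: both letters live in one string,
each with its own codepoint. UTF-8, used at the end, is one common way of
storing Unicode codepoints as bytes; it is my addition for the example.

```python
import unicodedata

mixed = "b" + "\u0549"                 # Latin b followed by Armenian capital cha

for ch in mixed:
    # Prints the decimal codepoint and the official Unicode character name.
    print(ord(ch), unicodedata.name(ch))

# Both letters survive a round trip through a single encoding.
data = mixed.encode("utf-8")
print(data.decode("utf-8") == mixed)   # True
```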

So, if you create your own font but use Unicode as the encoding,
so long as people are able to install your font they should have no
problem using it. What you should not do is create fonts that use
your own idiosyncratic encoding.

One of the uses of FontForge is in fact reencoding an existing font.
You can see an example of this at:
http://billposer.org/Linguistics/Computation/Reencoding/HowTo.html
The examples used in this tutorial are based on a real task.
I wanted to be able to use Linear B and at the time could only
find a font that used an idiosyncratic encoding. So I took that
font and changed the encoding to Unicode.
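
The idea behind reencoding can be sketched without FontForge. What follows
is not the procedure from the tutorial and does not use FontForge's API; it
is a minimal illustration in Python with a made-up legacy font table. The
legacy slot numbers (65, 69, 73), the glyph labels, and the function names
are all hypothetical; only the Unicode codepoints of the three Linear B
signs are real.

```python
# Hypothetical legacy font: Linear B glyphs parked in slots borrowed from ASCII.
legacy_font = {65: "glyph-a", 69: "glyph-e", 73: "glyph-i"}

# Hand-made mapping: legacy slot -> Unicode codepoint
# (U+10000, U+10001, U+10002 are the Linear B syllables A, E, I).
legacy_to_unicode = {65: 0x10000, 69: 0x10001, 73: 0x10002}

def reencode(font, mapping):
    """Return a new table with each glyph moved to its Unicode codepoint."""
    return {mapping[old]: glyph for old, glyph in font.items()}

def transcode(codes, mapping):
    """Convert text stored as legacy codes into a Unicode string."""
    return "".join(chr(mapping[c]) for c in codes)

unicode_font = reencode(legacy_font, legacy_to_unicode)
print(sorted(hex(cp) for cp in unicode_font))   # ['0x10000', '0x10001', '0x10002']
print(len(transcode([65, 73], legacy_to_unicode)))   # 2
```

Text typed with the old font has to be converted along with the font,
which is what the second function stands for.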

Bill


