build a font for your endangered language...

Sat May 17 13:33:20 UTC 2008

Just catching up with this thread. As Bill pointed out, it is custom
encoding (not fonts) that is the problem.

The way I explain it to people, there are two issues with extended Latin
characters:
1) are the ones you need encoded in Unicode?
2) are they included in any existing (Unicode) fonts?

The answer to #1 might need a little research in the Unicode tables, and in
the case of less-common character+diacritic combinations, one should look at
whether a base character plus a combining diacritic will resolve the issue
(note Andrew's response). See http://www.unicode.org/charts/ (and check also
the IPA Extensions, which rather unhelpfully are not also listed on that
page - see http://www.unicode.org/charts/symbols.html#PhoneticSymbols ).
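
As a rough illustration of the combining-diacritic check, here is a minimal
Python sketch using the standard unicodedata module. The two sample
characters and the helper name describe() are my own choices for the
example, not anything from this thread. It asks Unicode normalization (NFC)
whether a precomposed character exists for a given base+diacritic pair.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# "e" + COMBINING ACUTE ACCENT: a precomposed character exists,
# so NFC normalization collapses the pair into a single code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# OPEN E (U+025B, in the IPA Extensions block) + COMBINING ACUTE ACCENT:
# there is no precomposed form, so NFC leaves the two code points as they are.
open_e_acute = unicodedata.normalize("NFC", "\u025b\u0301")

print(describe(e_acute))
print(describe(open_e_acute))
```

If the normalized result is still two code points, the combination is still
valid Unicode; it just relies on the font and rendering software to stack
the diacritic properly, which brings us back to issue #2.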

Usually the problem is #2 - the lack of representation of less commonly used
characters in available fonts. 

Also, many less commonly used characters do appear in at least some fonts,
so on many computers, text written with a custom Unicode font for a
particular orthography may still display without that font, picking up the
appropriate characters from other fonts. Basically the realm of extended
characters is not the wilderness it once was, though there is still work to
do.

FYI, the new PanAfrican Localisation Network project (which succeeds the
PanAfrican Localisation project) has a sub-project on open-source extended
Latin fonts - extending existing fonts mainly - for African orthographies. 

On re-encoding, this has been an issue in some countries in Africa too. I
may have mentioned on this list an old French-funded project run by RIFAL to
convert text in legacy fonts into Unicode -
http://www.panafril10n.org/PanAfrLoc/RIFAL
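
For anyone wondering what such a conversion looks like in principle, here is
a minimal Python sketch. It is not the RIFAL tool, and the mapping table is
entirely hypothetical: it assumes a legacy font that drew open e and open o
in the slots normally used for q and x. A real conversion needs the actual
table for the font in question.

```python
# Hypothetical mapping for an imaginary legacy font that displayed
# extended Latin glyphs in the slots of rarely needed ASCII letters.
LEGACY_TO_UNICODE = str.maketrans({
    "q": "\u025b",  # LATIN SMALL LETTER OPEN E
    "Q": "\u0190",  # LATIN CAPITAL LETTER OPEN E
    "x": "\u0254",  # LATIN SMALL LETTER OPEN O
    "X": "\u0186",  # LATIN CAPITAL LETTER OPEN O
})

def legacy_to_unicode(text):
    """Replace legacy-font character codes with their Unicode equivalents."""
    return text.translate(LEGACY_TO_UNICODE)

converted = legacy_to_unicode("sqbx")
print([f"U+{ord(ch):04X}" for ch in converted])
```

Note that this is only safe on text known to be in the legacy font. Run on
ordinary text, it would also rewrite every genuine q and x, which is part of
why custom encodings are such a nuisance.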

Don

> -----Original Message-----
> From: Indigenous Languages and Technology
> [mailto:ILAT at LISTSERV.ARIZONA.EDU] On Behalf Of William J Poser
> Sent: Friday, May 16, 2008 6:04 PM
> To: ILAT at LISTSERV.ARIZONA.EDU
> Subject: Re: [ILAT] build a font for your endangered language...
> 
> Keola Donaghy 'uk'uneisguz:
> 
> >Aloha We created and used our custom fonts back in 1994 and are
> >still slowly trying to wean ourselves from them and switch
> >completely to Unicode.
> 
> Actually, I think this confirms what I have been saying: using
> custom fonts is NOT a problem, except in cases like cell phones
> and work machines over which users have no control, where you
> can't install them.
> 
> The problem is not custom fonts, it is custom ENCODINGS.
> 
> Since I think some people may not be clear on the distinction,
> let me explain. Text in a computer really consists of a sequence
> of character codes, which are non-negative integers. The computer
> doesn't really store an "a" - it stores a number which by convention
> is associated with the character "a". Once upon a time, in the days
> of "dumb terminals" and fixed-encoding keyboards, this was all
> hard-wired.
> When you pressed the "a" key on your keyboard it sent a certain small
> integer to your computer, and when the computer sent that same small
> integer to the terminal, the terminal displayed the corresponding
> glyph. Nowadays it is possible to program what codes are generated by
> particular keyboard events and what glyphs are displayed, but
> the basic principle is the same: text consists of a sequence of
> numbers.
> 
> What until recently was by far the most common encoding was ASCII,
> in which "a" has the character code 97. (Character codes are normally
> given in hexadecimal but I'll translate into decimal here.) "b" is
> 98, "c" is 99. "A" is 65, "B" is 66, "C" is 67, etc. So, if you
> have an ASCII-encoded font containing glyphs for the roman alphabet,
> sending the code 98 to the display will select the glyph for "b"
> and display it.
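
The character codes quoted above are easy to check for yourself. This is a
minimal Python sketch (the sample string is arbitrary) that turns text into
its sequence of numbers and back again.

```python
text = "abc ABC"

# ord() gives the character code for a single character.
codes = [ord(ch) for ch in text]
print(codes)  # [97, 98, 99, 32, 65, 66, 67]

# chr() goes the other way, from a code back to a character.
restored = "".join(chr(code) for code in codes)
print(restored == text)  # True

# Encoded as ASCII, the text is stored as exactly these numbers, one per byte.
print(list(text.encode("ascii")) == codes)  # True
```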
> 
> For other languages there are other encodings. If, for example,
> you use the ARMSCII7 encoding (which you might have done if you
> were an Armenian), sending the code 98 to the display gets you the
> Armenian capital letter cha instead of the letter "b".
> 
> Until recently, at best there was a single standard for each language
> and writing system, so that everybody would be on the same wavelength
> within that language and writing system. Fonts for Armenian or
> Russian or Hebrew or whatever would be encoded according to the
> standard for that language. Then things would be simple so long
> as you were using that language, but would get messy if, say,
> you needed to use Armenian and English in the same document, or
> wanted to write in Russian on a machine set up for Hebrew.
> Furthermore, in many cases there were multiple encodings for the
> same writing system. Sometimes, every font had its own idiosyncratic
> encoding. (The champions seem to be the Ethiopians, who had over
> 40 known encodings for Amharic.)
> 
> In this situation, where every font potentially uses its own
> encoding, for other people to use your font it isn't sufficient
> for them to install it - their software has to understand its
> encoding.
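
To see why the encoding matters, here is a small Python sketch that reads
the same stored number under four different legacy encodings. The byte
value 225 and the particular encodings are my own choice for the example
(Python's standard library has no ARMSCII codec, so Russian and Hebrew
encodings stand in for it).

```python
import unicodedata

# One stored byte, value 225 (hexadecimal E1).
raw = bytes([0xE1])

# The same number means a different character under each encoding.
decoded = {}
for encoding in ("latin-1", "cp1251", "koi8_r", "iso8859_8"):
    ch = raw.decode(encoding)
    decoded[encoding] = ch
    print(f"{encoding:10} U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Software that assumes the wrong one of these will display the wrong letter
even though nothing in the file has changed.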
> 
> With much current software, so long as your font uses a well-known
> encoding, the software can use it because it contains or knows how
> to look up information about the encoding. Your browser, for example,
> almost certainly (a) attempts to detect the encoding of the web page
> it displays and (b) allows you to tell it what encoding to use (in case
> it fails to guess correctly - this happens with some frequency, in part
> because many web pages lie about their encoding and the browser accepts
> the lie). But if you have a truly idiosyncratic encoding in your font,
> software may not know what to do with it.
> 
> What Unicode does is unify all writing systems into a single encoding.
> In Unicode "b" and Armenian capital cha do not compete for the
> same codepoint. Instead, "b" is 98 as in ASCII and Armenian capital
> cha is 1353. With everything included in a single encoding, you can
> mix writing systems easily within a single document and use one writing
> system on a system set up for another.
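
The two code points mentioned above can be confirmed with a short Python
sketch using the standard unicodedata module. The variable names are just
for illustration.

```python
import unicodedata

# Latin "b" and Armenian capital cha side by side in one string.
mixed = "b\u0549"

codes = [ord(ch) for ch in mixed]
names = [unicodedata.name(ch) for ch in mixed]
print(codes)  # [98, 1353]
print(names)

# Both characters survive a round trip through a single encoding, UTF-8.
round_trip = mixed.encode("utf-8").decode("utf-8")
print(round_trip == mixed)  # True
```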
> 
> So, if you create your own font but use Unicode as the encoding,
> so long as people are able to install your font they should have no
> problem using it. What you should not do is create fonts that use
> your own idiosyncratic encoding.
> 
> One of the uses of FontForge is in fact reencoding an existing font.
> You can see an example of this at:
> http://billposer.org/Linguistics/Computation/Reencoding/HowTo.html
> The examples used in this tutorial are based on a real task.
> I wanted to be able to use Linear B and at the time could only
> find a font that used an idiosyncratic encoding. So I took that
> font and changed the encoding to Unicode.
> 
> Bill