[Lexicog] UNICODE

Mike Maxwell maxwell at LDC.UPENN.EDU
Tue Sep 13 16:59:04 UTC 2005


Jimm GoodTracks wrote:
> ...The key is to have easy and reliable application software programs
> available, in order to do the conversions without it becoming a chore.

Agreed.  The worst case is an undocumented hacked font for more characters 
than fit into 8 bits, such as are used for many Indic languages.  (Such 
fonts either encode a single character in more than one byte--as Unicode 
encodings do--or encode pieces of the glyphs in single bytes and then 
combine those pieces, rather as if 'b' were composed of 'l' and 'o', and 
'd' of 'o' and 'l'.)  Figuring out how these fonts work can be a major 
task.

The next worst case--and the one we had for Yoruba--is where there is a 
hacked 8-bit font, and all the characters fit into the 8 bits.  Then you 
just create a text with all 256 code points, and eyeball it to figure out 
how to recreate the characters in Unicode.  In this case, the code for the 
encoding converter is little more than a table.
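
To make the "little more than a table" point concrete, here is a minimal 
sketch in Python.  The byte values and their Unicode equivalents are 
invented for illustration (they are not the actual Yoruba font's layout), 
and the names LEGACY_TO_UNICODE and convert are hypothetical.

```python
# Hypothetical table: legacy byte value -> Unicode string.
# These assignments are made up; a real table comes from eyeballing how
# the hacked font renders each code point.
LEGACY_TO_UNICODE = {
    0xE8: "e\u0323",        # assumed slot: e with dot below
    0xF2: "o\u0323",        # assumed slot: o with dot below
    0xF9: "s\u0323",        # assumed slot: s with dot below
    0xE9: "e\u0323\u0301",  # assumed slot: e with dot below and acute
}


def convert(data: bytes) -> str:
    """Map each legacy byte through the table.

    Bytes that are not in the table are passed through as the character
    with the same code point (i.e. treated as Latin-1).
    """
    return "".join(LEGACY_TO_UNICODE.get(b, chr(b)) for b in data)
```

For example, convert(b"\xf2r\xf2") would yield "o", U+0323, "r", "o", 
U+0323 under this made-up table.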

The difficulty for me lay in figuring out the correct order for the stacked 
diacritics, i.e. interpreting the normalization standard.  (I got it wrong, 
and when I got it right, the font didn't work :-).)  Note that this is not 
a problem for all languages, since many writing systems don't have stacked 
diacritics.
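
One way to check the ordering rather than interpret the standard by hand 
is to let a normalization library do it.  The sketch below uses Python's 
standard unicodedata module; the choice of "o" with a dot below and an 
acute accent is my own illustrative example, not data from the Yoruba 
project.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Show a string as a list of code points, e.g. 'U+006F U+0323'."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# The same visual character typed with the diacritics in the "wrong" order:
# base letter, then acute (above), then dot below.
acute_first = "o\u0301\u0323"

# Canonical ordering sorts combining marks by combining class:
# dot below has class 220, acute has class 230, so dot below comes first.
nfd = unicodedata.normalize("NFD", acute_first)
nfc = unicodedata.normalize("NFC", acute_first)

print(codepoints(nfd))  # U+006F U+0323 U+0301
print(codepoints(nfc))  # U+1ECD U+0301
```

Whether a given font then renders the normalized sequence correctly is a 
separate question, as noted above.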

The easy case is when you've used a standard 8-bit font, in which case 
there is almost certain to be an encoding converter already available, such 
as iconv or SIL's converter from the old Doulos IPA fonts to Unicode IPA.
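
As a rough illustration of the easy case, Python's built-in codecs do the 
same job as iconv for standard 8-bit encodings.  The sketch assumes the 
legacy data is in Windows-1252; that choice is only an example, and this 
is not the SIL converter.

```python
# Legacy 8-bit data, assumed here to be Windows-1252 (cp1252).
legacy_bytes = b"na\xefve \x93quoted\x94"

# Decode with the standard codec; no hand-built table is needed.
text = legacy_bytes.decode("cp1252")

# Re-encode as UTF-8 for storage.
utf8_bytes = text.encode("utf-8")
```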

The other consideration is when you've used different fonts in different 
places, e.g. in different fields in your dictionary.  If some code points 
are used differently in those fonts, then you need to run encoding 
conversion over some fields (such as fields which contain data in the 
target language) and not others (such as English gloss fields).  The SIL 
encoding converters are generally set up to allow this, particularly where 
the differences are represented as SFM-marked (Shoebox style) fields.
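
A minimal sketch of field-by-field conversion over SFM-marked lines, in 
Python.  The marker names (\lx, \xv for target-language data, \ge for the 
English gloss) and the toy converter are assumptions for illustration; 
this is not how the SIL converters are implemented.

```python
import re

# Hypothetical: markers whose fields hold target-language data in the
# legacy font.  All other fields (e.g. \ge English glosses) are left alone.
TARGET_MARKERS = {"lx", "xv"}

# A line of the form: backslash, marker name, whitespace, field content.
SFM_LINE = re.compile(r"\\(\w+)(\s+)(.*)$")


def convert_sfm(lines, convert_field):
    """Apply convert_field only to the content of target-language fields."""
    out = []
    for line in lines:
        m = SFM_LINE.match(line)
        if m and m.group(1) in TARGET_MARKERS:
            marker, sep, content = m.groups()
            out.append("\\" + marker + sep + convert_field(content))
        else:
            out.append(line)  # other fields and non-field lines pass through
    return out


def toy_convert(s: str) -> str:
    """Toy stand-in for a real converter: one made-up substitution."""
    return s.replace("\u00f2", "o\u0323")


record = ["\\lx \u00f2r\u00f2", "\\ge w\u00f2rd"]
converted = convert_sfm(record, toy_convert)
```

Here only the \lx line is changed; the \ge line keeps its original 
characters even though it contains the same code point.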

    Mike Maxwell
    CASL (formerly LDC)


 
 


