[Lexicog] Encoding conversion (was: Hebrew and Greek Shoebox databases)

Mike Maxwell maxwell at LDC.UPENN.EDU
Mon Jul 30 01:43:35 UTC 2007


Cheryl Reitz wrote:
> ...maybe you can help me troubleshoot a Unicode 
> conversion problem.  I am working with a Toolbox database (5000+ 
> entries in Sudanese language, Mabaan, data collected over 30 years by 
> my neighbour Betty Miller, now 84).  It is now in an SIL font (Mabaan 
> Sophia) and I want to convert all into Arial MS Unicode font.  There 
> are several characters in the Vernacular we must search-replace to do 
> this (they show up as corrupted when we change to Arial MS Unicode).  
> We know the 4-digit hex-characters we now have and that we want to 
> convert them to

I'm confused.  "Mabaan Sophia" sounds like the name of a language + the 
name of a font.  Or is this one of the old SIL phonetic fonts where you 
could combine base characters and accents together, and assign the 
result to a code point?  But if that's what it is, I wouldn't think you 
would have four hex digits you need to convert.  (4 hex digits would be 
two bytes; there were very few two-byte character encodings until 
recently, and the ones that did exist were mostly Japanese, Chinese and 
Korean.)

So what I'm guessing is that you have a character encoding that had 
one-byte code points for base characters (like ordinary letters) and 
other one-byte code points for non-base characters (like accents); 
and that the accents were non-spacing characters that preceded the base 
characters, sort of like typing an accent on a typewriter with a "dead 
key" (so-called because the keystroke did not cause the platen to 
advance), then typing the ordinary (non-dead) letter.  And you want to 
convert it into Unicode, specifically the UTF-8 encoding of Unicode.
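
To make that guess concrete, here is a rough Python sketch of such a 
conversion.  The byte values in ACCENTS and BASES are invented for 
illustration; they are not the real Mabaan Sophia assignments, which I 
don't know.  The one real point is the reordering: the legacy accent 
comes before its base letter, while Unicode combining marks come after.

```python
import unicodedata

# HYPOTHETICAL legacy code points, made up for this example.
ACCENTS = {
    0xB4: "\u0301",  # legacy "dead" acute -> COMBINING ACUTE ACCENT
    0xA8: "\u0308",  # legacy "dead" diaeresis -> COMBINING DIAERESIS
}
BASES = {
    0xF0: "\u014B",  # legacy byte assumed to be eng (LATIN SMALL LETTER ENG)
}

def legacy_to_unicode(data: bytes) -> str:
    """Convert accent-before-base legacy bytes to a Unicode string."""
    out = []
    pending = []  # accents seen but not yet attached to a base character
    for b in data:
        if b in ACCENTS:
            pending.append(ACCENTS[b])
            continue
        if b in BASES:
            ch = BASES[b]
        elif b < 0x80:
            ch = chr(b)  # plain ASCII passes through unchanged
        else:
            raise ValueError(f"unmapped legacy byte 0x{b:02X}")
        out.append(ch)
        out.extend(pending)  # combining marks follow the base in Unicode
        pending.clear()
    out.extend(pending)  # a stray accent at the end of input is kept as-is
    # NFC picks a precomposed character where Unicode has one.
    return unicodedata.normalize("NFC", "".join(out))
```

Calling .encode("utf-8") on the result gives the UTF-8 bytes.  Raising 
an error on unmapped bytes is a deliberate choice in this sketch: it 
points out characters missing from the table instead of silently 
passing them through.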

Assuming that's correct, and that you know what the sequences are that 
you want to convert (maybe you have an old printout, or even a way of 
displaying the text on-screen using this font), the next question is 
whether all the byte sequences are unambiguous.  That is, is it always 
the case that the sequence of C5 A1 (to take an arbitrary pair of bytes) 
translates into the same Unicode character(s)?  Or does it only do so in 
certain fields?  For example, could that sequence ever appear in a 
gloss field, say, where you wouldn't want to translate it into the 
Unicode characters in question?
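
One way to answer that before converting anything is to survey which 
fields a given byte sequence actually occurs in.  The sketch below 
assumes a Toolbox-style file where each field starts with a backslash 
marker, and that the lines have been read in binary mode; the function 
name and the sample markers are my own choices for illustration.

```python
from collections import Counter

def fields_containing(lines, needle):
    """Count, per SFM marker, the lines whose content contains needle.

    `lines` is an iterable of bytes objects; `needle` is the byte
    sequence to look for.  A line that does not start with a backslash
    is treated as a continuation of the previous field.
    """
    counts = Counter()
    marker = None
    for line in lines:
        if line.startswith(b"\\"):
            parts = line.split(None, 1)
            marker = parts[0][1:].decode("ascii", "replace")
            content = parts[1] if len(parts) > 1 else b""
        else:
            content = line
        if marker is not None and needle in content:
            counts[marker] += 1
    return counts

# Tiny invented sample, not real Mabaan data.
sample = [
    b"\\lx ba\xc5\xa1n",
    b"\\ge to \xc5\xa1 go",
    b"\\ps n",
    b"\\lx \xc5\xa1a",
]
print(fields_containing(sample, b"\xc5\xa1"))
```

If the sequence only ever turns up in vernacular fields, a blanket 
replacement is probably safe; if it also turns up in gloss fields, the 
conversion needs to be restricted by field.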

I guess another question is whether Arial Unicode MS has the particular 
Unicode characters you need.  That's a pretty comprehensive font, but 
there are certainly characters it does not contain, and I don't even 
know what kind of writing system this is (Latin? Arabic? Ethiopic?).

Assuming certain answers to these questions, then it is probably 
possible to convert the character encoding to Unicode, but doing so with 
a search-and-replace tool (particularly if that tool doesn't have a 
"replace all" feature) will be tedious.  You might explore the SIL tool 
'cc', which I imagine handles Unicode these days.  I don't use it myself; 
I would code up something in Python (or Perl, if I knew Perl :-)) or even C. 
But I think there are some SIL tools that might work for you better than 
cc, namely Reprise and the SILConverters pack; see 
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=Home.  I 
believe one or both of these is SFM-aware, so that if need be you can 
apply the changes to some fields in a Toolbox-style dictionary, but not 
others, should that be necessary.
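
For what it's worth, the field-restricted approach is not hard to 
sketch in Python either.  This is not how Reprise or SILConverters 
work internally (I don't know that); it is just an illustration.  It 
assumes the file has already been read into a list of strings without 
trailing newlines (for a one-byte legacy encoding, reading with 
encoding="latin-1" maps each byte to one character), and the function 
name, marker set and replacement are hypothetical.

```python
def convert_sfm(lines, convert, markers):
    """Apply convert() only to the content of fields named in `markers`.

    `lines` is a list of str, `convert` is any str -> str function, and
    `markers` is a set of marker names without the backslash.  Lines
    that do not start with a backslash continue the previous field.
    """
    out = []
    active = False  # is the current field one we should convert?
    for line in lines:
        if line.startswith("\\"):
            marker, sep, content = line.partition(" ")
            active = marker[1:] in markers
            out.append(marker + sep + (convert(content) if active else content))
        else:
            out.append(convert(line) if active else line)
    return out

# Invented example: turn a legacy character into eng, but only in \lx.
lines = ["\\lx \u00f0a", "\\ge \u00f0 eth", "\\ps n"]
fixed = convert_sfm(lines, lambda s: s.replace("\u00f0", "\u014b"), {"lx"})
print(fixed)
```

The marker itself is never passed through convert(), so the field 
names stay intact whatever the replacement table does.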

> She's an expert on Mabaan, not computers!

I would bet that character encoding conversion is only the start of the 
problems you'll find.  Things like inconsistent names for parts of 
speech, broken cross references, missing or mis-ordered fields...  It's 
extremely difficult to keep a good-sized dictionary like that consistent.
-- 
	Mike Maxwell
	maxwell at ldc.upenn.edu


 