[Lexicog] Encoding conversion (was: Hebrew and Greek Shoebox databases)
Mike Maxwell
maxwell at LDC.UPENN.EDU
Mon Jul 30 01:43:35 UTC 2007
Cheryl Reitz wrote:
> ...maybe you can help me troubleshoot a Unicode
> conversion problem. I am working with a Toolbox database (5000+
> entries in Sudanese language, Mabaan, data collected over 30 years by
> my neighbour Betty Miller, now 84). It is now in an SIL font (Mabaan
> Sophia) and I want to convert all into Arial MS Unicode font. There
> are several characters in the Vernacular we must search-replace to do
> this (they show up as corrupted when we change to Arial MS Unicode).
> We know the 4-digit hex-characters we now have and that we want to
> convert them to
I'm confused. "Mabaan Sophia" sounds like the name of a language + the
name of a font. Or is this one of the old SIL phonetic fonts where you
could combine base characters and accents together, and assign the
result to a code point? But if that's what it is, I wouldn't think you
would have four hex digits you need to convert. (4 hex digits would be
two bytes; there were very few two-byte character encodings until
recently, and the ones that did exist were mostly Japanese, Chinese and
Korean.)
So what I'm guessing is that you have a character encoding that had
one-byte code points for base characters (like ordinary letters) and
other one-byte code points for non-base characters (like accents);
and that the accents were non-spacing characters that preceded the base
characters, sort of like typing an accent on a typewriter with a "dead
key" (so-called because the keystroke did not cause the platen to
advance), then typing the ordinary (non-dead) letter. And you want to
convert it into Unicode, specifically the UTF-8 encoding of Unicode.
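
Under that guess, the conversion could be sketched in Python roughly as
follows. The accent byte values (0x8E, 0x8F) are invented for the
example and are not the real Mabaan Sophia assignments, and the function
name legacy_to_unicode is likewise hypothetical. The sketch also assumes
that every byte below 0x80 is plain ASCII. Because Unicode puts
combining marks after the base character, the accent that preceded the
base in the old encoding has to be moved behind it.

```python
import unicodedata

# Hypothetical accent bytes: invented for illustration only,
# NOT the actual code points of any SIL font.
ACCENTS = {
    0x8E: "\u0301",  # combining acute accent
    0x8F: "\u0308",  # combining diaeresis
}

def legacy_to_unicode(data: bytes) -> str:
    """Convert 'accent byte, then base byte' text to a Unicode string."""
    out = []
    pending = []  # accents seen but not yet attached to a base character
    for b in data:
        if b in ACCENTS:
            pending.append(ACCENTS[b])
        elif b < 0x80:
            # Assumption: bytes below 0x80 are ordinary ASCII characters.
            out.append(chr(b))
            out.extend(pending)  # Unicode order: base first, then marks
            pending.clear()
        else:
            raise ValueError(f"unmapped legacy byte 0x{b:02X}")
    if pending:
        raise ValueError("accent with no following base character")
    # NFC composes base + mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))

text = legacy_to_unicode(b"b\x8eal\x8fo")
utf8_bytes = text.encode("utf-8")  # the UTF-8 encoding of the result
```

Raising an error on any unmapped byte, rather than passing it through,
is deliberate: it turns up characters nobody knew were in the data.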
Assuming that's correct, and that you know what the sequences are that
you want to convert (maybe you have an old printout, or even a way of
displaying the text on-screen using this font), the next question is
whether all the byte sequences are unambiguous. That is, is it always
the case that the sequence of C5 A1 (to take an arbitrary pair of bytes)
translates into the same Unicode character(s)? Or does it only do so in
certain fields? For example, could that sequence ever appear in a
gloss field, say, where you wouldn't want to translate it into the
Unicode characters in question?
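
One way to answer that question before converting anything is to count
where the sequence occurs, field by field. The following is only a
sketch: it assumes a Toolbox-style file in which each field starts with
a backslash marker at the beginning of a line, the marker names (lx, ge,
ps) are just the usual examples, and count_sequence_by_field is a
made-up name.

```python
from collections import Counter

def count_sequence_by_field(lines, seq: bytes) -> Counter:
    """Count occurrences of a byte sequence under each SFM field marker."""
    counts = Counter()
    marker = "(no marker)"  # for any lines before the first field
    for line in lines:
        if line.startswith(b"\\"):
            # The marker is the first whitespace-delimited token, minus
            # the backslash. Continuation lines keep the previous marker.
            marker = line.split(None, 1)[0][1:].decode("ascii", "replace")
        n = line.count(seq)
        if n:
            counts[marker] += n
    return counts

# Invented sample data, read as raw bytes so nothing is decoded yet.
sample = [
    b"\\lx k\xc5\xa1n\n",
    b"\\ge fish \xc5\xa1\n",
    b"\\ps n\n",
    b"\\lx \xc5\xa1\xc5\xa1\n",
]
counts = count_sequence_by_field(sample, b"\xc5\xa1")
```

If the sequence shows up only under the vernacular markers, a blanket
replacement is probably safe; if it also shows up under a gloss marker,
as in the invented sample, the conversion needs to be restricted by
field.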
I guess another question is whether Arial Unicode MS has the particular
Unicode characters you need. That's a pretty comprehensive font, but
there are certainly characters it does not contain, and I don't even
know what kind of writing system this is (Latin? Arabic? Ethiopic?).
Assuming certain answers to these questions, then it is probably
possible to convert the character encoding to Unicode, but doing so with
a search-and-replace tool (particularly if that tool doesn't have a
"replace all" feature) will be tedious. You might explore the SIL tool
'cc', which I imagine does Unicode these days. I don't use it; I would
code up something in Python (or Perl, if I knew Perl :-)) or even C.
But I think there are some SIL tools that might work for you better than
cc, namely Reprise and the SILConverters pack; see
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&cat_id=Home. I
believe one or both of these is SFM-aware, so that if need be you can
apply the changes to some fields in a Toolbox-style dictionary, but not
others, should that be necessary.
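
For comparison, a home-grown version of field-restricted conversion
might look like the sketch below. This is not how Reprise or
SILConverters work internally; it only illustrates the idea. The names
convert_fields and toy_convert are hypothetical, the rule that bytes
C5 A1 stand for U+014B is invented, the choice of cp1252 for the
non-vernacular fields is an assumption, and the marker is assumed to be
separated from its content by a single space.

```python
def toy_convert(body: bytes) -> str:
    # Invented rule for illustration: legacy bytes C5 A1 mean U+014B.
    return body.decode("latin-1").replace("\xc5\xa1", "\u014b")

def convert_fields(lines, vernacular_markers, convert,
                   other_encoding="cp1252"):
    """Apply `convert` only to the listed fields; decode the rest as-is."""
    def default(body: bytes) -> str:
        return body.decode(other_encoding)

    out = []
    decode = default  # used for any lines before the first marker
    for raw in lines:
        raw = raw.rstrip(b"\r\n")
        if raw.startswith(b"\\"):
            head, _, body = raw.partition(b" ")
            marker = head[1:].decode("ascii")
            # Choose the decoder for this field and its continuation lines.
            decode = convert if marker in vernacular_markers else default
            out.append(f"\\{marker} {decode(body)}" if body else f"\\{marker}")
        else:
            out.append(decode(raw))
    return out

# Invented sample: only the \lx field is treated as vernacular.
sample = [b"\\lx ka\xc5\xa1\r\n", b"\\ge fish \xc5\xa1\n", b"\\dt\n"]
result = convert_fields(sample, {"lx"}, toy_convert)
```

The returned strings can then be written out as UTF-8. Note that the
same two bytes come out differently in the gloss field, which is the
point of restricting the conversion by marker.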
> She's an expert on Mabaan, not computers!
I would bet that character encoding conversion is only the start of the
problems you'll find. Things like inconsistent names for parts of
speech, broken cross references, missing or mis-ordered fields... It's
extremely difficult to keep a good-sized dictionary like that consistent.
--
Mike Maxwell
maxwell at ldc.upenn.edu