[Lexicog] migrating toolbox data to unicode

Sebastian Drude sebadru at ZEDAT.FU-BERLIN.DE
Mon Apr 3 06:50:02 UTC 2006


Dear lexicographers,


I know this is neither a Toolbox help list nor a list for questions
about the Consistent Changes program, but I hope that somebody here can
at least point me to the right place to post my question.  Also, I
feel my questions could be of interest to other members of this list.

I am currently trying to migrate my Toolbox databases from the Latin-1
(standard Windows) character set to UNICODE.  I will also migrate my
lexical databases to UNICODE encoding, but the lion's share is the many
annotated texts that I want to prepare for import into the ELAN tool
(http://www.mpi.nl/tools).

The main point is that I want to get rid of my workarounds for
characters missing in latin-1.  For instance, in order to represent a
"y" with a tilde, I usually used a "ÿ" (a "y" with a trema) or
sometimes character sequences such as "~y" or "#y".
I thought this was exactly the kind of task that SIL's old
Consistent Changes tool was designed for.

So I tried to write a Consistent Changes table that had entries like
the following (where "X" represents the correct character U+1EF9,
'y with a tilde'):

"ÿ"  > "X"
"~y" > "X"
"#y" > "X"

I used EMACS to write this CC table and saved it in UTF-8 encoding.
However, my tests using this CC table in a toolbox export process did
not work, nor did manual conversion using CC as a stand-alone
program.  It would not recognize and match my letters with a trema --
probably because the program expects these characters to be encoded as
UNICODE already, which is not the case.
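
The suspected mismatch can be illustrated with a small Python sketch.
It assumes, without confirmation from the CC documentation, that CC
compares raw bytes; the sample field "\tx kaÿa" and the helper
entry_matches are made up for illustration.

```python
# "ÿ" (U+00FF) is stored differently in the two encodings.
legacy_bytes = "\u00ff".encode("latin-1")  # one byte: 0xFF
utf8_bytes = "\u00ff".encode("utf-8")      # two bytes: 0xC3 0xBF


def entry_matches(data: bytes, search: bytes) -> bool:
    """Hypothetical stand-in for a byte-wise table lookup."""
    return search in data


# A made-up Toolbox line as it would sit on disk in a Latin-1 database.
legacy_data = "\\tx ka\u00ffa".encode("latin-1")

# A search string taken from a table saved as UTF-8 does not occur in
# the Latin-1 data; the same string saved as Latin-1 does.
found_with_utf8_table = entry_matches(legacy_data, utf8_bytes)
found_with_latin1_table = entry_matches(legacy_data, legacy_bytes)
```

If this is indeed the cause, either the table's search strings would
have to be in the legacy encoding, or the data would have to be
converted to UTF-8 before the table is applied.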

There is yet another problem with this approach: in some fields,
I have German comments, and these contain lots of "ä"s, "ö"s and "ü"s
(respectively, a, o, and u with trema), which I would rather not have
converted into the corresponding letter with a tilde.  Is there a
way to set up a CC table where the changes are sensitive to the fields
that contain the data to be changed?

After much trial and error, I ended up trying to hack together some
EMACS Lisp macros which eventually might do all this and save my
toolbox databases in UNICODE (UTF-8) encoding and with the workarounds
substituted by the right unicode characters in selected fields.  But
still, I think a proper CC table would be better.  Has anybody here
had a similar problem?  What solutions did you find?
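
The field-sensitive conversion described above can be sketched in
Python as follows.  This is only a sketch under stated assumptions, not
a tested migration tool: the marker names "tx" and "lx" are
hypothetical stand-ins for the fields that hold object-language data,
the functions convert_text and convert_file are made up for this
example, and the legacy encoding is taken to be Latin-1 (a Windows
database may in fact be in cp1252).

```python
# Hypothetical markers of fields that contain object-language data.
VERNACULAR_MARKERS = {"tx", "lx"}

# Workarounds and their replacement, U+1EF9 (y with tilde).
REPLACEMENTS = {"\u00ff": "\u1ef9", "~y": "\u1ef9", "#y": "\u1ef9"}


def convert_text(text: str) -> str:
    """Replace workarounds, but only inside the selected fields."""
    marker = None
    out = []
    for line in text.split("\n"):
        # A line starting with a backslash opens a new field; any
        # other line continues the field that is currently open.
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            marker = parts[0] if parts else None
        if marker in VERNACULAR_MARKERS:
            for old, new in REPLACEMENTS.items():
                line = line.replace(old, new)
        out.append(line)
    return "\n".join(out)


def convert_file(src: str, dst: str, legacy_encoding: str = "latin-1") -> None:
    """Read a legacy-encoded database and write it back as UTF-8."""
    with open(src, encoding=legacy_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(convert_text(text))
```

Because the text is decoded before any replacement happens, German
"ä", "ö" and "ü" in other fields (for instance a comment field) are
only re-encoded as UTF-8 and otherwise left alone.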

Anyway, let's assume I managed to convert all the toolbox databases
into the new UNICODE encoding.  Of course, I would have to adapt
all my Toolbox settings files, too, especially the language-type
(*.lng) files.  Can I use the original settings files and adapt them,
or will I have to configure all languages and database types from
scratch?

A problem I had when trying to adapt the sort orders, for instance,
was that the dialog window would not accept the UNICODE characters
(I used the character map tool which comes with Windows XP).  Instead
of the character, only a question mark appears, although I checked the
Unicode-UTF-8-box in the advanced options and use a unicode-font for
the language in question.  It is indeed a question mark, as Toolbox
complains that this character has been defined several times when I
try to close the configuration window.

I could of course edit the language-settings files manually using,
e.g., EMACS.  (By the way, the same question marks appear in
EMACS, but there I can use other commands for entering the correct
unicode characters; see the EMACS WIKI on unicode.)  But I would
prefer to use the proper configuration tools that Toolbox offers.

If anybody has experience in migrating legacy toolbox databases
to UNICODE encoding, I would be really grateful if they could give me
some advice on this matter.

Thanks in advance,

Sebastian Drude



 