[Lexicog] migrating toolbox data to unicode

Neal_Brinneman at SIL.ORG
Mon Apr 3 16:34:07 UTC 2006


Dear Sebastian,
If you want to use CC to convert to unicode, you need version 8.1.6. I will
try to attach the zipped file, which is about 900 KB. I will rename its
extension to zpi to get it through our firewall. This version of CC
recognizes U+ numbering of unicode positions. You asked about leaving
comments unchanged while changing other fields.
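
As a side note, here is a small Python sketch (not part of CC) of how a U+
position relates to the bytes that end up in a UTF-8 file. The helper name
utf8_hex is made up for illustration.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 bytes of a Unicode code point as spaced hex."""
    return " ".join(f"{byte:02X}" for byte in chr(code_point).encode("utf-8"))


# U+00FF is y with diaeresis, U+1EF9 is y with tilde.
for cp in (0x00FF, 0x1EF9):
    print(f"U+{cp:04X} -> {utf8_hex(cp)}")
```

This can be handy for checking a converted file in a hex viewer.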

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
One way is with groups and another is with switches. I will give you a
sample of the groups first.
group(main)
' c ' > dup use(nochange)
nl 'c ' > dup use(nochange)
'#y' > U+00FF c in UTF-8 this is C3 BF
c put all other changes here

group(nochange)
nl > dup use(main)
nl 'c ' > dup  c stay here until you leave the comment field
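
For readers without CC at hand, the same two-state idea can be sketched in
Python. This is only an illustration under assumptions: the comment marker
"\c " is a guess at how the comment field is marked, the function name
convert is made up, and the replacements map the workarounds to U+1EF9 as
described in the original question.

```python
# Workarounds from the original question, all mapped to U+1EF9 (y with tilde).
REPLACEMENTS = {"\u00ff": "\u1ef9", "~y": "\u1ef9", "#y": "\u1ef9"}

# Assumed comment-field marker; adjust to whatever the database really uses.
COMMENT_MARKER = "\\c "


def convert(text, marker=COMMENT_MARKER, replacements=REPLACEMENTS):
    """Apply the replacements everywhere except inside comment fields."""
    converted_lines = []
    for line in text.split("\n"):
        # From the marker to the end of the line nothing changes (the
        # "nochange" state); the part before it is in the "main" state.
        start = line.find(marker)
        if start == -1:
            head, tail = line, ""
        else:
            head, tail = line[:start], line[start:]
        for old, new in replacements.items():
            head = head.replace(old, new)
        converted_lines.append(head + tail)
    return "\n".join(converted_lines)


if __name__ == "__main__":
    sample = "\\tx ma#y\n\\c sch\u00f6n #y\n\\tx \u00ffa"
    print(convert(sample))
```

Each new line starts in the changing state again, which corresponds to
nl > dup use(main) in the nochange group.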

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Another way to approach this in CC is with switches as follows.
nl > dup clear(nochange)  c a new line without a comment at the beginning
c of the next line goes back to changing characters.
' c ' > dup set(nochange)  c all data from the c to the end of the line
c will not change
nl 'c ' > dup set(nochange)
'#y' > if(nochange) dup else U+00FF endif
c All other changes look like the previous line, with the left search
c string changed and the else statement changed to that character's
c location in unicode.
Neal

http://www.wysite.org/sites/brinnemans




                                                                           
Sebastian Drude <sebadru at zedat.fu-berlin.de>
Sent by: lexicographylist@yahoogroups.com
04/03/2006 02:50 AM
Please respond to lexicographylist@yahoogroups.com

To: lexicographylist at yahoogroups.com
cc:
Subject: [Lexicog] migrating toolbox data to unicode




Dear lexicographers,


I know this is neither a Toolbox help list nor a list for asking questions
about the Consistent Changes program, but I hope that somebody here can at
least point me to the right place to post my question.  Also, I
feel my questions could be of interest to other members of this list.

I am currently trying to migrate my Toolbox databases from the latin-1
(standard Windows) character set to UNICODE.  I will also migrate my
lexical databases to UNICODE encoding, but the lion's share is the many
annotated texts that I want to prepare for import into the ELAN tool
(http://www.mpi.nl/tools).

The main point is that I want to get rid of my workarounds for
characters missing in latin-1.  For instance, in order to represent a
"y" with a tilde, I usually used a "ÿ" (a "y" with a trema) or
sometimes character sequences such as "~y" or "#y".
I thought this was exactly the kind of task that SIL's old
Consistent Changes tool was designed for.

So I tried to write a consistent changes table that had entries like
the following (where "X" represents the correct character U+1EF9,
'y with a tilde'):

"ÿ"  > "X"
"~y" > "X"
"#y" > "X"

I used EMACS to write this CC table and saved it in UTF-8 encoding.
However, my tests using this CC table in a toolbox export process did
not work, nor did manual conversion using CC as a stand-alone
program.  It would not recognize and match my letters with a trema --
probably because the program expects these characters to be encoded as
UNICODE already, which is not the case.
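
A small Python sketch of one plausible reading of this failure: the table
saved as UTF-8 holds two bytes for the letter, while the latin-1 data holds
one, so the byte patterns cannot match. The variable names are made up for
illustration, and this is not a statement about how CC works internally.

```python
# The same letter (y with trema, U+00FF) as stored in the two encodings.
data_bytes = "\u00ff".encode("latin-1")   # legacy database: one byte, FF
table_bytes = "\u00ff".encode("utf-8")    # table saved as UTF-8: C3 BF

# A byte-wise search for the table's pattern in the legacy data fails.
print(table_bytes in data_bytes)

# Decoding the legacy data first makes the match possible; the result can
# then be written out as UTF-8.
text = data_bytes.decode("latin-1")
converted_bytes = text.replace("\u00ff", "\u1ef9").encode("utf-8")
print(converted_bytes.hex(" ").upper())
```

The point is only that the search side has to be expressed in the encoding
of the data, and the replacement side in the target encoding.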

There is still another problem with this approach: in some fields,
I have German comments, and these contain lots of "ä"s, "ö"s and "ü"s
(respectively, a, o, and u with trema), which I would rather not have
converted into the corresponding letter with a tilde.  Is there a
way to set up a CC table where the changes are sensitive to the fields
that contain the data to be changed?

After much trial and error, I ended up trying to hack together some
EMACS Lisp macros which eventually might do all this and save my
toolbox databases in UNICODE (UTF-8) encoding, with the workarounds
replaced by the right unicode characters in selected fields.  But
still, I think a proper CC table would be better.  Has anybody here
had a similar problem?  Which solutions did you find?

Anyway, let's assume I managed to convert all the toolbox databases
into the new UNICODE encoding.  Of course, I would have to adapt
all my Toolbox settings files, too, especially the language-type
(*.lng) files.  Can I use the original settings files and adapt them,
or will I have to configure all languages and database types from
scratch?

A problem I had when trying to adapt the sort orders, for instance,
was that the dialog window would not accept the UNICODE characters
(I used the character map tool which comes with windows XP).  Instead
of the character, only a question mark appears, although I checked the
Unicode-UTF-8-box in the advanced options and use a unicode-font for
the language in question.  It is indeed a question mark, as Toolbox
complains that this character has been defined several times when I
try to close the configuration window.

I could of course edit the language settings files manually using,
e.g., EMACS.  (By the way, the same question marks appear in
EMACS, but there I can use other commands for entering the correct
unicode characters; see the EMACS WIKI on unicode.)  But I would
prefer to use the proper configuration tools that Toolbox offers.

If anybody has experience in migrating legacy toolbox databases
to UNICODE encoding, I would be really grateful if they could give me
some advice on this matter.

Thanks in advance,

Sebastian Drude



