[Lexicog] migrating toolbox data to unicode
Ron Moe
ron_moe at SIL.ORG
Mon Apr 3 21:23:18 UTC 2006
Neal's recommendations are good, but Sebastian was asking about changes to
particular fields in a Toolbox database. In this situation the CC table
needs to be a little different from Neal's example. I use Neal's first
option--using groups. The CC table you need in order to work on particular
fields is as follows:
xxxxxxxxxxxxxxxxxxx
group(main)
'\lx ' > dup use(changes)
'\xv ' > dup use(changes)
group(changes)
'A' > 'a'
'B' > 'b'
'C' > 'c'
nl '\' > dup back(1) use(main)
xxxxxxxxxxxxxxxxxxx
The first group of changes (main) is used to find those fields that are in
the vernacular (assuming that you are making changes to your vernacular
language). You need a line for each vernacular field. Essentially what each
line does is find a vernacular field and send the CC program to the
"changes" group. It doesn't make any changes. It just duplicates (that's
what "dup" means) what it finds.
The second group of changes (changes) is used to list the changes that you
want to make to the vernacular fields. You can list as many changes as you
need. (There is a limit but it is large.)
The last line finds the end of the field and sends the program back to the
"main" group to wait for the next vernacular field. Unfortunately Toolbox
sometimes puts a new line (nl) into a field. So you have to search for a new
line followed by a backslash. You then have to back up one character (that's
what "back(1)" does), backing up past the backslash, so that the program can
find the next backslash code. It's a bit tricky, but this table works well
on Toolbox databases.
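To make this concrete for Sebastian's case, a sketch of the same table with
his actual replacements might look like the following. It assumes that \lx
and \xv are the vernacular fields, that the table file is saved in the same
legacy encoding as the data (so that the 'ÿ' on the left will actually
match), and that you are running a CC version that accepts U+ notation for
the output (Neal mentions 8.1.6 below):
xxxxxxxxxxxxxxxxxxx
group(main)
'\lx ' > dup use(changes)
'\xv ' > dup use(changes)
group(changes)
c replace the legacy workarounds with U+1EF9 (y with tilde)
'ÿ' > U+1EF9
'~y' > U+1EF9
'#y' > U+1EF9
nl '\' > dup back(1) use(main)
xxxxxxxxxxxxxxxxxxx
Since the German comment fields are never listed in group(main), their ä, ö
and ü pass through untouched.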
You can download all sorts of useful CC tables from the DDP website
http://www.sil.org/computing/ddp/index.htm. This particular table is called
charinfi.cc (character in field).
Ron Moe
-----Original Message-----
From: lexicographylist at yahoogroups.com
[mailto:lexicographylist at yahoogroups.com] On Behalf Of
Neal_Brinneman at sil.org
Sent: Monday, April 03, 2006 9:34 AM
To: lexicographylist at yahoogroups.com
Subject: Re: [Lexicog] migrating toolbox data to unicode
Dear Sebastian,
If you want to use CC to convert to Unicode, you need version 8.1.6. I will
try to attach the zipped file, which is about 900 KB. I will rename it to
.zpi to get it through our firewall. This version of CC recognizes U+
numbering of Unicode positions. You asked about not changing comments while
changing other fields.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
One way is with groups and another is with switches. I will give you a
sample of the groups first.
group(main)
' c ' > dup use(nochange)
nl 'c ' > dup use(nochange)
'#y' > U+00FF c in UTF-8 this is C3 BF
c put all other changes here
group(nochange)
nl > dup use(main)
nl 'c ' > dup c stay here until you leave the comment field
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Another way to approach this in CC is with switches, as follows.
c a new line without a comment marker at the beginning of the next line
c goes back to changing characters
nl > dup clear(nochange)
c all data from the c to the end of the line will not change
' c ' > dup set(nochange)
nl 'c ' > dup set(nochange)
'#y' > if(nochange) dup else U+00FF endif
c all other changes look like the previous line, with the left-hand search
c string changed and the else statement changed to that character's location
c in Unicode
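Applied to Sebastian's data, a switch-based sketch could look like the lines
below. It assumes the German comments sit in a field of their own (written
here as \cmt purely as a placeholder; substitute your real marker) and that
U+1EF9, y with tilde, is the target character:
c entering the comment field: stop converting
nl '\cmt ' > dup set(nochange)
c any other field marker starts a field that should be converted again
nl '\' > dup clear(nochange)
'ÿ' > if(nochange) dup else U+1EF9 endif
'~y' > if(nochange) dup else U+1EF9 endif
'#y' > if(nochange) dup else U+1EF9 endif
Because the switch is only touched on lines that begin with a field marker,
lines that Toolbox has wrapped inside a field keep the state of the field
they belong to.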
Neal
http://www.wysite.org/sites/brinnemans
From: Sebastian Drude <sebadru at zedat.fu-berlin.de>
Sent by: lexicographylist at yahoogroups.com
Date: 04/03/2006 02:50 AM
To: lexicographylist at yahoogroups.com
Subject: [Lexicog] migrating toolbox data to unicode
Please respond to: lexicographylist at yahoogroups.com
Dear lexicographers,
I know this is neither a Toolbox help list nor a list for asking questions
about the Consistent Changes program, but I hope that somebody here can at
least point me to the right place to post my question. Also, I feel my
questions could be of interest to other members of this list.
I am currently trying to migrate my Toolbox databases from the latin-1
(standard Windows) character set to UNICODE. I will also migrate my
lexical databases to UNICODE encoding, but the lion's share consists of many
annotated texts that I want to prepare for import into the ELAN tool
(http://www.mpi.nl/tools).
The main point is that I want to get rid of my workarounds for
characters missing in latin-1. For instance, in order to represent a
"y" with a tilde, I usually used a "ÿ" (a "y" with a trema) or
sometimes character sequences such as "~y" or "#y".
I thought this was exactly the kind of task that SIL's old
Consistent Changes tool was designed for.
So I tried to write a consistent changes table that had entries like
the following (where "X" represents the correct character u+1EF9,
'y with a tilde'):
"ÿ" > "X"
"~y" > "X"
"#y" > "X"
I used EMACS to write this CC table and saved it in UTF-8 encoding.
However, my tests using this CC table in a toolbox export process did
not work, nor did manual conversion using CC as a stand-alone
program. It would not recognize and match my letters with a trema --
probably because the program expects these characters to be encoded as
UNICODE already, which is not the case.
There is still another problem with this approach: in some fields
I have German comments, and these contain lots of "ä"s, "ö"s and "ü"s
(a, o, and u with trema, respectively), which I would rather not have
converted into the corresponding letters with a tilde. Is there a
way to set up a CC table where the changes are sensitive to the fields
in which the data to be changed is contained?
After much trial and error, I ended up trying to hack together some
EMACS Lisp macros which might eventually do all this and save my
toolbox databases in UNICODE (UTF-8) encoding, with the workarounds
substituted by the right unicode characters in selected fields. But
still, I think a proper CC table would be better. Has anybody here
had a similar problem? What solutions did you find?
Anyway, let's assume I managed to convert all the toolbox databases
into the new UNICODE coding format. Of course, I would have to adapt
all my Toolbox settings files, too, especially the language-type
(*.lng) files. Can I use the original setting files and adapt them,
or will I have to configure all languages and database types from
scratch?
A problem I had when trying to adapt the sort orders, for instance,
was that the dialog window would not accept the UNICODE characters
(I used the character map tool which comes with Windows XP). Instead
of the character, only a question mark appeared, even though I had checked
the Unicode (UTF-8) box in the advanced options and was using a Unicode font
for the language in question. It is indeed a question mark, as Toolbox
complains that this character has been defined several times when I
try to close the configuration window.
I could of course edit the language settings files manually using,
e.g., EMACS. (By the way, the same question marks appear in EMACS,
but there I can use other commands to enter the correct unicode
characters; see the EMACS wiki on unicode.) But I would prefer to use
the proper configuration tools that Toolbox offers.
If anybody has had experiences in migrating legacy toolbox databases
to UNICODE encoding, I would be really grateful if they could give me
some advice on this matter.
Thanks in advance,
Sebastian Drude