[Lexicog] Shoebox merge and remove duplicates
Mike Maxwell
maxwell at LDC.UPENN.EDU
Sun Sep 24 01:04:18 UTC 2006
Wayne Leman wrote:
> Is there a way to automate a process of merging my current database with the
> older, larger one and end up with a new database which has all and only the
> database records I want? ...
>
> I am willing to do a visual check using some compare-files software, if I
> can compare differences between two Shoebox databases. But the more
> automated the process can be, without losing data, the better.
One of the fundamental problems with SFM files is that the record
separator is the same character (CR-LF) as the field separator.
Typically there are at least two CR-LFs between records (i.e. a blank line),
but that doesn't help much, nor does the fact that one of the field SFMs
is defined as a record separator.
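To make the problem concrete, here is a small sketch. The two sample entries are made up, and I'm assuming \lx is the record-marking SFM. A line-oriented tool counts every field line and blank line; you only get a record count by looking for the record marker yourself:

```shell
# Hypothetical two-record SFM file with CR-LF line endings.
sample=$(mktemp)
printf '\\lx cat\r\n\\ge feline\r\n\r\n\\lx dog\r\n\\ge canine\r\n\r\n' > "$sample"

# wc counts every newline: field lines and blank lines alike.
lines=$(( $(wc -l < "$sample") ))
# Only lines beginning with the (assumed) record marker start a record.
records=$(grep -c '^\\lx ' "$sample")

echo "lines: $lines, records: $records"
```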
If it were me, here's what I'd do:
1) Ensure that your files do not contain any tab chars, and convert all
sequences of multiple tab and/or space chars to a single space char. A
unix command to do this:
tr -s " \t" " "
That's a space char before the \t, and a single space char between the
second pair of quotes. The purpose of this step is to ensure that you
don't accidentally have any tab chars (which will mess up a later step,
since we're going to use tab chars to separate fields), and that the two
files you're going to merge don't have fields that differ trivially by
the amount of whitespace they contain. Of course, if you use two space
chars after a period, you might not want to do this...
2) Convert all CR-LF characters in each file to a (single) tab char.
One way to do this would be to use the unix 'tr' utility:
tr -s "\r\n" "\t"
(The '-s' option squeezes multiple occurrences of your output tab char
to a single tab.) At this point, your files each consist of a single
line, with a single tab char between fields and one more tab at the
very end, where the final CR-LF used to be.
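3) Convert the tab char separating records (but not fields) into a
newline (LF in Unix, which is what I'd be using :-), or CR-LF). One way
to do this is to use the unix 'sed' utility:
sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//'
(Note the single quotes: inside double quotes the shell collapses \\ to
\, so sed would see \lx instead of a literal backslash followed by lx.
The \t and \n escapes are understood by GNU sed. The trailing 'g' means
do this multiple times; by default, sed only does the operation once per
line. The second expression drops the tab that step 2 leaves at the very
end of the file, which would otherwise make the last record differ from
an identical record elsewhere. I'm assuming your record-delineating
SFM is \lx, modify as necessary.) At this point, each file consists of
a series of records separated by a single newline, and fields within
records separated by a single tab char.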
4) Pass both files together through a sorter, and have it eliminate
duplicates. The unix way to do this is
sort -u
(The -u parameter means "eliminate duplicates".) At this point, you
have a single file consisting of non-duplicate records, sorted
alphabetically, with a single newline separating records and a single
tab separating fields.
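5) Convert the single Unix newline to a sequence of two DOS newlines:
sed -e 's/$/\r\n\r/'
(sed works on one line at a time and never sees the newline that ends
the line, so this appends CR LF CR to each line; the line's own LF then
completes the second CR-LF. The \r escape assumes GNU sed.)
6) Convert the tab chars to a single DOS newline:
sed -e 's/\t/\r\n/g'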
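Here is a sketch of steps 1-6 run end to end on two tiny made-up files, assuming GNU tr, sed and sort, and \lx as the record marker. The file names, the sample entries and the flatten helper are only for illustration. It puts the sed scripts in single quotes, adds back a final newline after step 2, drops the tab that step 2 leaves at the end of each file, and writes step 5 as an append at end of line (sed never sees the newline itself):

```shell
# Two hypothetical SFM files with CR-LF line endings; "dog" is in both,
# and one field in a.sfm has a stray double space.
work=$(mktemp -d)
printf '\\lx cat\r\n\\ge  feline\r\n\r\n\\lx dog\r\n\\ge canine\r\n\r\n' > "$work/a.sfm"
printf '\\lx dog\r\n\\ge canine\r\n\r\n\\lx emu\r\n\\ge bird\r\n\r\n' > "$work/b.sfm"

# Steps 1-3: one record per line, fields separated by single tabs.
flatten() {
    # The echo restores a final newline, since step 2 turned every
    # CR and LF (including the last one) into a tab.
    { tr -s ' \t' ' ' < "$1" | tr -s '\r\n' '\t'; echo; } |
        sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//'
}
flatten "$work/a.sfm" > "$work/a.flat"
flatten "$work/b.sfm" > "$work/b.flat"

# Step 4: sort both files together, dropping exact duplicates.
# Steps 5-6: blank line (DOS style) after each record, then tabs back to CR-LF.
sort -u "$work/a.flat" "$work/b.flat" |
    sed -e 's/$/\r\n\r/' |
    sed -e 's/\t/\r\n/g' > "$work/merged.sfm"

# Count the records that survived the merge.
grep -c '^\\lx ' "$work/merged.sfm"
```

The originals are only read, never modified, so you can compare the merged file against them afterwards.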
At step 3, you could diff the two files to see if you have any nearly
identical records. Most diff programs will only tell if two lines
differ; some will tell how they differ, i.e. if there are minor changes.
The visual diff program that comes with ComponentSoftware's RCS
program does this (although with the long lines you're likely to have at
step 3, such diff programs might be DIFFicult to use; guess you could do
step 6 to put the records into temp files first...). While you're at
it, you might want to use RCS to track changes.
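A rough way to find nearly identical records without eyeballing a diff: once exact duplicates are gone, any headword that still shows up more than once belongs to records that differ somewhere. This is only a sketch, assuming the step 3 format (one record per line, tab-separated fields, the \lx field first); the file names and entries are hypothetical:

```shell
work=$(mktemp -d)
# Hypothetical flattened files, as they would look after step 3.
printf '\\lx cat\t\\ge feline\n\\lx dog\t\\ge canine\n' > "$work/old.flat"
printf '\\lx cat\t\\ge feline\n\\lx dog\t\\ge hound\n' > "$work/new.flat"

# Remove exact duplicates, as in step 4.
sort -u "$work/old.flat" "$work/new.flat" > "$work/all.flat"

# Keep only the first field (the headword) of each record, and list
# the headwords that occur more than once.
cut -f1 "$work/all.flat" | sort | uniq -d > "$work/conflicts.txt"
cat "$work/conflicts.txt"
```

Those are the records worth checking by hand. Of course this won't catch two records whose headwords themselves differ slightly.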
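Steps 1-3 and 4-6 can each be combined into single operations using
"pipes", avoiding some of the intermediate files:
cat OldFile1.sfm | tr -s " \t" " " | tr -s "\r\n" "\t" |
sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//' > /tmp/OldFile1.sfm
cat OldFile2.sfm | tr -s " \t" " " | tr -s "\r\n" "\t" |
sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//' > /tmp/OldFile2.sfm
sort -u /tmp/OldFile1.sfm /tmp/OldFile2.sfm | sed -e 's/$/\r\n\r/' |
sed -e 's/\t/\r\n/g' > NewFile.sfm
(The last command reads the two files in /tmp, not the originals. I hand
them to sort directly rather than through cat, because neither ends in a
newline, and cat would glue the last record of the first file onto the
first record of the second.)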
All this presumes that you either have access to a Unix (Linux) machine,
or (more likely) that you use something like the Cygwin Unix utilities (far
superior to the Windows command prompt, IMHO).
Disclaimer: I haven't tested the above, there might be mistakes.
Oops, one other thing I would do, call it step 3 1/2: get rid of any
space chars before tab chars. These would correspond to space chars at
the end of a line. They're not really a problem, except that they could
give you spurious non-identical records (if you accidentally put such
space chars in one file but not the other). Or maybe Shoebox enforces
this when it saves files?
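A sketch of that extra step, assuming GNU sed (for the \t escape) and the one-record-per-line format from step 3. The function name and the sample record are just for illustration:

```shell
# Remove any run of spaces that comes right before a tab (i.e. at the
# end of a field), and any run of spaces at the end of the record.
strip_trailing_spaces() {
    sed -e 's/ *\t/\t/g' -e 's/ *$//'
}

# Hypothetical record with stray spaces at the end of both fields.
printf '\\lx cat \t\\ge feline  \n' | strip_trailing_spaces
```

It would slot in between step 3 and the sort in step 4.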
Links:
http://www.ComponentSoftware.com/ (you can use the freeware version)
http://cygwin.com/
--
Mike Maxwell
maxwell at ldc.upenn.edu