[Lexicog] Shoebox merge and remove duplicates

Mike Maxwell maxwell at LDC.UPENN.EDU
Sun Sep 24 01:04:18 UTC 2006


Wayne Leman wrote:
> Is there a way to automate a process of merging my current database with the 
> older, larger one and end up with a new database which has all and only the 
> database records I want? ...
> 
> I am willing to do a visual check using some compare-files software, if I 
> can compare differences between two Shoebox databases. But the more 
> automated the process can be, without losing data, the better.

One of the fundamental problems with SFM files is that the record 
separator is the same character (CR-LF) as the field separator. 
Typically there are at least two CR-LFs between records (i.e. a blank line), 
but that doesn't help much, nor does the fact that one of the field SFMs 
is defined as a record separator.

If it were me, here's what I'd do:

1) Ensure that your files do not contain any tab chars, and convert all 
sequences of multiple tab and/or space chars to a single space char.  A 
unix command to do this:
    tr -s " \t" " "
That's a space char before the \t, and a single space char between the 
second pair of quotes.  The purpose of this step is to ensure that you 
don't accidentally have any tab chars (which will mess up a later step, 
since we're going to use tab chars to separate fields), and that the two 
files you're going to merge don't have fields that differ trivially by 
the amount of whitespace they contain.  Of course, if you use two space 
chars after a period, you might not want to do this...

2) Convert all CR-LF characters in each file to a (single) tab char. 
One way to do this would be to use the unix 'tr' utility:
    tr -s "\r\n" "\t"
(The '-s' option squeezes multiple occurrences of your output tab char 
to a single tab.)  At this point, your files each consist of a single 
line, with a single tab char before each SFM.

3) Convert the tab char separating records (but not fields) into a 
newline (LF in Unix, which is what I'd be using :-), or CR-LF).  One way 
to do this is to use the unix 'sed' utility:
    sed -e 's/\t\\lx /\n\\lx /g'
(The single quotes matter: inside double quotes the shell would reduce 
the \\ to a single backslash before sed ever sees it.  The trailing 'g' 
means do this multiple times; by default, sed only does the operation 
once per line.  I'm assuming your record-delineating SFM is \lx, modify 
as necessary.)  At this point, each file consists of a series of records 
separated by a single newline, and fields within records separated by a 
single tab char.

4) Pass both files together through a sorter, and have it eliminate 
duplicates.  The unix way to do this is
    sort -u
(The -u parameter means "eliminate duplicates".)  At this point, you 
have a single file consisting of non-duplicate records, sorted 
alphabetically, with a single newline separating records and a single 
tab separating fields.

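5) Convert the single Unix newline after each record to a sequence of 
two DOS newlines.  sed works one line at a time and never sees the 
newline itself, so append to the end of each line instead (sed puts the 
final LF back on its own):
    sed -e 's/$/\r\n\r/'

6) Convert the tab chars to a single (DOS) newline:
    sed -e 's/\t/\r\n/g'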
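
To make the intermediate format concrete, here is a minimal sketch that wraps steps 1-3 and steps 5-6 in two shell functions and runs them on a two-record sample.  The function names (sample, flatten, unflatten) and the sample entries are made up for illustration, \lx is assumed to be the record marker, and the sed expressions assume GNU sed, which understands \t, \r and \n.  Like the rest of this, treat it as a sketch rather than a finished tool.

```shell
# Invented sample: two records, DOS line ends, a blank line between them.
sample() {
    printf '\\lx cat\r\n\\ps  n\r\n\r\n\\lx dog\r\n\\ps n\r\n'
}

# Steps 1-3: one record per line, fields separated by single tabs.
flatten() {
    tr -s ' \t' ' ' |
    tr -s '\r\n' '\t' |
    sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//'
    echo    # the flattened stream has no final newline, so add one
}

# Steps 5-6: back to DOS line ends, with a blank line after each record.
unflatten() {
    sed -e 's/$/\r\n\r/' -e 's/\t/\r\n/g'
}

# Show the bytes of the flattened form, then of the round trip.
sample | flatten | od -c
sample | flatten | unflatten | od -c
```

The extra 's/\t$//' in flatten removes the tab that the final line ending of the file turns into; without it the last record would carry a trailing tab and would not compare equal to an otherwise identical record.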

At step 3, you could diff the two files to see if you have any nearly 
identical records.  Most diff programs will only tell if two lines 
differ; some will tell how they differ, i.e. if there are minor changes. 
The visual diff program that comes with ComponentSoftware's RCS 
program does this (although with the long lines you're likely to have at 
step 3, such diff programs might be DIFFicult to use; guess you could do 
step 6 to put the records into temp files first...).  While you're at 
it, you might want to use RCS to track changes.
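
If a visual diff tool is not at hand, plain diff on the sorted step-3 files already narrows things down.  The sketch below is only an illustration: records_differing is a hypothetical helper name, and it assumes both arguments are files in the one-record-per-line format from step 3.  Identical records drop out, and what remains are the records that occur in only one file, which is where near-duplicates would show up.

```shell
# Hypothetical helper: list records present in only one of two
# flattened (one record per line) files.
records_differing() {
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || return 1
    # Sort both files the same way so identical records line up.
    LC_ALL=C sort -u "$1" > "$tmp1"
    LC_ALL=C sort -u "$2" > "$tmp2"
    # Lines marked "<" occur only in the first file, ">" only in the
    # second.  diff exits with status 1 when the files differ, which
    # is expected here, so that status is ignored.
    diff "$tmp1" "$tmp2" || true
    rm -f "$tmp1" "$tmp2"
}

# Usage, with the temp files from the pipelines in this message:
# records_differing /tmp/OldFile1.sfm /tmp/OldFile2.sfm
```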

Steps 1-3 and 4-6 can each be combined into single operations using 
"pipes", avoiding some of the intermediate files:

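    cat OldFile1.sfm | tr -s " \t" " " | tr -s "\r\n" "\t" |
        sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//' > /tmp/OldFile1.sfm

    cat OldFile2.sfm | tr -s " \t" " " | tr -s "\r\n" "\t" |
        sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//' > /tmp/OldFile2.sfm

    sort -u /tmp/OldFile1.sfm /tmp/OldFile2.sfm | sed -e 's/$/\r\n\r/' |
        sed -e 's/\t/\r\n/g' > NewFile.sfm

(Note the single quotes around the sed expressions, which keep the shell 
from swallowing a backslash.  The extra 's/\t$//' drops the tab left by 
the last line ending in each file, and sort reads the two temp files by 
name so that the last record of one cannot run into the first record of 
the other.)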
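
Since the point is to merge without losing data, one cheap sanity check is to count records before and after.  This is a sketch, not part of the recipe above: count_records is a made-up name, and it assumes every record starts with the \lx marker followed by a space.  If neither input file contains duplicates of its own, the merged count should be no larger than the sum of the two input counts and no smaller than the larger of them.

```shell
# Count records, assuming each one starts with "\lx " at the start of a line.
# With no arguments it reads standard input; with several file names,
# grep prints one "name:count" line per file.
count_records() {
    grep -c '^\\lx ' "$@"
}

# Usage:
# count_records OldFile1.sfm OldFile2.sfm NewFile.sfm
```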

All this presumes that you either have access to a Unix (Linux) machine, 
or (more likely) that you use something like the Cygwin Unix utilities (far 
superior to the Windows command prompt, IMHO).

Disclaimer: I haven't tested the above, there might be mistakes.

Oops, one other thing I would do, call it step 3 1/2: get rid of any 
space chars before tab chars.  These would correspond to space chars at 
the end of a line.  They're not really a problem, except that they could 
give you spurious non-identical records (if you accidentally put such 
space chars in one file but not the other).  Or maybe Shoebox enforces 
this when it saves files?
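
For what it's worth, step 3 1/2 might look something like the sketch below.  The function name strip_trailing_spaces is invented here, the input is assumed to be in the tab-separated form that step 3 produces, and \t in the expression assumes GNU sed.

```shell
# Step 3 1/2: drop spaces that sit just before a tab (the end of a field)
# or at the very end of a record line.
strip_trailing_spaces() {
    sed -e 's/ *\t/\t/g' -e 's/ *$//'
}

# Usage, between step 3 and step 4:
# strip_trailing_spaces < /tmp/OldFile1.sfm > /tmp/OldFile1.clean.sfm
```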

Links:

http://www.ComponentSoftware.com/ (you can use the freeware version)
http://cygwin.com/
-- 
	Mike Maxwell
	maxwell at ldc.upenn.edu


 