[Lexicog] Shoebox merge and remove duplicates

Wayne Leman wayne_leman at SIL.ORG
Sun Sep 24 06:04:50 UTC 2006


Thanks, Mike. I just got your message. I'm back home now. I wound up downloading three different compare-files programs. I found one called WinMerge easy to work with and easy on my (aging) eyes. I was able to spot the salient differences between my files easily with WinMerge and also merge what was in the older file into the newer one easily with the program. The program can run on several platforms and can be downloaded from:

http://winmerge.org/downloads.php

It can check by different parameters. I think the default is a line-by-line check which worked fine for me.

Wayne
-----
Wayne Leman
Cheyenne dictionary online:
http://www11.asphost4free.com/cheyennedictionary/default.htm



  Wayne Leman wrote:
  > Is there a way to automate a process of merging my current database with the 
  > older, larger one and end up with a new database which has all and only the 
  > database records I want? ...
  > 
  > I am willing to do a visual check using some compare-files software, if I 
  > can compare differences between two Shoebox databases. But the more 
  > automated the process can be, without losing data, the better.

  One of the fundamental problems with SFM files is that the record 
  separator is the same character (CR-LF) as the field separator. 
  Typically there are at least two CR-LFs between records (i.e. a blank line), 
  but that doesn't help much, nor does the fact that one of the field SFMs 
  is defined as a record separator.
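
  To make the problem concrete, here is a minimal sketch on a made-up 
  two-record sample. It assumes \lx is the record marker; the entries are 
  purely illustrative. A line-oriented tool sees six lines, and the only 
  way to find the two records is to look for the marker at the start of a 
  line.

```shell
# Hypothetical two-record sample in SFM layout with CR-LF line ends.
sample=$(mktemp)
printf '\\lx alpha\r\n\\ps n\r\n\\ge first\r\n\r\n\\lx beta\r\n\\ps v\r\n' > "$sample"

# What a line-oriented tool sees: fields and blank lines, not records.
line_count=$(wc -l < "$sample" | tr -d ' ')

# Records: count only the lines that begin with the record marker \lx.
record_count=$(grep -c '^\\lx ' "$sample")

printf 'lines: %s, records: %s\n' "$line_count" "$record_count"
rm -f "$sample"
```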

  If it were me, here's what I'd do:

  1) Ensure that your files do not contain any tab chars, and convert all 
  sequences of multiple tab and/or space chars to a single space char. A 
  unix command to do this:
  tr -s " \t" " "
  That's a space char before the \t, and a single space char between the 
  second pair of quotes. The purpose of this step is to ensure that you 
  don't accidentally have any tab chars (which will mess up a later step, 
  since we're going to use tab chars to separate fields), and that the two 
  files you're going to merge don't have fields that differ trivially by 
  the amount of whitespace they contain. Of course, if you use two space 
  chars after a period, you might not want to do this...

  2) Convert all CR-LF characters in each file to a (single) tab char. 
  One way to do this would be to use the unix 'tr' utility:
  tr -s "\r\n" "\t"
  (The '-s' option squeezes multiple occurrences of your output tab char 
  to a single tab.) At this point, your files each consist of a single 
  line, with a single tab char before each SFM.

  3) Convert the tab char separating records (but not fields) into a 
  newline (LF in Unix, which is what I'd be using :-), or CR-LF). One way 
  to do this is to use the unix 'sed' utility:
  sed -e 's/\t\\lx /\n\\lx /g'
  (The single quotes keep the shell from collapsing the doubled backslash 
  before sed sees it. The trailing 'g' means do this multiple times; by 
  default, sed only does the operation once per line. I'm assuming your 
  record-delineating SFM is \lx, modify as necessary.) At this point, each 
  file consists of a series of records separated by a single newline, and 
  fields within records separated by a single tab char.

  4) Pass both files together through a sorter, and have it eliminate 
  duplicates. The unix way to do this is
  sort -u
  (The -u parameter means "eliminate duplicates".) At this point, you 
  have a single file consisting of non-duplicate records, sorted 
  alphabetically, with a single newline separating records and a single 
  tab separating fields.

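  5) Convert the single Unix newline to a sequence of two DOS newlines. 
  sed works one line at a time and never sees the newline itself, so 
  append to the end of each line instead:
  sed -e 's/$/\r\n\r/'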

  6) Convert the tab chars to a single DOS newline:
  sed -e 's/\t/\r\n/g'
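
  The six steps can be tried end to end on throwaway data before touching 
  a real database. The following is only a sketch: it assumes GNU tr, sed 
  and sort, assumes \lx is the record marker, and uses made-up file names 
  and entries. The helper names flatten and unflatten are hypothetical. 
  Two details go beyond the steps as written: the tab left at the very end 
  of each file is dropped, so that the last record of a file compares 
  equal to the same record elsewhere, and sort runs with LC_ALL=C so that 
  only byte-for-byte identical records count as duplicates.

```shell
# Sketch only: assumes GNU tr/sed/sort and \lx as the record marker.

flatten() {
    # Steps 1-3: squeeze blanks, turn line ends into tabs, then put each
    # record on its own line. The echo supplies a final newline; the last
    # sed expression drops the tab left at the very end of the input.
    { tr -s ' \t' '  ' | tr -s '\r\n' '\t\t'; echo; } |
        sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//'
}

unflatten() {
    # Steps 5-6: end each record with a blank line and put one field per
    # line again, using CR-LF line ends.
    sed -e 's/$/\r\n\r/' -e 's/\t/\r\n/g'
}

# Made-up sample files; "beta" appears in both, differing only in spacing.
workdir=$(mktemp -d)
printf '\\lx alpha\r\n\\ps n\r\n\r\n\\lx beta\r\n\\ps  v\r\n' > "$workdir/old.sfm"
printf '\\lx beta\r\n\\ps v\r\n\r\n\\lx gamma\r\n\\ps n\r\n' > "$workdir/new.sfm"

flatten < "$workdir/old.sfm" > "$workdir/old.flat"
flatten < "$workdir/new.sfm" > "$workdir/new.flat"

# Step 4: merge and drop exact duplicates (byte-wise comparison).
LC_ALL=C sort -u "$workdir/old.flat" "$workdir/new.flat" |
    unflatten > "$workdir/merged.sfm"

merged_records=$(grep -c '^\\lx ' "$workdir/merged.sfm")
first_record=$(tr -d '\r' < "$workdir/merged.sfm" | head -n 1)
printf 'records after merge: %s\n' "$merged_records"
rm -rf "$workdir"
```

  On this sample the duplicate "beta" record is kept only once, so the 
  merged file holds three records.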

  At step 3, you could diff the two files to see if you have any nearly 
  identical records. Most diff programs will only tell if two lines 
  differ; some will tell how they differ, i.e. if there are minor changes. 
  The visual diff program that comes with ComponentSoftware's RCS 
  program does this (although with the long lines you're likely to have at 
  step 3, such diff programs might be DIFFicult to use; guess you could do 
  step 6 to put the records into temp files first...). While you're at 
  it, you might want to use RCS to track changes.
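
  A rough alternative to eyeballing a diff, offered as a sketch rather 
  than a replacement for the visual check: once the records are one per 
  line and exact duplicates are gone, any headword that still occurs on 
  more than one line marks records that differ in some field. The sample 
  data is made up, and the \lx field is assumed to be the first field of 
  every record.

```shell
# Made-up output of "sort -u" over two flattened files: one record per
# line, fields separated by tabs.
merged=$(mktemp)
printf '\\lx alpha\t\\ps n\n\\lx beta\t\\ps v\n\\lx beta\t\\ps v\t\\ge second\n' > "$merged"

# Field 1 of each line is the \lx field. uniq -d prints each value that
# occurs more than once, i.e. headwords with non-identical records.
conflicts=$(cut -f1 "$merged" | LC_ALL=C sort | uniq -d)

printf '%s\n' "$conflicts"
rm -f "$merged"
```

  The headwords listed this way are the ones worth opening in a visual 
  compare program.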

  Steps 1-3 and 4-6 can each be combined into single operations using 
  "pipes", avoiding some of the intermediate files (the extra sed 
  expression drops the tab left at the very end of each file, so the last 
  record compares equal to its copy in the other file):

  cat OldFile1.sfm | tr -s " \t" " " | tr -s "\r\n" "\t" | 
  sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//' > /tmp/OldFile1.sfm

  cat OldFile2.sfm | tr -s " \t" " " | tr -s "\r\n" "\t" | 
  sed -e 's/\t\\lx /\n\\lx /g' -e 's/\t$//' > /tmp/OldFile2.sfm

  sort -u /tmp/OldFile1.sfm /tmp/OldFile2.sfm | sed -e 's/$/\r\n\r/' | 
  sed -e 's/\t/\r\n/g' > NewFile.sfm

  All this presumes that you either have access to a Unix (Linux) machine, 
  or (more likely) that you use something like the CygWin Unix utilities 
  (far superior to the Windows command prompt, IMHO).

  Disclaimer: I haven't tested the above, there might be mistakes.

  Oops, one other thing I would do, call it step 3 1/2: get rid of any 
  space chars before tab chars. These would correspond to space chars at 
  the end of a line. They're not really a problem, except that they could 
  give you spurious non-identical records (if you accidentally put such 
  space chars in one file but not the other). Or maybe Shoebox enforces 
  this when it saves files?
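
  A minimal sketch of that extra step, assuming GNU sed (for \t) and input 
  that has already been flattened to one record per line. The function 
  name is hypothetical. It also trims spaces at the end of the line, which 
  covers the last field of a record.

```shell
strip_trailing_spaces() {
    # Remove spaces that sit directly before a tab (end of a field) or
    # at the end of the line (end of the last field).
    sed -e 's/ *\t/\t/g' -e 's/ *$//'
}

# Made-up flattened record with a stray space after the headword.
cleaned=$(printf '\\lx alpha \t\\ps n \n' | strip_trailing_spaces)
printf '%s\n' "$cleaned"
```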

  Links:

  http://www.ComponentSoftware.com/ (you can use the freeware version)
  http://cygwin.com/
  -- 
  Mike Maxwell
  maxwell at ldc.upenn.edu


   