[Corpora-List] Summary: Comparing files

Tine Lassen tine.lassen at tdcadsl.dk
Tue Nov 18 22:23:16 UTC 2003


A couple of days ago I posted a question:

"I'm doing a project that involves comparing two very large word lists (~40,000 and ~70,000 words). What I need to find out is which words are on one list and not on the other (and/or vice versa).
 Can anyone give me a hint as to how to do this? (I was thinking; maybe a perl script?)"

I got an overwhelming amount of answers, and I apologize for not answering everybody personally.
Thanks to the following people, who all suggested solutions to my problem:
 
 Lluís Padró 
 David Graff 
 JL delucca
 Darren Pearce 
 Vlado Keselj 
 Ken Beesley 
 Ronald P. Reck 
 Bob Krovetz 
 Erin McKean 
 Preslav Ivanov Nakov 
 Tony Abou-Assaleh 
 Miles Osborne 
 Paul Bijnens 
 Christer.Johansson
 Donna Byron 
 Alexiei Dingli 
 Otto Lassen 
 Marco Baroni 

Several people suggested using the Unix utilities sort in combination with diff or comm, which of course will give me what I'm looking for. 
However, what I didn't explain in my original query was that my lists are lexicon files, and thus are already sorted and consist of unique tokens. Given the difference in size between the two lists (30K words), I was hoping for a tool that did a little bit more than that.  

What I chose to go with was a Perl script written by David Graff:
> 
> I wrote a perl script myself some time ago, to do just this sort of 
> thing, as well as a few other things.  You can download it from the 
> "Perl Monks" web site:
> 
>   http://www.perlmonks.org/index.pl?node_id=160735
> 
> It actually does quite a few other things besides the simple task you 
> posed: given two lists, you can output the intersection, the union, or 
> the set of elements unique to one list or the other.
> 
> Some additional options let you do this on multi-column files as well as
> simple lists, choosing any column as the "join" field for relating the 
> two input lists, and specifying a specialized field delimiter if 
> necessary.
> 
> The Perl Monks site is an excellent resource -- you can search their 
> "code catacombs", FAQ's, and watch on-going discussions as people post 
> questions/findings and get answers/reactions.
> 
>     Best regards,
> 
> Dave Graff

Another interesting suggestion came from JL delucca:

From: "JL delucca" <delucca at email.com>
To: "Tine Lassen" <tine.lassen at tdcadsl.dk>
> Tine,
> 
> Well, well, well. I have a page on the Internet, http://www.dictionarium.com (main page), and a page on Computational Linguistics Programming (all mine): http://www.dictionarium.com.dictionarium.htm
> 
> At present there is no tool available on the page cited above that does what you want; however, I have tools running on Windows that do.
> 
> Send me your lists and I can try to do what you want.
> 
> Best wishes,
> 
> 
> J. L. De Lucca
 


Below are the rest of the answers I got:

From: "Marco Baroni" <baroni at sslmit.unibo.it>
> A perl script like the following should work (I haven't tested it, 
> though):
> 
> ************************************************************
> 
> #!/usr/bin/perl -w
> 
> # Usage: compare.pl firstlist secondlist
> # Prints the lines of the second list that do not occur in the first.
> open FIRSTLIST, shift or die "can't open first list: $!";
> while (<FIRSTLIST>) {
>     chomp;
>     $infirst{$_}++;
> }
> close FIRSTLIST;
> 
> open SECONDLIST, shift or die "can't open second list: $!";
> while (<SECONDLIST>) {
>     chomp;
>     if (!$infirst{$_}) {
>         print "$_\n";
>     }
> }
> close SECONDLIST;
> 
> ************************************************************
> 
> Regards,
> 
> Marco

 
From: "Alexiei Dingli" <alexiei at dcs.shef.ac.uk>
> Just add the elements to a hash data structure. It has O(1) access time, so
> when you insert into the hash, you know immediately whether the word already
> exists or not.
> 
> Lex
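The hash-table idea maps neatly onto awk, whose associative arrays are hash tables under the hood. A minimal sketch (the file names and sample contents below are made up for illustration):

```shell
# Two small sample word lists standing in for the real lexicon files.
printf 'cat\ndog\nmouse\n' > list1
printf 'cat\nrabbit\n' > list2

# While reading the first file (NR==FNR), store each word in a hash;
# while reading the second, print the words the hash does not contain.
awk 'NR==FNR { seen[$0] = 1; next } !($0 in seen)' list1 list2
# -> rabbit
```

Swapping the two file arguments gives the words unique to the other list.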
 
From: "Donna Byron" <dkbyron at yahoo.com>
> Are you on unix? If so you can use the command-line
> sort on each of your files with the -u flag to produce
> only one token of each unique item in the output, then
> diff the sorted files.  That will give you a
> type-level distinction, instead of token-level, but it
> sounds like that's what you wanted. 
>  
> Donna

From: <Christer.Johansson at lili.uib.no>
> In unix/linux you do something like
> 
> sort file1 > file1.sorted
> sort file2 > file2.sorted
> diff file1.sorted file2.sorted > result
> less result
> 
> (I'm not actually 100% confident diff is the best choice, 
> but look at these man pages:
> man sort 
> man diff 
> apropos compare
> )
> 
> Good luck
> 
>    /Christer
 

 
From: "Paul Bijnens" <paul.bijnens at xplanation.com>
> In unix/linux:  read the man page for the command "comm":
> 
> SYNOPSIS
>         comm [OPTION]... LEFT_FILE RIGHT_FILE
> 
> DESCRIPTION
>         Compare sorted files LEFT_FILE and RIGHT_FILE line by line.
> 
>         -1     suppress lines unique to left file
> 
>         -2     suppress lines unique to right file
> 
>         -3     suppress lines that appear in both files
> 
> Paul @ Home
 
From: Otto Lassen 
> Hi
> That could be done in any language:
> 1. sort the two lists 
> 2. compare them word for word 
> 3. output words which are not in both lists
> Regards
> Otto Lassen

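Steps 2 and 3 are essentially what the standard comm utility performs on sorted input: it walks the two lists in step and separates the words into columns. A minimal sketch, with made-up sample lists:

```shell
# Sample lists standing in for the real files.
printf 'dog\ncat\nmouse\n' > list1
printf 'rabbit\ncat\n' > list2

# 1. sort the two lists
sort list1 > list1.sorted
sort list2 > list2.sorted

# 2.+3. compare word for word; -3 suppresses the column of words
# common to both files, leaving only the words unique to each side.
comm -3 list1.sorted list2.sorted
```

comm prints the words unique to the first file in column one and those unique to the second in a tab-indented second column.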
From: "Miles Osborne" <miles at inf.ed.ac.uk>
To: "Otto Lassen" <otto at lassen.mail.dk>
> that's far too slow - use a hash table instead.  
> 
> now, this wouldn't be homework, would it?
> 
> Miles
> 
From: <radev at umich.edu>
To: "Miles Osborne" <miles at inf.ed.ac.uk>
Cc: "Otto Lassen" <otto at lassen.mail.dk>; "Tine Lassen" <tine.lassen at tdcadsl.dk>; 
> Here is a UNIX script:
> 
> % sort one | uniq > one.uniq
> % sort two | uniq > two.uniq
> % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
> 
> Here is an example
> 
> one:
> ==========
> cat
> dog
> cat
> mouse
> 
> two:
> ==========
> cat
> rabbit
> elephant
> rabbit
> 
> output:
> ==========
>    3 cat
>    2 mouse
>    2 dog
>    1 rabbit
>    1 elephant
> 
> 
> Words with a count of 3 appear in both "one" and "two".
> Words with a count of 2 appear in "one" only.
> Words with a count of 1 appear in "two" only.
> 
> --
> Drago

From: "Tony Abou-Assaleh" <taa at cs.dal.ca>
To: "Miles Osborne" <miles at inf.ed.ac.uk>; "Otto Lassen" <otto at lassen.mail.dk>; "Tine Lassen" <tine.lassen at tdcadsl.dk>; <CORPORA at HD.UIB.NO>; <radev at umich.edu>
> If the files are really big, sorting them first and then merging them (the
> merge step of merge sort) is probably the fastest. Sorting is O(N lg N) and
> merging is O(N). Alternatively, you could concatenate, sort, and then, if
> needed, traverse.
> 
> However, 40K and 70K words isn't really that much, especially if you don't
> need to repeat every few seconds. So in your case, Drago's method is the
> simplest and would be fast enough.
> 
> TAA
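The sort-then-merge route can be sketched with sort -m, which merges already-sorted files in linear time. Since the lexicon files are said to contain unique tokens, uniq -u then keeps exactly the words appearing in only one list (the sample contents below are made up):

```shell
printf 'cat\ndog\nmouse\n' | sort > list1.sorted
printf 'cat\nrabbit\n' | sort > list2.sorted

# sort -m does a linear-time merge of pre-sorted input; uniq -u keeps
# only unrepeated lines, i.e. the symmetric difference of the two lists.
sort -m list1.sorted list2.sorted | uniq -u
# -> dog
#    mouse
#    rabbit
```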

From: "Preslav Ivanov Nakov" <nakov at eecs.berkeley.edu>
> If you have enough memory to fit the smaller one there (and for any
> reasonable wordlist size I can imagine, you should have),
> a hash table is much better than sorting - it is linear in the size of both
> files.
> 
> Preslav
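A sorting-free alternative in the same spirit is grep's fixed-string mode (whether grep builds an actual hash table internally is an implementation detail; the file names and contents below are placeholders):

```shell
printf 'cat\ndog\nmouse\n' > list1
printf 'cat\nrabbit\n' > list2

# -F fixed strings, -x whole-line match, -v invert, -f patterns from file:
# prints the lines of list2 that occur nowhere in list1.
grep -Fxvf list1 list2
# -> rabbit
```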
> 
From: "Erin McKean" <editor at verbatimmag.com>
To: "Tine Lassen" <tine.lassen at tdcadsl.dk>
> I believe the Perl Cookbook has a good way to do this -- read the 
> words into an array and compare the arrays.
> 
> Lots of text editors, including EditPadPro, have compare functions, 
> and if you have access to a unix system the command 'diff' will work, 
> too.
> 
> Good luck!
> 
> Erin McKean
> 

From: "Bob Krovetz" <bkrovetz at askjeeves.com>
> For files that are this small, the time difference between O(n) and O(n log n)
> isn't a big deal on typical PCs.  I would also recommend looking at the
> "comm" command in Unix/Linux.  With the comm command you would only have to do two
> sorts (one for each file).
>  
> 
> > If you have enough memory to fit the smaller one there (and for any
> > reasonable wordlist size I can imagine, you should have), a hash table is
> > much better than sorting - it is linear in the size of both files.
> >
> > Preslav
>  
From: "Ronald P. Reck" <rreck at iama.rrecktek.com>
> The UNIX 'diff' command can do this easily.
> 
> ie: diff file1 file2
> 
> If I can help let me know.
> 
> This is from the diff man page:
> 
> NAME
>        diff - find differences between two files
> 
> SYNOPSIS
>        diff [options] from-file to-file
> 
From: "Vlado Keselj" <vlado at cs.dal.ca>
To: <radev at umich.edu>
Cc: "Miles Osborne" <miles at inf.ed.ac.uk>; "Otto Lassen" <otto at lassen.mail.dk>; "Tine Lassen" <tine.lassen at tdcadsl.dk>; <CORPORA at HD.UIB.NO>
> On Sat, 15 Nov 2003 radev at umich.edu wrote:
> 
> > Here is a UNIX script:
> >
> > % sort one | uniq > one.uniq
> > % sort two | uniq > two.uniq
> > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
> 
> A similar question was asked about 2.5 years ago on the corpora-list.
> (Is this a candidate for FAQ?)  This was my answer:
> 
> Date: Fri Apr 20 2001 - 17:04:03 MET DST
> Subject: Re: Corpora: FW: help - comparing word lists
> 
> On Unix, Linux and similar: You can sort both lists and use comm, e.g.:
> sort -u < list1 > list1.sorted
> sort -u < list2 > list2.sorted
> comm -23 list1.sorted list2.sorted
> 
> It will output the words that are on list1 but not on list2.
> Both commands are pretty efficient.

From: "Ken Beesley" <Ken.Beesley at xrce.xerox.com>
> Tine,
> 
> The standard Unix utility 'comm' will do this for you.
> 
> 
> NAME
>      comm - select or reject lines common to two files
> 
> SYNOPSIS
>      comm [-123] file1 file2
> 
> DESCRIPTION
>      The comm utility will read file1 and file2, which should  be
>      ordered in the current collating sequence, and produce three
>      text columns as output: lines only in file1; lines  only  in
>      file2; and lines in both files.
> 
>      If the input files were ordered according to  the  collating
>      sequence of the current locale, the lines written will be in
>      the collating sequence of the original lines.  If  not,  the
>      results are unspecified.
> 
> OPTIONS
>      The following options are supported:
> 
>      -1    Suppress the output column of lines unique to file1.
> 
>      -2    Suppress the output column of lines unique to file2.
> 
>      -3    Suppress the output  column  of  lines  duplicated  in
>            file1 and file2.
> 
> 
> i.e. file1 and file2 are text files that have one word on
> each line.  Each of the files must be in the current collating 
> sequence (the standard unix utility is 'sort').
> 
> It will produce, by default, output in three columns:
> 
> Words-only-in-file1    Words-only-in-file2   Words-in-both-files
> 
> 
> Say your original wordlist files are list1 and list2.  Here's
> the drill, using 'sort' to make sure that they are sorted right
> for comm.
> 
> sort list1 > list1.sorted
> sort list2 > list2.sorted
> comm list1.sorted list2.sorted > commoutput
> 
> Then examine commoutput with your favorite text editor.
> 
> You can use the optional flags as shown to suppress one or more of
> the columns of output.
> 
> Good luck.
> 
> Ken

From: "Darren Pearce" <Darren.Pearce at sussex.ac.uk>
To: "Tine Lassen" <tine.lassen at tdcadsl.dk>
Cc: "Miles Osborne" <miles at inf.ed.ac.uk>; "Otto Lassen" <otto at lassen.mail.dk>; <radev at umich.edu>; <CORPORA at HD.UIB.NO>
> A more efficient version that doesn't require sorting the concatenated
> list uses the Unix command 'comm':
> 
> % sort list1 > list1.sorted
> % sort list2 > list2.sorted
> 
> To see those items that occur in list1 but not in list2, use:
> 
> % comm -2 -3 list1.sorted list2.sorted
> 
> To see those items that occur in list2 but not in list1, use:
> 
> % comm -1 -3 list1.sorted list2.sorted
> 
> If needed, to see those items that appear in both lists, use:
> 
> % comm -1 -2 list1.sorted list2.sorted
> 
> The sorting commands could use 'uniq', but assuming the word lists
> contain no repeats in the first place, this isn't necessary. The lists
> may already be sorted too, which would mean the first two commands
> wouldn't be necessary at all. (You can check whether they are sorted
> with 'sort -c'.)
> 
> :Darren.

From: "Lluís Padró" <padro at lsi.upc.es>
>   sort list1 > list1.sorted
>   sort list2 > list2.sorted
>   join -v1 list1.sorted list2.sorted 
> 
>   (if you use -v2 instead, you'll get words in list2 and not in list1)
> 
>        best
> -- 
> ------------------------------------------------------------------------
> * Lluís Padró i Cirera * UNIVERSITAT POLITÈCNICA DE CATALUNYA


