[Corpora-List] Comparing files

Preslav Ivanov Nakov nakov at eecs.berkeley.edu
Sat Nov 15 23:47:19 UTC 2003


If you have enough memory to fit the smaller one there (and for any
reasonable wordlist size I can imagine, you should have),
a hash table is much better than sorting - it is linear in the size of both
files.

Preslav

----- Original Message -----
From: "Tony Abou-Assaleh" <taa at cs.dal.ca>
To: "Miles Osborne" <miles at inf.ed.ac.uk>; "Otto Lassen"
<otto at lassen.mail.dk>; "Tine Lassen" <tine.lassen at tdcadsl.dk>;
<CORPORA at HD.UIB.NO>; <radev at umich.edu>
Sent: Saturday, November 15, 2003 3:25 PM
Subject: Re: [Corpora-List] Comparing files


> If the files are really big, sorting them first and then merging using a
> merge sort is probably the fastest. Sorting is O(N lg N) and merging is
> O(N). Alternatively, you could concatenate, sort, and then if needed
> traverse.
>
> However, 40K and 70K words isn't really that much, especially if you don't
> need to repeat every few seconds. So in your case, Drago's method is the
> simplest and would be fast enough.
>
> TAA
>
> --------------------------------------------------
> Tony Abou-Assaleh
> Ph.D. Candidate, Faculty of Computer Science
> Dalhousie University, Halifax, NS, Canada, B3H 1W5
> Fax:   902-492-1517
> Email: taa at acm.org
> WWW:   http://www.cs.dal.ca/~taa/
> ---------------------[THE END]--------------------
>
>
> On Sat, 15 Nov 2003 radev at umich.edu wrote:
>
> > Here is a UNIX script:
> >
> > % sort one | uniq > one.uniq
> > % sort two | uniq > two.uniq
> > % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
> >
> > Here is an example
> >
> > one:
> > ==========
> > cat
> > dog
> > cat
> > mouse
> >
> > two:
> > ==========
> > cat
> > rabbit
> > elephant
> > rabbit
> >
> > output:
> > ==========
> >    3 cat
> >    2 mouse
> >    2 dog
> >    1 rabbit
> >    1 elephant
> >
> >
> > Words with a count of 3 appear in both "one" and "two".
> > Words with a count of 2 appear in "one" only.
> > Words with a count of 1 appear in "two" only.
> >
> > --
> > Drago
> >
> >
> > Miles Osborne wrote:
> > >
> > > that's far too slow -use a hash table instead.
> > >
> > > now, this wouldn't be homework, would it?
> > >
> > > Miles
> > >
> > > Quoting Otto Lassen <otto at lassen.mail.dk>:
> > >
> > > > Hi
> > > > That could be done in any language:
> > > > 1. sort then two lists
> > > > 2. compare them word for word
> > > > 3. output words which are not in both lists
> > > > Regards
> > > > Otto Lassen
> > > >
> > > > At 21:54 15-11-2003 +0100, you wrote:
> > > > >Hi,
> > > > >
> > > > >I'm doing a project that involves comparing two very large word
lists
> > > >
> > > > >(~40.000 and 70.000 words). What I need to find out, is which words
are
> > > > on
> > > > >one list and not on the other (and/or vice versa).
> > > > >Can anyone give me a hint as to how to do this? (I was thinking;
maybe
> > > > a
> > > > >perl script?)
> > > > >
> > > > >Any help will be greatly appreciated.
> > > > >Best,
> > > > >Tine Lassen
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > Dragomir R. Radev
radev at umich.edu
> > Assistant Professor of Information, Electrical Engineering and
> > Computer Science, and Linguistics, the University of Michigan, Ann Arbor
> > Phone: 734-615-5225   Fax: 734-764-2475
http://www.si.umich.edu/~radev
> >
>
>



More information about the Corpora mailing list