[Corpora-List] Comparing files

Sun Nov 16 13:30:29 UTC 2003

On Sat, 15 Nov 2003 radev at umich.edu wrote:

> Here is a UNIX script:
>
> % sort one | uniq > one.uniq
> % sort two | uniq > two.uniq
> % cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output

A similar question was asked about 2.5 years ago on the corpora-list.
(Is this a candidate for FAQ?)  This was my answer:

Date: Fri Apr 20 2001 - 17:04:03 MET DST
Subject: Re: Corpora: FW: help - comparing word lists

On Unix, Linux and similar: You can sort both lists and use comm, e.g.:
sort -u < list1 > list1.sorted
sort -u < list2 > list2.sorted
comm -23 list1.sorted list2.sorted

It will output the words that are on list1 but not on list2.
Both commands are efficient: sort takes O(n log n) time, and comm makes a
single linear pass over the two sorted files.
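
The same two sorted files also give the other two answers: comm's -1, -2 and
-3 options each suppress one output column. Below is a minimal sketch, using
the small "one"/"two" example from the quoted message as data. The temporary
directory, the file names under it, and the LC_ALL=C setting are my own
assumptions for illustration; the fixed locale is only there so that sort and
comm agree on the ordering.

```shell
# Hypothetical sample data in a scratch directory (illustration only).
tmp=$(mktemp -d)
printf '%s\n' cat dog cat mouse > "$tmp/list1"
printf '%s\n' cat rabbit elephant rabbit > "$tmp/list2"

# Sort and de-duplicate both lists under one fixed collation order.
LC_ALL=C sort -u < "$tmp/list1" > "$tmp/list1.sorted"
LC_ALL=C sort -u < "$tmp/list2" > "$tmp/list2.sorted"

# Suppress columns 2 and 3: words on list1 but not on list2.
LC_ALL=C comm -23 "$tmp/list1.sorted" "$tmp/list2.sorted" > "$tmp/only1"
# Suppress columns 1 and 3: words on list2 but not on list1.
LC_ALL=C comm -13 "$tmp/list1.sorted" "$tmp/list2.sorted" > "$tmp/only2"
# Suppress columns 1 and 2: words on both lists.
LC_ALL=C comm -12 "$tmp/list1.sorted" "$tmp/list2.sorted" > "$tmp/both"
```

With this data, only1 should hold dog and mouse, only2 elephant and rabbit,
and both just cat.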

Vlado

On Fri, 20 Apr 2001, Wiesheu, Martin wrote:

> hello out there,
>
> could anyone help me on the following question?:
>
> is there any tool or method to efficiently compare two very long word lists
> to see what words are on one list and not on the other?
>
> thanks,
>
> martin
>
>
> Martin Wiesheu
> ZGS Research
> COMMERZBANK Securities
>
> Tel. + 49 - 69 - 136 43730
> Fax. + 49 - 69 - 136 27445

> Here is an example
>
> one:
> ==========
> cat
> dog
> cat
> mouse
>
> two:
> ==========
> cat
> rabbit
> elephant
> rabbit
>
> output:
> ==========
>    3 cat
>    2 mouse
>    2 dog
>    1 rabbit
>    1 elephant
>
>
> Words with a count of 3 appear in both "one" and "two".
> Words with a count of 2 appear in "one" only.
> Words with a count of 1 appear in "two" only.
>
> --
> Drago
>
>
> Miles Osborne wrote:
> >
> > that's far too slow - use a hash table instead.
> >
> > now, this wouldn't be homework, would it?
> >
> > Miles
> >
> > Quoting Otto Lassen <otto at lassen.mail.dk>:
> >
> > > Hi
> > > That could be done in any language:
> > > 1. sort the two lists
> > > 2. compare them word for word
> > > 3. output words which are not in both lists
> > > Regards
> > > Otto Lassen
> > >
> > > At 21:54 15-11-2003 +0100, you wrote:
> > > >Hi,
> > > >
> > > >I'm doing a project that involves comparing two very large word lists
> > > >(~40.000 and 70.000 words). What I need to find out, is which words are on
> > > >one list and not on the other (and/or vice versa).
> > > >Can anyone give me a hint as to how to do this? (I was thinking; maybe a
> > > >perl script?)
> > > >
> > > >Any help will be greatly appreciated.
> > > >Best,
> > > >Tine Lassen
> > >
> > >
> >
> >
>
>
> --
> Dragomir R. Radev                                         radev at umich.edu
> Assistant Professor of Information, Electrical Engineering and
> Computer Science, and Linguistics, the University of Michigan, Ann Arbor
> Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev
>
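
For comparison, the hash-table approach suggested in the quoted thread can be
sketched with awk, whose arrays are associative. This is only an illustration
under my own assumptions: the scratch directory, the file names and the array
names "seen" and "printed" are hypothetical, and the sample data is again the
"one"/"two" example from above. No sorting is needed, and the words come out
in the order in which they first appear in "one". Note that the NR == FNR test
assumes the first file is not empty.

```shell
# Hypothetical sample data in a scratch directory (illustration only).
tmp2=$(mktemp -d)
printf '%s\n' cat dog cat mouse > "$tmp2/one"
printf '%s\n' cat rabbit elephant rabbit > "$tmp2/two"

# First file ("two", where NR == FNR): record every word as an array key.
# Second file ("one"): print each word that is not a key, and only once.
awk 'NR == FNR { seen[$0] = 1; next }
     !($0 in seen) && !printed[$0]++' "$tmp2/two" "$tmp2/one" > "$tmp2/only_one"
```

Swapping the two file arguments gives the words that are on "two" only.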