[Corpora-List] Comparing files

Dragomir Radev radev at umich.edu
Sat Nov 15 22:16:14 UTC 2003


Here is a UNIX script:

% sort one | uniq > one.uniq
% sort two | uniq > two.uniq
% cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output

Here is an example:

one:
==========
cat
dog
cat
mouse

two:
==========
cat
rabbit
elephant
rabbit

output:
==========
   3 cat
   2 mouse
   2 dog
   1 rabbit
   1 elephant


Words with a count of 3 appear in both "one" and "two".
Words with a count of 2 appear in "one" only.
Words with a count of 1 appear in "two" only.
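
The same three-way split can probably be had more directly. What follows is only a sketch, not a tested replacement for the script above. It assumes the standard "comm" utility and an awk with associative arrays are available; the output file names (both.txt, only_one.txt, only_two.txt, only_two_awk.txt) are made up for illustration. The awk line is one way to do a hash-table lookup without sorting.

```shell
# Recreate the sample files from the example above.
printf '%s\n' cat dog cat mouse > one
printf '%s\n' cat rabbit elephant rabbit > two

# comm needs sorted input; sort -u sorts and drops duplicates in one step.
# LC_ALL=C keeps sort and comm on the same byte-wise collation order.
LC_ALL=C sort -u one > one.uniq
LC_ALL=C sort -u two > two.uniq

# comm prints three columns: only in file 1, only in file 2, in both.
# The options -1, -2 and -3 suppress the corresponding column.
LC_ALL=C comm -12 one.uniq two.uniq > both.txt       # in both lists
LC_ALL=C comm -23 one.uniq two.uniq > only_one.txt   # in "one" only
LC_ALL=C comm -13 one.uniq two.uniq > only_two.txt   # in "two" only

# Hash-table variant: awk loads "one" into an associative array, then
# prints each distinct line of "two" that was not seen in "one".
awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen) && !printed[$0]++' one two > only_two_awk.txt
```

With the sample data, both.txt should hold "cat", only_one.txt "dog" and "mouse", and only_two.txt "elephant" and "rabbit". The awk variant keeps the input order of "two" instead of sorting. Its NR == FNR test assumes "one" is not empty.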

--
Drago


Miles Osborne wrote:
>
> that's far too slow - use a hash table instead.
>
> now, this wouldn't be homework, would it?
>
> Miles
>
> Quoting Otto Lassen <otto at lassen.mail.dk>:
>
> > Hi
> > That could be done in any language:
> > 1. sort the two lists
> > 2. compare them word for word
> > 3. output words which are not in both lists
> > Regards
> > Otto Lassen
> >
> > At 21:54 15-11-2003 +0100, you wrote:
> > >Hi,
> > >
> > >I'm doing a project that involves comparing two very large word lists
> > >(~40.000 and 70.000 words). What I need to find out, is which words are on
> > >one list and not on the other (and/or vice versa).
> > >Can anyone give me a hint as to how to do this? (I was thinking; maybe a
> > >perl script?)
> > >
> > >Any help will be greatly appreciated.
> > >Best,
> > >Tine Lassen
> >
> >
>
>


--
Dragomir R. Radev                                         radev at umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225   Fax: 734-764-2475    http://www.si.umich.edu/~radev


