[Corpora-List] Comparing files
Dragomir Radev
radev at umich.edu
Sat Nov 15 22:16:14 UTC 2003
Here is a UNIX script:
% sort one | uniq > one.uniq
% sort two | uniq > two.uniq
% cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output
Here is an example:
one:
==========
cat
dog
cat
mouse
two:
==========
cat
rabbit
elephant
rabbit
output:
==========
3 cat
2 mouse
2 dog
1 rabbit
1 elephant
Words with a count of 3 appear in both "one" and "two".
Words with a count of 2 appear in "one" only.
Words with a count of 1 appear in "two" only.
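If you want the three groups as separate lists, the counts in "output" can be filtered with awk. This is only a sketch built on the example above: the file names both.txt, only_one.txt and only_two.txt are made up for illustration, and it assumes one word per line with no spaces inside a word.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Recreate the example files "one" and "two" from above.
printf '%s\n' cat dog cat mouse > one
printf '%s\n' cat rabbit elephant rabbit > two

# Same pipeline as above: "one" is counted twice, "two" once.
sort one | uniq > one.uniq
sort two | uniq > two.uniq
cat one.uniq one.uniq two.uniq | sort | uniq -c | sort -nr > output

# uniq -c prints "count word"; awk splits on whitespace, so $1 is the
# count and $2 is the word.
awk '$1 == 3 { print $2 }' output | sort > both.txt      # cat
awk '$1 == 2 { print $2 }' output | sort > only_one.txt  # dog, mouse
awk '$1 == 1 { print $2 }' output | sort > only_two.txt  # elephant, rabbit
```

For this task only_one.txt and only_two.txt are the lists of interest: the words that are on one list but not on the other.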
--
Drago
Miles Osborne wrote:
>
> that's far too slow - use a hash table instead.
>
> now, this wouldn't be homework, would it?
>
> Miles
>
> Quoting Otto Lassen <otto at lassen.mail.dk>:
>
> > Hi
> > That could be done in any language:
> > 1. sort the two lists
> > 2. compare them word for word
> > 3. output words which are not in both lists
> > Regards
> > Otto Lassen
> >
> > At 21:54 15-11-2003 +0100, you wrote:
> > >Hi,
> > >
> > >I'm doing a project that involves comparing two very large word lists
> > >(~40.000 and 70.000 words). What I need to find out is which words are on
> > >one list and not on the other (and/or vice versa).
> > >Can anyone give me a hint as to how to do this? (I was thinking, maybe a
> > >Perl script?)
> > >
> > >Any help will be greatly appreciated.
> > >Best,
> > >Tine Lassen
> >
> >
>
>
--
Dragomir R. Radev radev at umich.edu
Assistant Professor of Information, Electrical Engineering and
Computer Science, and Linguistics, the University of Michigan, Ann Arbor
Phone: 734-615-5225 Fax: 734-764-2475 http://www.si.umich.edu/~radev