[Corpora-List] Q: How to identify duplicates in a large document collection

Alex Murzaku lissus at gmail.com
Wed Jan 5 14:21:00 UTC 2005


I would suggest using Lucene (http://jakarta.apache.org/lucene), which
is a transparent, scalable, open-source search engine library written
in Java. Not only can you find duplicates (everything with close to
100% similarity), but you could also use it for other corpus search
endeavors. I have actually used it for clustering documents.
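
For example, here is a minimal sketch of near-duplicate lookup using
Lucene's MoreLikeThis helper, assuming an index built beforehand with a
full-text field named "body". The package paths and signatures below
have moved between Lucene releases, so take this as an illustration of
the approach rather than version-exact code:

    import java.io.StringReader;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    public class NearDuplicates {
        public static void main(String[] args) throws Exception {
            String suspectText = "full text of the document to check";
            try (DirectoryReader reader =
                     DirectoryReader.open(FSDirectory.open(Paths.get("corpus-index")))) {
                MoreLikeThis mlt = new MoreLikeThis(reader);
                mlt.setAnalyzer(new StandardAnalyzer());
                mlt.setFieldNames(new String[] {"body"}); // the indexed full-text field
                mlt.setMinTermFreq(1);                    // keep even rare terms
                mlt.setMinDocFreq(1);

                // Turn the suspect document into a similarity query and rank
                // the collection against it; top-scoring hits are duplicate
                // candidates.
                Query query = mlt.like("body", new StringReader(suspectText));
                IndexSearcher searcher = new IndexSearcher(reader);
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(hit.doc + "\t" + hit.score);
                }
            }
        }
    }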


On Wed, 05 Jan 2005 06:33:43 -0500, William Fletcher <fletcher at usna.edu> wrote:
> Hi Marc and Normand,
> 
> How about sharing your scripts? I am sure everyone would be grateful for an off-the-shelf solution that could easily be adapted to one's own needs or serve as inspiration for other applications.
> 
> Regards,
> Bill
> 
> >>> Marc Kupietz <kupietz at ids-mannheim.de> 1/5/2005 5:59:20 AM >>>
> On Wednesday, 12.01.2005, at 14:40 -0500, Normand Peladeau wrote:
> > Sorry if my suggestion is irrelevant or inadequate, but what about creating
> > an inverted index of this document collection and using that inverted index
> > to retrieve the most similar documents? I just implemented such an
> > algorithm, and without much effort spent on speed optimization, I was
> > able to compare the similarity of a document against a collection of 100,000
> > documents indexed on about 3,000 index terms, and it took less than 0.4
> > seconds to retrieve the most similar documents. Increasing the spread of
> > the index or the size of the collection would certainly increase the
> > computing time, but it would probably take no more than a minute or two
> > to retrieve the duplicate documents in your collection.
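
A minimal sketch of the inverted-index lookup Peladeau describes, in
Java for consistency with the Lucene example above. It assumes each
document has already been reduced to a set of index terms; since his
weighting scheme is not given, a plain shared-term count stands in for
it:

    import java.util.*;

    public class InvertedIndexSimilarity {
        // Term -> ids of the documents containing it.
        private final Map<String, List<Integer>> postings = new HashMap<>();

        // Register document docId, represented by its set of index terms.
        public void add(int docId, Set<String> terms) {
            for (String t : terms) {
                postings.computeIfAbsent(t, k -> new ArrayList<>()).add(docId);
            }
        }

        // For a query document, count how many index terms it shares with
        // every other document by walking one postings list per query term.
        public Map<Integer, Integer> sharedTermCounts(Set<String> queryTerms) {
            Map<Integer, Integer> score = new HashMap<>();
            for (String t : queryTerms) {
                for (int docId : postings.getOrDefault(t, List.of())) {
                    score.merge(docId, 1, Integer::sum);
                }
            }
            return score; // sort by value, descending, for the most similar docs
        }
    }

Because only the postings lists of the query's own terms are walked,
documents that share nothing with the query cost nothing, which is what
keeps a single lookup well under a second even against 100,000
documents.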
> >
> 
> What you describe (in better terms than I did...) is indeed
> approximately part of what we do: we do construct an index, but its
> keys are not a selection of terms; they are hash keys for all
> occurring n-grams (e.g. 5-word sequences), or for most of them,
> depending on the normalization function, which may delete some
> frequent, uncharacteristic words. Once the index construction is
> complete, looking up the (near) duplicates of a single document takes
> almost no time. What actually takes two hours for 1,000,000 documents
> is constructing the index and computing a complete similarity matrix
> for all documents (the output is, of course, constrained by some
> minimum overlap ratio...).
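
Kupietz's Perl source itself is not shown in the thread; the following
is a minimal Java sketch of the scheme he describes, with a crude
whitespace tokenizer and Java's built-in string hash standing in for
his unspecified normalization and hash function:

    import java.util.*;

    public class NgramIndex {
        static final int N = 5; // n-gram length in words

        // Hash every n-word sequence of every (crudely normalized) document
        // and record which documents contain each hash.
        public static Map<Long, List<Integer>> buildIndex(List<String> docs) {
            Map<Long, List<Integer>> index = new HashMap<>();
            for (int id = 0; id < docs.size(); id++) {
                String[] w = docs.get(id).toLowerCase().split("\\s+");
                for (int i = 0; i + N <= w.length; i++) {
                    long h = String.join(" ", Arrays.copyOfRange(w, i, i + N)).hashCode();
                    List<Integer> ids = index.computeIfAbsent(h, k -> new ArrayList<>());
                    if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                        ids.add(id); // record each document at most once per hash
                    }
                }
            }
            return index;
        }

        // Count shared n-grams for every pair of documents that meet under
        // some hash key; this is the raw material of the similarity matrix.
        public static Map<Long, Integer> sharedNgrams(Map<Long, List<Integer>> index) {
            Map<Long, Integer> pairCounts = new HashMap<>();
            for (List<Integer> ids : index.values()) {
                for (int a = 0; a < ids.size(); a++) {
                    for (int b = a + 1; b < ids.size(); b++) {
                        long pair = ((long) ids.get(a) << 32) | ids.get(b); // pack both ids
                        pairCounts.merge(pair, 1, Integer::sum);
                    }
                }
            }
            return pairCounts;
        }
    }

Dividing each pair's count by the smaller document's total number of
n-grams gives an overlap ratio to threshold against, which corresponds
to the minimum overlap constraint mentioned above.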
> 
> As you said, this is indeed extremely simple: without optimizations
> or a sophisticated hash function, my initial Perl source code was
> less than a screen long. I was all the more surprised that apparently
> (please correct me...) nobody has done this before.
> 
> Best regards
>  Marc
> 
> --
> Marc Kupietz                                      Tel. (+49) 621/1581-409
> Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
> PO Box 101621, 68016 Mannheim, Germany        http://www.ids-mannheim.de/
> 
>


