[Corpora-List] Q: How to identify duplicates in a largedocument collection

Wed Jan 5 10:59:20 UTC 2005

Am Mittwoch, den 12.01.2005, 14:40 -0500 schrieb Normand Peladeau:
> Sorry if my suggestion is irrelevant or inadequate, but what about creating 
> an inverted index of this document collection and using this inverted index 
> to retrieved the most similar documents.  I just implemented such an 
> algorithm and without a lot of efforts spent on speed optimization, I was 
> able to compare the similarity of a document to a collection of 100,000 
> documents indexed on about 3000 index terms and it took less than 0.4 
> seconds to retrieve the most similar documents. Increasing the spread of 
> the index or the size of the collection of documents would definitely 
> increase the computing speed but it would probably take no more than a 
> minute or two to retrieved duplicate documents in your collection.
> 

What you describe (in better terms than I did...) is indeed
approximately part of what we do: We certainly construct an index, but
the index keys are not a selection of terms, but hash-keys for all (or
most - depending on the normalization function, which may delete some
frequent uncharacteristic words) occurring n-grams (e.g.
5-word-sequences). Once the index construction is complete the lookup of
(near) duplicates of a single document certainly takes almost no time.
What actually takes 2 hours for 1.000.000 documents is the construction
of the index and the computation of a complete similarity matrix (the
output is certainly constrained by some minimum overlap ratio...) for
all documents.

As you said, this is indeed extremely simple and without optimizations
and sophisticated hash function my initial Perl-source code was less
than a screen long. The more surprised I was that apparently (please
correct me...) nobody did this before. 

Best regards
 Marc

-- 
Marc Kupietz                                      Tel. (+49) 621/1581-409
Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
PO Box 101621, 68016 Mannheim, Germany        http://www.ids-mannheim.de/