[Corpora-List] Q: How to identify duplicates in a large document collection

Wed Dec 22 17:23:34 UTC 2004

I found several papers on this topic by working backwards and sideways
from:

On the Evolution of Clusters of Near-Duplicate Web Pages
Dennis Fetterly; Mark Manasse; Marc Najork
http://research.microsoft.com/research/pubs/view.aspx?type=Publication&id=1096

However, I am curious whether there is somebody on this list who has
actually implemented a method such as the one described in this paper
(based on fingerprints of fingerprints of ``shingles'', as they call word
sequences...), and who could provide more concrete advice about this
important issue.
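
To make the question concrete, here is a rough Python sketch of the
general idea as I understand it. It is an illustration only, not the
authors' implementation: the function names, the BLAKE2b hash (standing in
for whatever fingerprint function one prefers), the shingle length of 5
words, the 84 minima grouped into 6 "supershingles" of 14, and the
threshold of 2 shared supershingles are all assumptions chosen for the
example.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations


def fingerprint(data, seed=0):
    """64-bit fingerprint of a byte string (BLAKE2b is an assumed stand-in)."""
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def shingles(text, k=5):
    """Set of overlapping k-word sequences, lowercased, as byte strings."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {" ".join(words).encode()}
    return {" ".join(words[i:i + k]).encode()
            for i in range(len(words) - k + 1)}


def sketch(text, k=5, num_hashes=84):
    """For each seeded hash, keep the smallest shingle fingerprint."""
    sh = shingles(text, k)
    if not sh:
        return []
    return [min(fingerprint(s, seed) for s in sh)
            for seed in range(num_hashes)]


def supershingles(sk, group_size=14):
    """Fingerprint consecutive groups of minima: fingerprints of fingerprints."""
    out = []
    for i in range(0, len(sk), group_size):
        group = sk[i:i + group_size]
        out.append(fingerprint(b"".join(v.to_bytes(8, "big") for v in group)))
    return out


def candidate_pairs(docs, min_shared=2):
    """Pairs of document ids sharing at least min_shared supershingles.

    docs maps a document id to its text. Bucketing by supershingle avoids
    comparing every document with every other one.
    """
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, ss in enumerate(supershingles(sketch(text))):
            buckets[(pos, ss)].append(doc_id)
    counts = defaultdict(int)
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_shared}
```

Calling candidate_pairs({"a": text_a, "b": text_b, ...}) returns the id
pairs worth a closer look. Whether this scales comfortably to a million
texts, and which parameters work well in practice, is exactly what I would
like to hear about from someone with hands-on experience.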

Regards,

Marco

On Wed, 22 Dec 2004, Ralf Steinberger wrote:

> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?