[Corpora-List] Q: How to identify duplicates in a large document collection

Marco Baroni baroni at sslmit.unibo.it
Wed Dec 22 17:23:34 UTC 2004

I found several papers about this topic working backwards and sideways from:

On the Evolution of Clusters of Near-Duplicate Web Pages
Dennis Fetterly; Mark Manasse; Marc Najork

However, I am curious if there is somebody on this list who actually
implemented a method such as the one described in this paper (based on
fingerprints of fingerprints of ``shingles'', as they call word
sequences...), and could provide more concrete advice about this important topic.
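
To make the "fingerprints of fingerprints of shingles" idea a bit more concrete, here is a minimal Python sketch of how I understand the general scheme. It is not the paper's implementation: salted BLAKE2b stands in for whatever fingerprint function the authors use, and the function names and the parameters k, num_hashes, group_size and min_shared are illustrative assumptions of mine, not values taken from the paper.

```python
import hashlib
from collections import Counter, defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-word sequences ("shingles") in text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(data, seed=0):
    """64-bit fingerprint of bytes; salted BLAKE2b is an assumed stand-in."""
    h = hashlib.blake2b(data, digest_size=8, salt=seed.to_bytes(8, "little"))
    return int.from_bytes(h.digest(), "big")


def minhash_signature(text, k=5, num_hashes=24):
    """For each seeded fingerprint function, keep the smallest shingle value."""
    encoded = [s.encode("utf-8") for s in shingles(text, k)]
    if not encoded:
        return []
    return [min(fingerprint(s, seed) for s in encoded)
            for seed in range(num_hashes)]


def supershingles(signature, group_size=4):
    """Fingerprint consecutive groups of minima (fingerprints of fingerprints)."""
    result = set()
    for start in range(0, len(signature), group_size):
        group = signature[start:start + group_size]
        data = b"".join(value.to_bytes(8, "big") for value in group)
        # Keep the group position so only matching positions are compared.
        result.add((start // group_size, fingerprint(data)))
    return result


def near_duplicate_pairs(docs, min_shared=2):
    """Return pairs of document ids sharing at least min_shared supershingles."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for ss in supershingles(minhash_signature(text)):
            index[ss].append(doc_id)
    shared = Counter()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


docs = {
    "a": "we are facing the task of having to find duplicate documents in a collection",
    "b": "we are facing the task of having to find duplicate documents in a collection",
    "c": "shingles are contiguous word sequences taken from the text of every single page",
}
print(near_duplicate_pairs(docs))
```

The point of the inverted index in near_duplicate_pairs is that only documents sharing at least one supershingle are ever compared, which is what should make this feasible for a collection of about a million texts. Whether a lightly edited copy is caught depends on the parameters above, so they would need tuning on real data.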



On Wed, 22 Dec 2004, Ralf Steinberger wrote:

> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?
