[Corpora-List] Q: How to identify duplicates in a large document collection
Marco Baroni
baroni at sslmit.unibo.it
Wed Dec 22 17:23:34 UTC 2004
I found several papers about this topic working backwards and sideways
from:
On the Evolution of Clusters of Near-Duplicate Web Pages
Dennis Fetterly; Mark Manasse; Marc Najork
http://research.microsoft.com/research/pubs/view.aspx?type=Publication&id=1096
However, I am curious if there is somebody on this list who actually
implemented a method such as the one described in this paper (based on
fingerprints of fingerprints of ``shingles'', as they call word
sequences...), and could provide more concrete advice about this important
issue.
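
For concreteness, here is a rough Python sketch of how I understand the general idea. It is not the paper's exact procedure. The shingle width, the number of hash functions, the group size, and the use of seeded BLAKE2b in place of Rabin fingerprints are all illustrative assumptions of mine, and every function name is made up for this example.

```python
import hashlib
from itertools import combinations


def shingles(text, w=5):
    """Return the set of w-word sequences ("shingles") in text."""
    words = text.lower().split()
    if len(words) < w:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(text_a, text_b, w=5):
    """Exact Jaccard overlap of the two shingle sets (fine for small inputs)."""
    a, b = shingles(text_a, w), shingles(text_b, w)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def fingerprint(data, seed=0):
    """64-bit fingerprint of a string (assumption: seeded BLAKE2b, not Rabin)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def sketch(text, w=5, num_hashes=12):
    """For each seed, keep the smallest fingerprint over all shingles."""
    sh = shingles(text, w)
    if not sh:
        return []
    return [min(fingerprint(s, seed) for s in sh) for seed in range(num_hashes)]


def supershingles(sk, group=4):
    """Fingerprint consecutive groups of minima: fingerprints of fingerprints."""
    return [fingerprint(",".join(map(str, sk[i:i + group])), seed=i)
            for i in range(0, len(sk), group)]


def near_duplicate_pairs(docs, w=5, num_hashes=12, group=4, min_shared=1):
    """docs maps id -> text. Return id pairs sharing >= min_shared supershingles."""
    buckets = {}
    for doc_id, text in docs.items():
        for pos, ss in enumerate(supershingles(sketch(text, w, num_hashes), group)):
            buckets.setdefault((pos, ss), []).append(doc_id)
    shared = {}
    for ids in buckets.values():
        # Only documents that landed in the same bucket are ever compared.
        for pair in combinations(sorted(ids), 2):
            shared[pair] = shared.get(pair, 0) + 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)
```

The point of the bucketing step is that documents are only compared when they collide on a supershingle, so one avoids comparing all pairs in a collection of a million texts. Whether these parameter values give acceptable precision and recall on real data is exactly the kind of thing I would like to hear about.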
Regards,
Marco
On Wed, 22 Dec 2004, Ralf Steinberger wrote:
> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?