[Corpora-List] Q: How to identify duplicates in a large document collection

Gregor Erbach gor at acm.org
Wed Dec 22 19:59:45 UTC 2004


I know of two publications on the efficient detection of duplicates
and near-duplicates in large document collections:

Andrei Z. Broder et al.
Syntactic Clustering of the Web
http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/

US Patent 6658423
William Pugh and Monika H. Henzinger
Google Inc.
Detecting duplicate and near-duplicate files
http://v3.espacenet.com/textdoc?DB=EPODOC&IDX=US6658423&F=0
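
As a rough illustration of the idea behind the first reference (not code
taken from either publication): the Broder et al. approach views each
document as a set of overlapping word sequences ("shingles") and measures
the resemblance of two documents as the overlap of their shingle sets. To
avoid storing full shingle sets, each document can be reduced to a small
min-hash sketch. The function names, the shingle width of 4, the sketch
size of 64 and the use of SHA-1 below are all assumptions made for this
sketch, not values from the paper or the patent.

```python
import hashlib
import re


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences in text (w is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Very short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sketch(shingle_set, num_hashes=64):
    """Min-hash sketch: for each seeded hash, keep the smallest hash value seen."""
    result = []
    for seed in range(num_hashes):
        result.append(min(
            (hashlib.sha1(f"{seed}:{' '.join(s)}".encode("utf-8")).hexdigest()
             for s in shingle_set),
            default="",  # empty documents get an empty marker
        ))
    return result


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of sketch positions on which two documents agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    doc_a = "We need to find duplicate and near-duplicate documents in a large collection of texts"
    doc_b = "We need to find duplicate and near-duplicate documents in a big collection of texts"
    sa, sb = shingles(doc_a), shingles(doc_b)
    print("exact resemblance:    ", resemblance(sa, sb))
    print("estimated resemblance:", estimated_resemblance(sketch(sa), sketch(sb)))
```

Comparing every pair of sketches is still quadratic in the number of
documents, so for a collection of a million texts one would presumably
index documents by their sketch values and only compare documents that
share some of them.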

regards,

       Gregor

Ralf Steinberger wrote:

> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?
>
> The documents are in various formats (HTML, PDF, MS Word, plain text,
> ...), so we intend to first convert them to plain text. It is
> possible that the same text is present in the document collection in
> different formats.
>
> For smaller collections, we identify (near-)duplicates by applying
> hierarchical clustering techniques, but with this approach, we are
> limited to a few thousand documents.
>
> Any pointers are welcome. Thank you.
>
> Ralf Steinberger
> European Commission - Joint Research Centre
> http://www.jrc.it/langtech
>

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Gregor Erbach                     http://purl.org/net/gregor/
DFKI GmbH, Language Technology Lab    http://www.dfki.de/
Tel. +49 (681) 302-5354               mailto:erbach at dfki.de


