[Corpora-List] Q: How to identify duplicates in a large document collection

Marian Olteanu mou_softwin at yahoo.com
Thu Dec 23 05:58:10 UTC 2004


Sorry, I don't have time to read the recommended papers, but if I were in your shoes and were
looking for exact matches (detecting identical documents, not merely similar ones), I would compute
an MD5 hash for each document and then look for duplicate hash values. Whenever two documents
share a hash, I would compare the two documents directly. This algorithm is practically
O(n) + O(m*m) (m = number of duplicate documents in the collection of n documents), because the
probability of getting the same MD5 value for two different documents is very, very low (with
extremely high probability, you will encounter no more than one false positive in the MD5
comparison).
Because you have different document types, I would convert them all to a common format before
computing the MD5 value (e.g., extract the text, keep only letters and digits (ignoring
punctuation and spaces), and uppercase everything).
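
A minimal Python sketch of the idea above, assuming the documents have already been converted
to plain text and fit in a dictionary in memory. The function names (normalize,
find_exact_duplicates) and the document ids are made up for illustration; for a collection of
a million texts one would probably store only the hashes and re-read the few candidate
documents from disk for the final comparison.

```python
import hashlib
from collections import defaultdict


def normalize(text: str) -> str:
    """Keep only letters and digits, then uppercase everything."""
    return "".join(ch for ch in text if ch.isalnum()).upper()


def find_exact_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Return groups of document ids whose normalized text is identical.

    `docs` maps a (hypothetical) document id to its extracted plain text.
    """
    # Pass 1: bucket documents by the MD5 of their normalized text.
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        norm = normalize(text)
        digest = hashlib.md5(norm.encode("utf-8")).hexdigest()
        buckets[digest].append((doc_id, norm))

    # Pass 2: only inside buckets with more than one document, confirm the
    # match by comparing the normalized texts themselves, so a rare MD5
    # collision cannot produce a false positive.
    groups = []
    for candidates in buckets.values():
        if len(candidates) < 2:
            continue
        confirmed = defaultdict(list)
        for doc_id, norm in candidates:
            confirmed[norm].append(doc_id)
        groups.extend(ids for ids in confirmed.values() if len(ids) > 1)
    return groups


docs = {
    "a.html": "Hello, World! 42",
    "b.txt": "hello world 42",
    "c.pdf": "Something else entirely",
}
duplicate_groups = find_exact_duplicates(docs)
```

Here the first two texts differ only in case, punctuation and spacing, so they normalize to the
same string and end up in one group; the third stays alone. Note that this only catches exact
matches after normalization, not near-duplicates.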

--- Ralf Steinberger <ralf.steinberger at jrc.it> wrote:

> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?
>
> The documents are in various formats (html, PDF, MS-Word, plain text, ...)
> so that we intend to first convert them to plain text. It is possible that
> the same text is present in the document collection in different formats.
>
> For smaller collections, we identify (near)-duplicates by applying
> hierarchical clustering techniques, but with this approach, we are limited
> to a few thousand documents.
>
> Any pointers are welcome. Thank you.
>
> Ralf Steinberger
> European Commission - Joint Research Centre
> http://www.jrc.it/langtech
>
>


=====
Marian
http://www.utdallas.edu/~mgo031000/


		


