[Corpora-List] Q: How to identify duplicates in a large document collection

Bruce L. Lambert, Ph.D. lambertb at uic.edu
Wed Dec 22 17:46:21 UTC 2004


Ralf,

There are non-hierarchical clustering methods that might work. Look for
papers on the "scatter/gather" method. You might also try contacting the
people at Vivisimo.com who have experience clustering very large collections.

There is no quick way to do this. With about 1 million documents, at some
point you will have to consider roughly 500 billion pairwise similarities
(n(n-1)/2 for n = 1,000,000). Using an inverted index, you can avoid
computing the zero-valued similarities, but that will still leave a lot of
non-zero similarities to deal with. Good luck.
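
To make the inverted-index point concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the similarity measure
(Jaccard overlap of whitespace-separated tokens) and the function names
nonzero_similarities and pair_count are hypothetical choices, not something
prescribed in this thread. It walks each postings list so that only pairs
sharing at least one term are ever scored.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered document pairs in a collection of n documents."""
    return n * (n - 1) // 2


def nonzero_similarities(docs):
    """Return {(i, j): jaccard} for every pair (i < j) sharing at least one term.

    Assumption: documents are already plain text, and a "term" is simply a
    lowercased whitespace-separated token.
    """
    token_sets = [set(text.lower().split()) for text in docs]

    # Inverted index: term -> ids of the documents that contain it.
    # Ids are appended in increasing order, so each postings list is sorted.
    index = defaultdict(list)
    for doc_id, tokens in enumerate(token_sets):
        for term in tokens:
            index[term].append(doc_id)

    # Count shared terms per pair by walking each postings list.
    # Pairs with no term in common never show up here.
    shared = defaultdict(int)
    for postings in index.values():
        for i, j in combinations(postings, 2):
            shared[(i, j)] += 1

    # Jaccard similarity = |intersection| / |union|.
    sims = {}
    for (i, j), overlap in shared.items():
        union = len(token_sets[i]) + len(token_sets[j]) - overlap
        sims[(i, j)] = overlap / union
    return sims


if __name__ == "__main__":
    print(pair_count(1_000_000))
    print(nonzero_similarities(["the cat sat", "the cat sat", "a dog ran"]))
```

Note that very common terms produce long postings lists, and the pair
enumeration is quadratic in the length of each list, so in practice one
would probably drop stop words or very frequent terms first. This is the
"lot of non-zero similarities" problem mentioned above.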

-bruce

At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>We are facing the task of having to find duplicate and near-duplicate
>documents in a collection of about 1 million texts. Can anyone give us
>advice on how to approach this challenge?
>
>The documents are in various formats (html, PDF, MS-Word, plain text, ...)
>so that we intend to first convert them to plain text. It is possible that
>the same text is present in the document collection in different formats.
>
>For smaller collections, we identify (near)-duplicates by applying
>hierarchical clustering techniques, but with this approach, we are limited
>to a few thousand documents.
>
>Any pointers are welcome. Thank you.
>
>Ralf Steinberger
>European Commission - Joint Research Centre
>http://www.jrc.it/langtech
>