[Corpora-List] Q: How to identify duplicates in a large document collection

Wed Dec 22 16:45:38 UTC 2004

We need to find duplicate and near-duplicate documents in a collection
of about 1 million texts. Can anyone give us advice on how to approach
this?

The documents are in various formats (HTML, PDF, MS-Word, plain text, ...),
so we intend to convert them to plain text first. The same text may be
present in the collection in more than one format.

For smaller collections, we identify (near-)duplicates by applying
hierarchical clustering techniques, but this approach limits us to a
few thousand documents.
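
To illustrate the scaling limit, here is a minimal sketch of a pairwise
approach of this general kind. It is not our actual implementation: the
word-shingle representation, the Jaccard measure, the 0.5 threshold and
all function names are assumptions chosen for the example. It performs
single-link clustering cut at a fixed similarity threshold, which amounts
to finding connected components among sufficiently similar pairs.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_clusters(docs, threshold=0.5, k=3):
    """Group documents whose shingle sets are linked by similarity >= threshold.

    Returns (clusters, comparisons): clusters is a sorted list of lists of
    document indices, comparisons is the number of pairs examined.
    The threshold and shingle size are illustrative assumptions.
    """
    sets = [shingles(d, k) for d in docs]
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    comparisons = 0
    # Every pair is compared once: n * (n - 1) / 2 similarity computations.
    for i, j in combinations(range(len(docs)), 2):
        comparisons += 1
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values()), comparisons


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "completely different text about corpus linguistics",
]
clusters, comparisons = near_duplicate_clusters(docs)
print(clusters, comparisons)
```

The number of comparisons grows as n * (n - 1) / 2. For 1 million
documents that is 499,999,500,000 pairs, which is why an all-pairs
method of this kind stops being practical well before that size.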

Any pointers are welcome. Thank you.

Ralf Steinberger
European Commission - Joint Research Centre
http://www.jrc.it/langtech
