[Corpora-List] Q: How to identify duplicates in a large document collection

Wed Dec 22 18:00:56 UTC 2004

One way that I have used on a collection of about 50K documents is to
construct a suffix array + longest common prefix table. This will allow
you to identify every "exactly" repeated string in the corpus.
This can be done in (where n is total length of all documents) O(n log
n) easily or, or O(n) with a bit of work.
This assumes that the duplicated portions of the documents are exact.
If all you are doing is looking for exact duplicates then just compute a
good hash function for each file and sort them.

This is related to the problem of finding all maximal repeats in a
string of length n which can be done in linear time.

"Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology" by
Gusfield   (CUP) 1997 is the standard textbook for this stuff.

On Wed, 2004-12-22 at 16:45, Ralf Steinberger wrote:
> We are facing the task of having to find duplicate and near-duplicate
> documents in a collection of about 1 million texts. Can anyone give us
> advice on how to approach this challenge?
>
> The documents are in various formats (html, PDF, MS-Word, plain text,
> ...) so that we intend to first convert them to plain text. It is
> possible that the same text is present in the document collection in
> different formats.
>
> For smaller collections, we identify (near)-duplicates by applying
> hierarchical clustering techniques, but with this approach, we are
> limited to a few thousand documents.
>
> Any pointers are welcome. Thank you.
>
> Ralf Steinberger
> European Commission - Joint Research Centre
> http://www.jrc.it/langtech
>
--
Alexander Clark asc at aclark.demon.co.uk alexc at cs.rhul.ac.uk
http://www.cs.rhul.ac.uk/home/alexc/
Lecturer, Department of Computer Science,
Royal Holloway, University of London, Egham, Surrey TW20 0EX
Direct 01784 443430 Department 01784 434455 Fax 01784 439786