[Corpora-List] Q: How to identify duplicates in a large document collection

Shlomo Argamon argamon at iit.edu
Wed Dec 22 18:41:19 UTC 2004


The people in the IIT IR lab have a recent paper on the topic:
http://ir.iit.edu/publications/downloads/p171-chowdhury.pdf

You might contact the authors directly to see if any software is available.

	-Shlomo-

Mike Maxwell wrote:
> At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>
>> We are facing the task of having to find duplicate and near-duplicate
>> documents in a collection of about 1 million texts. Can anyone give
>> us advice on how to approach this challenge?
>
> We thought about this a while back, when it turned out we had paid for
> translation of several pairs of articles where the two members of each
> pair had different filenames.  We didn't implement a solution, but here
> are some thoughts:
>
> Do pairs of similar papers contain basically the same number of words? I
> would imagine they do, or you wouldn't be calling them "similar".
>
> I would then use file size as a heuristic, and only compare each article
> with a few of its neighbors in size.  That might reduce the complexity
> from N*N to kN, where 'k' is some (hopefully small) constant (and
> assuming that sorting them by size is not time-consuming, which it
> certainly shouldn't be).
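>
> A rough sketch of the size-neighbor idea in Python (untested; the
> word-overlap measure here is just a stand-in for whatever real
> comparison you end up using):
>
>     import os
>
>     def similarity(path_a, path_b):
>         # Crude word-set overlap (Jaccard); only a placeholder for a
>         # proper similarity measure.
>         with open(path_a) as fa, open(path_b) as fb:
>             a, b = set(fa.read().split()), set(fb.read().split())
>         return len(a & b) / len(a | b) if a | b else 1.0
>
>     def find_near_duplicates(paths, k=5, threshold=0.9):
>         # Sort once by file size so near-duplicates land close together,
>         # then compare each file only to its k nearest neighbors by
>         # size: roughly k*N comparisons instead of N*N.
>         docs = sorted(paths, key=os.path.getsize)
>         return [(d, o)
>                 for i, d in enumerate(docs)
>                 for o in docs[i + 1 : i + 1 + k]
>                 if similarity(d, o) >= threshold]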
>
> If there is variation in the way paragraphs are indicated (e.g. whether
> a blank line is inserted) and in inter-sentential spacing (one space
> character vs. two, maybe), then after converting the documents to plain
> text, you might find it necessary to take an additional step and
> convert them into some kind of canonical format, such as a tokenized
> form.  There are other obvious normalizations you might want to apply,
> too.
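>
> A minimal canonicalization pass along those lines (Python; the exact
> normalizations are assumptions, pick whatever suits your data):
>
>     import re
>
>     def canonicalize(text):
>         # Collapse every run of whitespace (blank lines between
>         # paragraphs, double spaces after sentences) to one space,
>         # then split punctuation off words so tokenization is uniform.
>         text = re.sub(r"\s+", " ", text.strip())
>         return " ".join(re.findall(r"\w+|[^\w\s]", text))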
>


