[Corpora-List] Q: How to identify duplicates in a large document collection

Mike Maxwell maxwell at ldc.upenn.edu
Wed Dec 22 18:15:34 UTC 2004


> At 10:45 AM 12/22/2004, Ralf Steinberger wrote:
>
>> We are facing the task of having to find duplicate and near-duplicate
>> documents in a collection of about 1 million texts. Can anyone give us
>> advice on how to approach this challenge?

We thought about this a while back, when it turned out we had paid for
translation of several pairs of articles where the members of the pair
each had different filenames.  We didn't implement a solution, but here
are some thoughts:

Do pairs of similar papers contain basically the same number of words?
I would imagine they do, or you wouldn't be calling them "similar".

I would then use file size as a heuristic, and only compare each article
with a few of its neighbors in size.  That might reduce the complexity
from N*N to kN, where 'k' is some (hopefully small) constant (and
assuming that sorting them by size is not time-consuming, which it
certainly shouldn't be).
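
Something along these lines, as a rough Python sketch (the file
location, the value of 'k', and the actual pairwise comparison are all
placeholders you'd fill in for your own collection):

    import os, glob

    K = 10  # how many size-neighbors to compare each file against (a guess)

    files = glob.glob('corpus/*.txt')    # hypothetical location of the texts
    files.sort(key=os.path.getsize)      # one cheap sort by file size

    candidates = []
    for i, f in enumerate(files):
        for g in files[i + 1 : i + 1 + K]:   # only look at nearby sizes
            candidates.append((f, g))        # pairs to compare properly later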

If there is variation in the way paragraphs are indicated (e.g. whether
a blank line is inserted) or in inter-sentential spacing (one space
character vs. two, maybe), then after converting the documents to plain
text, you might find it necessary to go one step further and convert
them into some kind of canonical format, such as a tokenized form.
There are other obvious normalizations you might want to apply, too.
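
Just as a sketch of what I mean (the particular normalizations here are
guesses at what would matter for your texts, not a recipe):

    import re

    def canonicalize(text):
        text = text.lower()                         # case-fold
        text = re.sub(r'([^\w\s])', r' \1 ', text)  # crude tokenization: split off punctuation
        return ' '.join(text.split())               # collapse all whitespace runs to single spaces

    # Two files that differ only in paragraphing or spacing now compare equal:
    #   canonicalize(open(a).read()) == canonicalize(open(b).read())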

--
	Mike Maxwell
	Linguistic Data Consortium
	maxwell at ldc.upenn.edu


