[Corpora-List] Q: How to identify duplicates in a large document collection

Scott Sadowsky lists at spanishtranslator.org
Thu Dec 23 08:01:22 UTC 2004


On 12/22/2004 12:45 PM, Ralf Steinberger wrote the following:

>We are facing the task of having to find duplicate and near-duplicate 
>documents in a collection of about 1 million texts. Can anyone give us 
>advice on how to approach this challenge?

I was facing the same problem a couple years ago, with a corpus of just 
about the same size.  The closest off-the-shelf solution I found, a program 
called ABC-View, wasn't ideal because it was designed for multimedia files 
and not text.  But on a lark I contacted the developer, Nils Haeck, and 
explained the problem to him.

After asking me a series of questions about what I needed, he sent me a 
new, custom-built plug-in for his program that implemented a fuzzy text 
comparison algorithm with user-configurable parameters, which he continued 
to refine according to my specifications.
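
For readers who want a feel for what a fuzzy text comparison with 
user-configurable parameters can look like, here is a minimal Python sketch.  
It is NOT the algorithm in the ABC-View plug-in, which this message does not 
describe; it swaps in a common, simple technique (Jaccard similarity over 
word shingles).  The function names and the parameters `k` (shingle length) 
and `threshold` (minimum similarity) are assumptions made up for 
illustration.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, k=3, threshold=0.8):
    """Return name pairs from docs (a dict of name -> text) whose similarity >= threshold.

    Hypothetical helper: compares every pair, so it only suits small batches.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]
```

Comparing every pair is quadratic in the number of documents, so a 
collection of a million texts would need some way of narrowing down the 
candidate pairs first (hashing or indexing the shingles, for example) 
rather than this brute-force loop.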

I have been using this plug-in ever since, and have eliminated several 
hundred thousand duplicate files --both plain text and HTML-- from a corpus 
that now has about 1.3 million documents.  And amazingly, it can process the 
entire collection in around a day on a clunky dual PIII 500MHz with 512 MB 
of RAM.

Besides being a top-notch programmer, Nils is also an extremely altruistic 
soul -- he not only created the plug-in for me without even mentioning 
compensation, but he also gave me a free copy of the program that runs it, 
as I need it for academic purposes.  I suggest that anyone who has need of 
such a tool contact him at <n.haeck at simdesign.nl>.

Cheers,
Scott


__________________________________________________________________
Scott Sadowsky · sadowsky at spanishtranslator.org
http://www.spanishtranslator.org
__________________________________________________________________
"Happiness is a signal that our brains use to motivate us to do certain 
things. And in the same way that our eye adapts to different levels of 
illumination, we're designed to kind of go back to the happiness set point. 
Our brains are not trying to be happy. Our brains are trying to regulate us".
  -- George Loewenstein


