[Corpora-List] Q: How to identify duplicates in a large document collection
Scott Sadowsky
lists at spanishtranslator.org
Thu Dec 23 08:01:22 UTC 2004
On 12/22/2004 12:45 PM, Ralf Steinberger wrote the following:
>We are facing the task of having to find duplicate and near-duplicate
>documents in a collection of about 1 million texts. Can anyone give us
>advice on how to approach this challenge?
I was facing the same problem a couple of years ago, with a corpus of just
about the same size. The closest off-the-shelf solution I found, a program
called ABC-View, wasn't ideal because it was designed for multimedia files
and not text. But on a lark I contacted the developer, Nils Haeck, and
explained the problem to him.
After asking me a series of questions about what I needed, he sent me a
new, custom-built plug-in for his program that implemented a fuzzy text
comparison algorithm with user-configurable parameters, which he continued
to refine according to my specifications.
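The plug-in's actual algorithm is not described in this message. Purely as an illustration of what a fuzzy text comparison with user-configurable parameters can look like, here is a minimal Python sketch using word shingles and Jaccard similarity. The function names, the shingle size `k` and the `threshold` value are all assumptions made for this example, not ABC-View's method.

```python
import re
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words are treated as one shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, k=3, threshold=0.8):
    """Return name pairs whose shingle sets have Jaccard similarity >= threshold.

    docs maps a document name to its text. k and threshold are the
    (hypothetical) user-configurable parameters of this sketch.
    """
    sets = {name: shingles(text, k) for name, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sorted(sets), 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "The quick brown fox jumps over the lazy dog!",
        "c": "completely different text about corpora and linguistics",
    }
    print(near_duplicates(docs))
```

Note that this sketch compares every pair of documents, which would not scale to a collection of a million texts; at that size some form of indexing or hashing of the shingles would be needed to avoid the all-pairs comparison.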
I have been using this plug-in ever since, and have eliminated several
hundred thousand duplicate files --both plain text and HTML-- from a corpus
that now has about 1.3 million documents. And amazingly, it can process the
entire collection in around a day on a clunky dual PIII 500MHz with 512 MB
of RAM.
Besides being a top-notch programmer, Nils is also an extremely altruistic
soul -- he not only created the plug-in for me without even mentioning
compensation, but he also gave me a free copy of the program that runs it,
as I need it for academic purposes. I suggest that anyone who has need of
such a tool contact him at <n.haeck at simdesign.nl>.
Cheers,
Scott
__________________________________________________________________
Scott Sadowsky · sadowsky at spanishtranslator.org
http://www.spanishtranslator.org
__________________________________________________________________
"Happiness is a signal that our brains use to motivate us to do certain
things. And in the same way that our eye adapts to different levels of
illumination, we're designed to kind of go back to the happiness set point.
Our brains are not trying to be happy. Our brains are trying to regulate us".
-- George Loewenstein