[Corpora-List] Q: How to identify duplicates in a large document collection

William Fletcher fletcher at usna.edu
Thu Dec 23 10:01:17 UTC 2004


Hi Ralf,

You have already received a lot of useful advice.  Let me add my own
experience to the pile of suggestions.

In my paper
"Making the Web More Useful as a Source for Linguistic Corpora"
http://kwicfinder.com/AAACL2002whf.pdf
I discuss some techniques for identifying Identical Documents, Virtually
Identical Documents (VIDs) and Highly Repetitive Documents (HRDs, e.g.
newsgroup  discussions like this).  The methods are unsophisticated but
effective.  Using fingerprinting of entire documents with the MD5
algorithm and a binary comparison tree, normalization and comparison of
11,201 document files took only 33 seconds on a 2.4 GB PC.

A single byte difference remaining in normalized (I stripped all
non-alphanumeric chars except spaces and converted all letters to lower
case) text will yield different fingerprints.  N-gram patterns
("shingles") help identify both VIDs and HRDs (i.e. document-external
vs. document-external repetition).  I was unaware of others' research in
this area and approached the problem in a manner that is not readily
scalable to huge document collections.  As I recall, low values of N
(3-5) sufficed for VIDs (= less processing), while a range of 8-10 was
more powerful for HRDs.  My colleague Hans van Halteren has found
patterns of 3-grams useful for author identification and plagiarism
detection.

Subsequently I discovered an excellent survey of relevant techniques:
"Mining the Web: Analysis of Hypertext and Semi Structured Data" by
Soumen Chakrabarti

It's clearly written and provides a wealth of useful information and
techniques, including a description of shingling.  Wish I had had this
book 10 years ago ;-) !

Good luck sorting it out,
Bill


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Sending an attachment?  See below.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

AssocProf William H. Fletcher
Language Studies Department
United States Naval Academy

On sabbatical AcYear 2004-2005 with the
Language and Speech research group at the
Radboud University of Nijmegen (NL).

Address
Groenestraat 105
6679EC Oosterhout gem. Nijmegen
Netherlands

Telephone
+31-24-3430960  [home]
+31-24-3615521  [office]
+31-24-3612907  [fax]
[6 hours later than US Eastern Time]

- - - - - - - - - - - - - - - - - - - - - - - - - - - -

Department
   http://usna.edu/LangStudy/
Phrases in English
   http://pie.usna.edu/
KWiCFinder
   http://kwicfinder.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Our mail server deletes messages with
  certain kinds of attachments without
  notifying the sender or recipient.

  If sending a .doc, .exe or .zip file, please
  rename it to delete the extension before
  sending and let me know in the body
  of the message what kind of file it is.
>>> "Ralf Steinberger" <ralf.steinberger at jrc.it> 12/22/04 11:45 AM >>>
We are facing the task of having to find duplicate and near-duplicate
documents in a collection of about 1 million texts. Can anyone give us
advice on how to approach this challenge?

The documents are in various formats (html, PDF, MS-Word, plain text,
...)
so that we intend to first convert them to plain text. It is possible
that
the same text is present in the document collection in different
formats.

For smaller collections, we identify (near)-duplicates by applying
hierarchical clustering techniques, but with this approach, we are
limited
to a few thousand documents.

Any pointers are welcome. Thank you.

Ralf Steinberger
European Commission - Joint Research Centre
http://www.jrc.it/langtech



More information about the Corpora mailing list