[Corpora-List] Q: How to identify duplicates in a large document collection

Marc Kupietz kupietz at ids-mannheim.de
Mon Jan 3 12:37:15 UTC 2005


Hi Ralf and William,

we use roughly the same techniques as those described by William to
identify near-duplicates (we call them duplicates, "versions", and
"variations") and garbage in newspaper articles (i.e. in DMS dumps from
newspaper publishers).

In addition, we use an (RKR) hash table to store and look up the
positions of n-grams (usually pentagrams). This makes the problem
computationally tractable for huge amounts of text (for the DMS dumps:
typically log-linear in time and linear in space).
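
As a rough illustration only (a Perl sketch, not our actual scripts,
and with Digest::MD5 standing in for the (RKR) rolling hash), the
indexing step amounts to something like this:

  use strict;
  use warnings;
  use Digest::MD5 qw(md5_hex);

  my $N = 5;     # pentagrams
  my %index;     # n-gram key -> list of [document id, token position]

  # Record the position of every pentagram of a tokenized document, so
  # that texts sharing pentagrams can later be found by hash lookup.
  sub index_document {
      my ($doc_id, $text) = @_;
      my @tokens = split ' ', lc $text;   # naive whitespace tokenization
      for my $i (0 .. $#tokens - $N + 1) {
          my $key = md5_hex(join ' ', @tokens[$i .. $i + $N - 1]);
          push @{ $index{$key} }, [$doc_id, $i];
      }
  }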

To calculate similarities (i.e. pentagram-based overlap ratios) across
about 1 million texts, our current Perl/C scripts take less than 2 hours
on a SunFire V880 (using 2 processors and ca. 10 GB of memory).
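
The overlap ratio for a candidate pair is then essentially the
proportion of shared pentagrams; a minimal sketch (the normalization
used in our scripts may differ):

  # Pentagram-based overlap ratio of two tokenized texts, normalized
  # here by the smaller pentagram set.
  sub overlap_ratio {
      my ($tokens_a, $tokens_b) = @_;
      my (%set_a, %set_b);
      $set_a{ join ' ', @{$tokens_a}[$_ .. $_ + 4] } = 1 for 0 .. @$tokens_a - 5;
      $set_b{ join ' ', @{$tokens_b}[$_ .. $_ + 4] } = 1 for 0 .. @$tokens_b - 5;
      my $shared  = grep { exists $set_b{$_} } keys %set_a;
      my $smaller = (keys %set_a) < (keys %set_b) ? scalar keys %set_a
                                                  : scalar keys %set_b;
      return $smaller ? $shared / $smaller : 0;
  }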

In a second stage we verify the hash-key-based similarity matrix
against the actual normalized texts and compute some additional
information, such as the overlap of the remaining tokens, the number of
coherent sequences and the number of transpositions. Depending on the
texts and the size of the hash table, this verification is not always
strictly necessary.
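
To give an idea of what "coherent sequences" means here: take the
matching pentagram positions of a text pair, sorted by their position
in the first text, and count the runs in which both positions advance
by one. Schematically (in Perl, not our actual code):

  # Each match is a pair [position in text A, position in text B];
  # a new sequence starts whenever both positions do not advance by one.
  sub coherent_sequences {
      my @matches = @_;          # e.g. ([0, 7], [1, 8], [2, 9], [10, 3])
      return 0 unless @matches;
      my $sequences = 1;
      for my $i (1 .. $#matches) {
          my ($a0, $b0) = @{ $matches[$i - 1] };
          my ($a1, $b1) = @{ $matches[$i] };
          $sequences++ unless $a1 == $a0 + 1 && $b1 == $b0 + 1;
      }
      return $sequences;
  }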

Based on the similarity matrix of the second stage we do some clustering
and classification, but that is another topic...

If anyone is interested in the flex/Perl/C sources and wants to
volunteer as a (beta) tester, please mail me...

Best regards and a happy new year!
Marc
 
On Thu, 2004-12-23 at 11:01 +0100, William Fletcher wrote:
> Hi Ralf,
> 
> You have already received a lot of useful advice.  Let me add my own 
> experience to the pile of suggestions. 
> 
> In my paper
> "Making the Web More Useful as a Source for Linguistic Corpora"
> http://kwicfinder.com/AAACL2002whf.pdf
> I discuss some techniques for identifying Identical Documents,
> Virtually Identical Documents (VIDs) and Highly Repetitive Documents
> (HRDs, e.g. newsgroup discussions like this).  The methods are
> unsophisticated but effective.  Using fingerprinting of entire
> documents with the MD5 algorithm and a binary comparison tree,
> normalization and comparison of 11,201 document files took only
> 33 seconds on a 2.4 GHz PC.
> 
> A single byte difference remaining in normalized text (I stripped all
> non-alphanumeric chars except spaces and converted all letters to
> lower case) will yield different fingerprints.  N-gram patterns
> ("shingles") help identify both VIDs and HRDs (i.e. document-external
> vs. document-internal repetition).  I was unaware of others' research
> in this area and approached the problem in a manner that is not
> readily scalable to huge document collections.  As I recall, low
> values of N (3-5) sufficed for VIDs (= less processing), while a range
> of 8-10 was more powerful for HRDs.  My colleague Hans van Halteren
> has found patterns of 3-grams useful for author identification and
> plagiarism detection.
> 
> Subsequently I discovered an excellent survey of relevant techniques:
> "Mining the Web: Analysis of Hypertext and Semi-Structured Data" by
> Soumen Chakrabarti.
> 
> It's clearly written and provides a wealth of useful information and
> techniques, including a description of shingling.  Wish I had had this
> book 10 years ago! ;-)
> 
> Good luck sorting it out, 
> Bill
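
For anyone who wants to try the whole-document fingerprinting Bill
describes, a minimal Perl sketch (my wording, not his; the ASCII-only
regex is of course too crude for German text) could look like this:

  use Digest::MD5 qw(md5_hex);

  # Lower-case, keep only alphanumerics and spaces, collapse runs of
  # spaces, and fingerprint the result; identical fingerprints mark
  # identical or virtually identical documents.
  sub fingerprint {
      my ($text) = @_;
      (my $norm = lc $text) =~ s/[^a-z0-9 ]+//g;
      $norm =~ s/ +/ /g;
      return md5_hex($norm);
  }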

-- 
Marc Kupietz                                       Tel. (+49)621-1581-409
Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
PO Box 101621, 68016 Mannheim, Germany        http://www.ids-mannheim.de/


