[Corpora-List] Q: How to identify duplicates in a large document collection

Normand Peladeau peladeau at simstat.com
Wed Jan 12 19:40:00 UTC 2005


Sorry if my suggestion is irrelevant or inadequate, but what about creating 
an inverted index of this document collection and using that index to 
retrieve the most similar documents?  I just implemented such an 
algorithm, and without much effort spent on speed optimization, I was 
able to compare the similarity of a document against a collection of 
100,000 documents indexed on about 3,000 index terms; retrieving the 
most similar documents took less than 0.4 seconds.  Increasing the 
spread of the index or the size of the collection would certainly 
increase the computing time, but it would probably still take no more 
than a minute or two to retrieve the duplicate documents in your 
collection.
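[The inverted-index lookup described above could be sketched roughly as 
follows.  This is a minimal illustration, not the poster's actual code: 
the use of term sets and Jaccard similarity as the scoring measure are 
assumptions for the sake of the example.]

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in set(terms):
            index[term].add(doc_id)
    return index

def most_similar(query_terms, index, docs, top_n=5):
    """Score only documents sharing at least one term with the query
    (Jaccard similarity over term sets) instead of scanning everything."""
    query = set(query_terms)
    candidates = set()
    for term in query:
        candidates |= index.get(term, set())
    scored = []
    for doc_id in candidates:
        terms = set(docs[doc_id])
        scored.append((len(query & terms) / len(query | terms), doc_id))
    return sorted(scored, reverse=True)[:top_n]

# Tiny demo: "b" is a duplicate of "a", "c" is unrelated.
docs = {
    "a": ["inverted", "index", "retrieval"],
    "b": ["inverted", "index", "retrieval"],
    "c": ["unrelated", "content"],
}
index = build_inverted_index(docs)
print(most_similar(docs["a"], index, docs))
```

Because only candidate documents sharing a term with the query are 
scored, the cost per query depends on the posting-list sizes rather 
than the full collection size, which is what keeps the lookup fast.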

Normand Peladeau
Provalis Research
www.simstat.com



At 1/3/2005 07:37 AM, Marc Kupietz wrote:
>Hi Ralf and William,
>
>we use about the same techniques as described by William to identify
>near duplicates (we call them duplicates, "versions", and "variations")
>and garbage in newspaper articles (i.e. DMS-dumps of newspaper
>publishers).
>
>In addition we use a (RKR-)hash table to store and lookup the position
>of n-grams (usually pentagrams). This makes the problem computationally
>tractable for huge amounts of text (for the DMS-dumps: typically
>log-linear in time and linear in space).
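[A hash table over pentagram positions of the kind Marc describes might 
look like the following sketch.  It uses plain Python dicts in place of 
the Perl/C implementation and an RKR rolling hash, and the particular 
overlap ratio shown is an assumption, not a detail from the post.]

```python
from collections import defaultdict

def ngram_positions(tokens, n=5):
    """Map each n-gram (here: pentagram) to the positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        table[tuple(tokens[i:i + n])].append(i)
    return table

def overlap_ratio(tokens_a, tokens_b, n=5):
    """Share of n-grams in text A that also occur somewhere in text B."""
    table_b = ngram_positions(tokens_b, n)
    ngrams_a = [tuple(tokens_a[i:i + n])
                for i in range(len(tokens_a) - n + 1)]
    if not ngrams_a:
        return 0.0
    hits = sum(1 for g in ngrams_a if g in table_b)
    return hits / len(ngrams_a)
```

Storing positions (not just presence) is what later allows computing 
things like coherent sequences and transpositions in the second stage.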
>
>To calculate similarities (i.e. pentagram-based overlap ratios) for
>about 1 million texts, our current Perl/C scripts take less than 2 hours
>on a SunFire V880 (using 2 processors and ca. 10 GB of memory).
>
>In a second stage we verify the hash-key-based similarity matrix
>against the actual normalized texts and compute some other information,
>like the overlap of remaining tokens, the number of coherent sequences,
>and the number of transpositions.  Depending on the texts and the size
>of the hash table, the verification itself may not even be necessary.
>
>Based on the similarity matrix of the second stage we do some clustering
>and classification, but that is another topic...
>
>If anyone is interested in the flex/Perl/C-sources and wants to
>volunteer as (beta-)tester, please mail me...
>
>Best regards and a happy new year!
>Marc
>
>On Thu, 2004-12-23 at 11:01 +0100, William Fletcher wrote:
> > Hi Ralf,
> >
> > You have already received a lot of useful advice.  Let me add my own
> > experience to the pile of suggestions.
> >
> > In my paper
> > "Making the Web More Useful as a Source for Linguistic Corpora"
> > http://kwicfinder.com/AAACL2002whf.pdf
 > I discuss some techniques for identifying Identical Documents,
 > Virtually Identical Documents (VIDs) and Highly Repetitive Documents
 > (HRDs, e.g. newsgroup discussions like this).  The methods are
 > unsophisticated but effective.  Using fingerprinting of entire
 > documents with the MD5 algorithm and a binary comparison tree,
 > normalization and comparison of 11,201 document files took only 33
 > seconds on a 2.4 GHz PC.
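[The normalize-then-fingerprint step Bill describes can be sketched as 
below.  The normalization mirrors his description (strip all 
non-alphanumeric characters except spaces, lowercase everything); a 
Python dict stands in for his binary comparison tree.]

```python
import hashlib
import re

def fingerprint(text):
    """Normalize then hash: identical normalized texts get equal digests."""
    normalized = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def find_exact_duplicates(named_texts):
    """Group document names by fingerprint; groups of size > 1 are
    identical (or virtually identical, after normalization) documents."""
    seen = {}
    for name, text in named_texts.items():
        seen.setdefault(fingerprint(text), []).append(name)
    return [names for names in seen.values() if len(names) > 1]

docs = {"x": "Hello, World!", "y": "hello world", "z": "different text"}
print(find_exact_duplicates(docs))  # → [['x', 'y']]
```

Hashing reduces each document to a 16-byte digest, so comparing a new 
document against the whole collection is a single dictionary lookup.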
> >
 > A single byte difference remaining in normalized text (I stripped all
 > non-alphanumeric chars except spaces and converted all letters to
 > lower case) will yield different fingerprints.  N-gram patterns
 > ("shingles") help identify both VIDs and HRDs (i.e. document-external
 > vs. document-internal repetition).  I was unaware of others' research
 > in this area and approached the problem in a manner that is not
 > readily scalable to huge document collections.  As I recall, low
 > values of N (3-5) sufficed for VIDs (= less processing), while a
 > range of 8-10 was more powerful for HRDs.  My colleague Hans van
 > Halteren has found patterns of 3-grams useful for author
 > identification and plagiarism detection.
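[Bill's distinction between document-external and document-internal 
repetition suggests a simple check for the internal (HRD) case, 
sketched below.  The measure itself — the fraction of shingles that 
are repeats — is an assumed illustration, not taken from his paper.]

```python
def internal_repetition(tokens, n=8):
    """Fraction of n-gram shingles that repeat within the same document.
    High values flag Highly Repetitive Documents (HRDs), such as
    newsgroup threads that quote earlier posts over and over."""
    shingles = [tuple(tokens[i:i + n])
                for i in range(len(tokens) - n + 1)]
    if not shingles:
        return 0.0
    return 1 - len(set(shingles)) / len(shingles)
```

The longer shingles (N around 8-10) that Bill found effective for HRDs 
make accidental repeats rare, so a high score really does indicate 
repeated passages rather than common short phrases.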
> >
> > Subsequently I discovered an excellent survey of relevant techniques:
> > "Mining the Web: Analysis of Hypertext and Semi Structured Data" by
> > Soumen Chakrabarti
> >
 > It's clearly written and provides a wealth of useful information and
 > techniques, including a description of shingling.  Wish I had had
 > this book 10 years ago ;-) !
> >
> > Good luck sorting it out,
> > Bill
>
>--
>Marc Kupietz                                       Tel. (+49)621-1581-409
>Institut für Deutsche Sprache, Dept. of Lexical Studies/Corpus Technology
>PO Box 101621, 68016 Mannheim, Germany        http://www.ids-mannheim.de/



More information about the Corpora mailing list