[Corpora-List] Q: How to identify duplicates in a large document collection

Tom Emerson tree at basistech.com
Wed Dec 22 18:15:32 UTC 2004


Rolf,

The work of Broder et al., published at WWW6, is a common root for many
duplicate document detection algorithms:

Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey
Zweig. 1997. "Syntactic Clustering of the Web". In Proceedings of the
6th World Wide Web Conference (WWW6).
http://decweb.ethz.ch/WWW6/Technical/Paper205/Paper205.html

There has been quite a bit of work following on from the shingle
fingerprinting proposed in that original paper: there are 113
citations listed in CiteSeer.
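
As a rough illustration of the shingling idea, here is a minimal Python
sketch, not the paper's implementation. It treats a document as the set of
its contiguous w-word sequences ("shingles") and measures resemblance as
the overlap between two such sets. The function names, the default w=4,
the sketch size s, whitespace tokenization, and the use of SHA-1 as the
fingerprint function are all assumptions made for the example.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    tokens = text.lower().split()  # assumption: naive whitespace tokenization
    if len(tokens) < w:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: |S(a) & S(b)| / |S(a) | S(b)| over shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer (SHA-1 is a stand-in hash here)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, w=4, s=100):
    """Fixed-size sketch: the s smallest shingle fingerprints of text."""
    return set(sorted({fingerprint(sh) for sh in shingles(text, w)})[:s])


def estimate(sk_a, sk_b, s=100):
    """Estimate resemblance from two sketches built with the same w and s."""
    smallest = set(sorted(sk_a | sk_b)[:s])  # s smallest of the union
    if not smallest:
        return 1.0
    return len(smallest & sk_a & sk_b) / len(smallest)


if __name__ == "__main__":
    d1 = "a rose is a rose is a rose"
    d2 = "a rose is a rose is a flower"
    print(resemblance(d1, d2))
    print(estimate(sketch(d1), sketch(d2)))
```

The point of the sketch is that only a small, fixed number of integers
need to be stored per document, so resemblance can be estimated across a
large collection without comparing full texts.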

We have been experimenting with various techniques for identifying
similar content on large, multilingual document collections harvested
from the Web, but are not ready to present any results.

    -tree

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


