[Corpora-List] Q: How to identify duplicates in a large document collection
Tom Emerson
tree at basistech.com
Wed Dec 22 18:15:32 UTC 2004
Rolf,
The work of Broder et al., published at WWW6, is a common root for many
duplicate document detection algorithms:
Broder, Andrei Z., Steven C. Glassman, Mark S. Manasse, and Geoffrey
Zweig. 1997. "Syntactic Clustering of the Web". In Proceedings of the
6th World Wide Web Conference (WWW6).
http://decweb.ethz.ch/WWW6/Technical/Paper205/Paper205.html
There has been quite a bit of work following on from the shingle
fingerprinting proposed in that original paper: there are 113
citations listed in CiteSeer.
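For anyone who has not seen the idea, here is a minimal sketch of word-level
shingling and the resemblance measure from that paper. It is an illustration
under stated assumptions, not the authors' implementation: the whitespace
tokenization, the shingle width w=4, the sketch size s=100, the use of SHA-1
in place of the Rabin fingerprints the paper uses, and all function names
are my own choices for the example.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text.

    Assumption: lowercase + whitespace tokenization; a real system would
    need language-aware tokenization.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Exact resemblance: |S(A) & S(B)| / |S(A) | S(B)| over shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer.

    Assumption: truncated SHA-1 stands in for a Rabin fingerprint.
    """
    data = " ".join(shingle).encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def sketch(text, w=4, s=100):
    """Keep only the s smallest fingerprints as a fixed-size document sketch."""
    return set(sorted({fingerprint(sh) for sh in shingles(text, w)})[:s])


def estimated_resemblance(a, b, w=4, s=100):
    """Estimate resemblance from the two sketches alone."""
    ska, skb = sketch(a, w, s), sketch(b, w, s)
    # The s smallest fingerprints of the union can be computed from the
    # two sketches without revisiting the documents.
    union_min = set(sorted(ska | skb)[:s])
    if not union_min:
        return 1.0
    return len(union_min & ska & skb) / len(union_min)
```

Calling resemblance(doc_a, doc_b) gives a value between 0 and 1, and pairs
above some threshold (which has to be tuned per collection) are treated as
near-duplicates. The sketch version matters at scale because only s
fingerprints per document need to be stored and compared.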
We have been experimenting with various techniques for identifying
similar content on large, multilingual document collections harvested
from the Web, but are not ready to present any results.
-tree
--
Tom Emerson Basis Technology Corp.
Software Architect http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"