<html>

<body>

Ralf,<br><br>

There are non-hierarchical clustering methods that might work. Look for

papers on the "scatter/gather" method. You might also try

contacting the people at Vivisimo.com who have experience clustering very

large collections.<br><br>

There is no quick way to do this. With about 1 million documents there are
roughly 500 billion document pairs (n(n-1)/2), and at some point you will
have to consider their pairwise similarities. Using an inverted index, you can
avoid computing the zero-valued similarities, but that will still leave a
lot of non-zero similarities to deal with. Good luck.<br><br>
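To make the inverted-index point concrete, here is a minimal Python sketch,
not a tuned implementation. It assumes whitespace tokenization and Jaccard
similarity over term sets. The function names (pair_count,
build_inverted_index, candidate_overlaps, jaccard_pairs) and the 0.8 threshold
are illustrative choices of mine, not taken from any particular system.<br><br>

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n):
    """Number of unordered document pairs among n documents."""
    return n * (n - 1) // 2


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it.

    docs is a dict of doc_id -> plain text (assumed already converted).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return index


def candidate_overlaps(index):
    """Count shared terms for every pair that shares at least one term.

    Pairs with no term in common never appear, so their (zero)
    similarity is never computed.
    """
    overlaps = defaultdict(int)
    for doc_ids in index.values():
        for a, b in combinations(sorted(doc_ids), 2):
            overlaps[(a, b)] += 1
    return overlaps


def jaccard_pairs(docs, threshold=0.8):
    """Return {(a, b): similarity} for pairs at or above the threshold."""
    index = build_inverted_index(docs)
    sizes = {d: len(set(t.lower().split())) for d, t in docs.items()}
    result = {}
    for (a, b), shared in candidate_overlaps(index).items():
        sim = shared / (sizes[a] + sizes[b] - shared)
        if sim >= threshold:
            result[(a, b)] = sim
    return result
```

Note that a term occurring in many documents produces a long posting list, and
the pairs within that list grow quadratically with its length, so in practice
very frequent terms would probably have to be dropped before this step.<br><br>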

-bruce<br><br>

At 10:45 AM 12/22/2004, Ralf Steinberger wrote:<br>

<blockquote type=cite class=cite cite=""><font face="arial" size=2>We are

facing the task of having to find duplicate and near-duplicate documents

in a collection of about 1 million texts. Can anyone give us advice on

how to approach this challenge? <br>

</font> <br>

<font face="arial" size=2>The documents are in various formats (HTML,
PDF, MS-Word, plain text, ...), so we intend to first convert them to
plain text. It is possible that the same text is present in the document
collection in different formats.<br>

</font> <br>

<font face="arial" size=2>For smaller collections, we identify

(near)-duplicates by applying hierarchical clustering techniques, but

with this approach, we are limited to a few thousand documents. <br>

</font> <br>

<font face="arial" size=2>Any pointers are welcome. Thank you.<br>

</font> <br>

<font face="arial" size=2>Ralf Steinberger<br>

European Commission - Joint Research Centre<br>

<a href="http://www.jrc.it/langtech">http://www.jrc.it/langtech</a><br>

</font> </blockquote></body>

</html>