[Corpora-List] Q: How to identify duplicates in a large document collection

Normand Peladeau peladeau at simstat.com
Wed Jan 5 13:00:35 UTC 2005


At 1/5/2005 05:59 AM, you wrote:
>Once the index construction is complete the lookup of
>(near) duplicates of a single document certainly takes almost no time.
>What actually takes 2 hours for 1.000.000 documents is the construction
>of the index and the computation of a complete similarity matrix (the
>output is certainly constrained by some minimum overlap ratio...) for
>all documents.

Sorry!  I thought you meant that it took 2 hours to find documents similar
to a single one once the index was created.  Indeed, creating the initial
index can take several hours.  Once it has been created, computing
similarities should be pretty fast.
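
To illustrate the split between the slow indexing step and the fast lookup, here is a minimal sketch in Python. It is not the method used by either of our tools; it assumes word 3-gram "shingles", an in-memory inverted index, and Jaccard overlap as the similarity measure. The names shingles, build_index, similar_to and the min_overlap threshold are all made up for the example.

```python
from collections import defaultdict

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def build_index(docs, k=3):
    """Slow step, done once: map each shingle to the ids of documents containing it."""
    index = defaultdict(set)
    doc_shingles = {}
    for doc_id, text in docs.items():
        doc_shingles[doc_id] = shingles(text, k)
        for s in doc_shingles[doc_id]:
            index[s].add(doc_id)
    return index, doc_shingles

def similar_to(doc_id, index, doc_shingles, min_overlap=0.5):
    """Fast step: only documents sharing at least one shingle are ever examined."""
    query = doc_shingles[doc_id]
    shared = defaultdict(int)
    for s in query:
        for other in index[s]:
            if other != doc_id:
                shared[other] += 1
    results = []
    for other, n in shared.items():
        # Jaccard overlap: shared shingles divided by the size of the union.
        union = len(query) + len(doc_shingles[other]) - n
        score = n / union
        if score >= min_overlap:
            results.append((other, score))
    return sorted(results, key=lambda r: -r[1])

# Toy collection (invented for the example).
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "an entirely different sentence about corpora and indexing",
}
index, doc_shingles = build_index(docs)
matches = similar_to("a", index, doc_shingles)
print(matches)
```

The lookup touches only the posting lists of the query document's own shingles, which is why it stays cheap once the index exists. A complete similarity matrix would repeat that lookup for every document, which is where the time goes.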

Normand Peladeau
Provalis Research
www.simstat.com


