[Corpora-List] Similarity between documents

Dom Widdows widdows at google.com
Mon Mar 23 12:39:18 UTC 2009


Semantic Vectors is one package (along with Lucene) that will do
document comparisons for you relatively easily.
See http://code.google.com/p/semanticvectors/
in particular http://code.google.com/p/semanticvectors/wiki/DocumentSearch

So it depends partly on what you mean by "available online": if it's
software you're after to build your own retrieval model, these are
good options. If you want an online document comparison service,
that's a different matter: for any but the simplest term-overlap
measure, any document comparison engine will depend to some extent on
the corpus the model is trained on. If you really just want the
basic term-overlap measure, I'd whip it up in Python.
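
For what it's worth, here is a minimal sketch of what "whipping it up"
might look like. It is only an illustration under some assumptions:
the function names (tokenize, cosine_similarity, jaccard_overlap) are
made up for this example, the tokenizer is a crude lowercase ASCII
split, and the vectors hold raw term counts with no idf weighting,
stemming or stop-word removal. It is not how Semantic Vectors or
Lucene compute their scores.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude tokenizer (an assumption): lowercase, keep runs of a-z and 0-9.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    # Cosine of the angle between raw term-count vectors.
    a, b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def jaccard_overlap(doc_a, doc_b):
    # Plain term overlap: shared distinct terms over all distinct terms.
    a, b = set(tokenize(doc_a)), set(tokenize(doc_b))
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Calling cosine_similarity(text_one, text_two) on two strings gives a
value between 0 and 1. Since nothing here is trained on a corpus, the
scores only reflect shared surface terms.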

Best wishes,
Dominic

On Sun, Mar 22, 2009 at 12:39 AM, Max CHEVALIER <Max.Chevalier at irit.fr> wrote:
>> Dear All
>>
>> *Does anyone know of a script available online to determine the similarity
>> (cosine angle) between documents?*
>>
>> Best regards
>>
>> J.R. Colt Clint
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>
> The cosine measure is well known in the IR field. It is integrated into Lucene
> and other search engines.
> You can find its definition in any book on IR (Modern
> Information Retrieval, Baeza-Yates & Ribeiro-Neto,
> http://people.ischool.berkeley.edu/~hearst/irbook/) or on Text Data Mining.
> It is really simple to implement.
>
> You can also find some relevant Java source at
> http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
> along with many other similarity measures. Note that I have not tested it.
>
> Best regards,
>
> Max CHEVALIER.
> ---------------------------------
> IRIT - Toulouse
> France
> http://www.irit.fr/~Max.Chevalier
>
