[Corpora-List] Looking for sentence similarity corpus

David Evans devans at cs.columbia.edu
Thu Sep 23 14:23:00 UTC 2004


> Dear Corpora members,
>
> I'm looking for a sentence-similarity corpus, i.e., a collection of
> sentences with manually assigned similarities to other sentences. Any
> ideas?
>
> Thanks in advance,
> Gilad
>
>
> --
> Informatics Institute  *  University of Amsterdam
> Kruislaan 403  *  1098 SJ Amsterdam  *  The Netherlands
> http://ilps.science.uva.nl * +31 20 525 6731/7561/7490 (fax)

Hello Gilad,

  We have a small corpus like that at Columbia that we used to train
SimFinder.  It is a set of 8 clusters of documents, with similar
sentences marked within the clusters.  The sentences were marked by two
judges, who then adjudicated their markup until they agreed on the
annotation.  Sentences are marked as either similar or not similar.
There are 34 articles in total across the 8 clusters.  The entire
training set has about 20,000 sentences, of which 480 are marked as
similar.
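
  To give a rough sense of how skewed the labels are, here is a small
Python sketch of the arithmetic.  It uses only the approximate counts
quoted above; the function names are hypothetical and say nothing about
the actual file format of the corpus or how SimFinder reads it.

```python
def positive_rate(n_similar: int, n_total: int) -> float:
    """Fraction of sentences labeled 'similar'."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_similar / n_total


def majority_baseline_accuracy(n_similar: int, n_total: int) -> float:
    """Accuracy of a trivial classifier that always answers 'not similar'."""
    return 1.0 - positive_rate(n_similar, n_total)


# Approximate figures from the description above (assumed, not exact counts).
N_TOTAL = 20_000
N_SIMILAR = 480

if __name__ == "__main__":
    print(f"similar:  {positive_rate(N_SIMILAR, N_TOTAL):.1%}")
    print(f"baseline: {majority_baseline_accuracy(N_SIMILAR, N_TOTAL):.1%}")
```

  With these figures only about 2.4% of the sentences are positive
examples, so plain accuracy is not very informative on its own;
precision and recall on the "similar" class are probably more useful.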

  Let me know if you have any questions; I believe we can release this
data, but I might have to look into it a bit.

Dave


