[Corpora-List] Gold standard for document similarity

Adam Kilgarriff adam at lexmasterclass.com
Wed Mar 5 10:51:08 UTC 2014


Ivelina,

the resources that Tony mentions are still available, at
ftp://ftp.itri.brighton.ac.uk/KSC

All the best

Adam


On 5 March 2014 10:13, Tony Russell-Rose <tgr at russellrose.com> wrote:

>  A few years ago Adam Kilgarriff & I wrote a paper evaluating various
> metrics for comparing corpora, and as part of that process created a set of
> 'known similarity corpora' which included various newspaper sources.  It's
> documented here:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
>
> Not sure we still have the data but it shouldn't be too difficult to
> recreate (feel free to contact me offline)
>
> HTH,
> Tony
> --
> -------------------------------
> Tony Russell-Rose PhD FBCS CITP
> Vice-chair, BCS IRSG
> Chair, IEHF HCI Group
> http://uxlabs.co.uk
> http://isquared.wordpress.com
>
>  On 04/03/2014 15:48, Ivelina Nikolova wrote:
>
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
*DANTE: a lexical database for English* <http://www.webdante.com>
========================================

