[Corpora-List] Gold standard for document similarity

Paul D Clough p.d.clough at sheffield.ac.uk
Wed Mar 5 14:23:52 UTC 2014


Hi, for research purposes there is the METER Corpus:
http://nlp.shef.ac.uk/meter/. Let me know if you want a copy. I helped
create the corpus to assess methods for detecting text reuse.

Paul.



On 5 March 2014 10:13, Tony Russell-Rose <tgr at russellrose.com> wrote:

>  A few years ago Adam Kilgarriff & I wrote a paper evaluating various
> metrics for comparing corpora, and as part of that process created a set of
> 'known similarity corpora' which included various newspaper sources.  It's
> documented here:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
>
> Not sure we still have the data but it shouldn't be too difficult to
> recreate (feel free to contact me offline).
>
> HTH,
> Tony
> --
> -------------------------------
> Tony Russell-Rose PhD FBCS CITP
> Vice-chair, BCS IRSG
> Chair, IEHF HCI Group
> http://uxlabs.co.uk
> http://isquared.wordpress.com
>
>  On 04/03/2014 15:48, Ivelina Nikolova wrote:
>
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>


-- 
-------------------------------------------------------------------------
Dr. Paul Clough
Reader in Information Retrieval

Information School
University of Sheffield
Regent Court
Sheffield S1 4DP
Tel: +44 (0)114 2222664
Fax: +44 (0)114 2780300
Email: p.d.clough at sheffield.ac.uk
Web: http://ir.shef.ac.uk/cloughie/
-------------------------------------------------------------------------

