[Corpora-List] Gold standard for document similarity
Paul D Clough
p.d.clough at sheffield.ac.uk
Wed Mar 5 14:23:52 UTC 2014
Hi, for research purposes there is the METER Corpus:
http://nlp.shef.ac.uk/meter/. Let me know if you want a copy. I helped
create the corpus to assess methods for detecting text reuse.
Paul.
On 5 March 2014 10:13, Tony Russell-Rose <tgr at russellrose.com> wrote:
> A few years ago Adam Kilgarriff & I wrote a paper evaluating various
> metrics for comparing corpora, and as part of that process created a set of
> 'known similarity corpora' which included various newspaper sources. It's
> documented here:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
>
> Not sure we still have the data, but it shouldn't be too difficult to
> recreate (feel free to contact me offline).
>
> HTH,
> Tony
> --
> -------------------------------
> Tony Russell-Rose PhD FBCS CITP
> Vice-chair, BCS IRSG
> Chair, IEHF HCI Group
> http://uxlabs.co.uk
> http://isquared.wordpress.com
>
> On 04/03/2014 15:48, Ivelina Nikolova wrote:
>
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
--
-------------------------------------------------------------------------
Dr. Paul Clough
Reader in Information Retrieval
Information School
University of Sheffield
Regent Court
Sheffield S1 4DP
Tel: +44 (0)114 2222664
Fax: +44 (0)114 2780300
Email: p.d.clough at sheffield.ac.uk
Web: http://ir.shef.ac.uk/cloughie/
-------------------------------------------------------------------------