[Corpora-List] Gold standard for document similarity
Tony Russell-Rose
tgr at russellrose.com
Wed Mar 5 10:13:55 UTC 2014
A few years ago Adam Kilgarriff & I wrote a paper evaluating various
metrics for comparing corpora, and as part of that process created a set
of 'known similarity corpora' which included various newspaper sources.
It's documented here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
Not sure we still have the data, but it shouldn't be too difficult to
recreate (feel free to contact me offline).
HTH,
Tony
--
-------------------------------
Tony Russell-Rose PhD FBCS CITP
Vice-chair, BCS IRSG
Chair, IEHF HCI Group
http://uxlabs.co.uk
http://isquared.wordpress.com
On 04/03/2014 15:48, Ivelina Nikolova wrote:
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>