[Corpora-List] Gold standard for document similarity

Tony Russell-Rose tgr at russellrose.com
Wed Mar 5 10:13:55 UTC 2014


A few years ago Adam Kilgarriff & I wrote a paper evaluating various 
metrics for comparing corpora, and as part of that process created a set 
of 'known similarity corpora' which included various newspaper sources.  
It's documented here:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716

Not sure we still have the data, but it shouldn't be too difficult to 
recreate (feel free to contact me offline).
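
For anyone recreating such a set, here is a minimal sketch of one way to 
do it. It assumes the construction in which two distinct sources A and B 
are mixed in graded proportions (100% A, then 90% A / 10% B, and so on), 
so that the relative similarity of any two corpora is known by 
construction. The function name build_ksc, its parameters, and the choice 
to draw documents without reuse are illustrative assumptions, not the 
code used for the paper.

```python
import random


def build_ksc(source_a, source_b, size, steps=10, seed=0):
    """Build steps + 1 corpora of `size` documents each (hypothetical helper).

    Corpus i takes roughly a fraction i / steps of its documents from
    source_b and the rest from source_a. No document is used twice, so
    similarity between corpora comes from the mix, not from shared texts.
    """
    # Number of B documents needed by each corpus, then the totals per source.
    b_counts = [round(size * i / steps) for i in range(steps + 1)]
    needed_b = sum(b_counts)
    needed_a = size * (steps + 1) - needed_b
    if len(source_a) < needed_a or len(source_b) < needed_b:
        raise ValueError("not enough documents to avoid reuse")

    # Shuffle copies of the inputs so the caller's lists are left untouched.
    rng = random.Random(seed)
    pool_a, pool_b = list(source_a), list(source_b)
    rng.shuffle(pool_a)
    rng.shuffle(pool_b)

    corpora = []
    for n_b in b_counts:
        n_a = size - n_b
        corpus = [pool_a.pop() for _ in range(n_a)]
        corpus += [pool_b.pop() for _ in range(n_b)]
        corpora.append(corpus)
    return corpora


# Placeholder document IDs standing in for articles from two newspapers.
paper_a = [f"a{i}" for i in range(60)]
paper_b = [f"b{i}" for i in range(60)]
ksc = build_ksc(paper_a, paper_b, size=10, steps=10)
```

The gold-standard judgements then follow from the proportions: corpus 0 
should be closer to corpus 1 than to corpus 2, and so on, so a metric can 
be scored on how many such orderings it gets right.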

HTH,
Tony
-- 
-------------------------------
Tony Russell-Rose PhD FBCS CITP
Vice-chair, BCS IRSG
Chair, IEHF HCI Group
http://uxlabs.co.uk
http://isquared.wordpress.com

On 04/03/2014 15:48, Ivelina Nikolova wrote:
> Dear corpora members,
>
> I am looking for a gold standard to train/evaluate document similarity 
> metrics.
> Can anyone suggest a suitable corpus for such purposes? I'm especially 
> interested in similarity between newspaper articles.
>
> Thanks in advance,
> Ivelina
>
