<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font face="Calibri">A few years ago Adam Kilgarriff &amp; I wrote a
paper evaluating various metrics for comparing corpora, and as
part of that process created a set of 'known similarity corpora'
which included various newspaper sources. It's documented here:<br>
<br>
<a class="moz-txt-link-freetext" href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716</a><br>
<br>
I'm not sure we still have the data, but it shouldn't be too
difficult to recreate (feel free to contact me offline).<br>
<br>
HTH,<br>
Tony</font><br>
<font face="Calibri">-- <br>
-------------------------------
<br>
Tony Russell-Rose PhD FBCS CITP
<br>
Vice-chair, BCS IRSG
<br>
Chair, IEHF HCI Group
<br>
<a href="http://uxlabs.co.uk">http://uxlabs.co.uk</a>
<br>
<a href="http://isquared.wordpress.com">http://isquared.wordpress.com</a>
<br>
<br>
</font>
<div class="moz-cite-prefix">On 04/03/2014 15:48, Ivelina Nikolova
wrote:<br>
</div>
<blockquote cite="mid:5315F5B8.6080805@lml.bas.bg" type="cite">Dear
corpora members,
<br>
<br>
I am looking for a gold standard to train/evaluate document
similarity metrics.
<br>
Can anyone suggest a suitable corpus for such purposes? I'm
especially interested in similarity between newspaper articles.
<br>
<br>
Thanks in advance,
<br>
Ivelina
<br>
<br>
</blockquote>
<br>
</body>
</html>