<div dir="ltr">Ivelina,<div><br></div><div>The resources that Tony mentions are still available at <a href="ftp://ftp.itri.brighton.ac.uk/KSC">ftp://ftp.itri.brighton.ac.uk/KSC</a></div><div><br></div><div>All the best</div>

<div><br></div><div>Adam</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 5 March 2014 10:13, Tony Russell-Rose <span dir="ltr">&lt;<a href="mailto:tgr@russellrose.com" target="_blank">tgr@russellrose.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div text="#000000" bgcolor="#FFFFFF">
    <font face="Calibri">A few years ago Adam Kilgarriff and I wrote a
      paper evaluating various metrics for comparing corpora, and as
      part of that process we created a set of 'known similarity corpora',
      which included various newspaper sources.  It's documented here:<br>
      <br>
      <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716</a><br>
      <br>
      Not sure we still have the data, but it shouldn't be too difficult
      to recreate (feel free to contact me offline).<br>
      <br>
      HTH,<br>
      Tony</font><span class="HOEnZb"><font color="#888888"><br>
    <font face="Calibri">-- <br>
      -------------------------------
      <br>
      Tony Russell-Rose PhD FBCS CITP
      <br>
      Vice-chair, BCS IRSG
      <br>
      Chair, IEHF HCI Group
      <br>
      <a href="http://uxlabs.co.uk" target="_blank">http://uxlabs.co.uk</a>
      <br>
      <a href="http://isquared.wordpress.com" target="_blank">http://isquared.wordpress.com</a>
      <br>
      <br>
    </font></font></span><div class="">
    <div>On 04/03/2014 15:48, Ivelina Nikolova
      wrote:<br>
    </div>
    <blockquote type="cite">Dear
      corpora members,
      <br>
      <br>
      I am looking for a gold standard to train/evaluate document
      similarity metrics.
      <br>
      Can anyone suggest a suitable corpus for such purposes? I'm
      especially interested in similarity between newspaper articles.
      <br>
      <br>
      Thanks in advance,
      <br>
      Ivelina
      <br>
      <br>
    </blockquote>
    <br>
  </div></div>

<br>_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br></blockquote></div><br><br clear="all"><div><br></div>-- <br>========================================<br><a href="http://www.kilgarriff.co.uk/" target="_blank">Adam Kilgarriff</a>                  <a href="mailto:adam@lexmasterclass.com" target="_blank">adam@lexmasterclass.com</a>                                             <br>

Director                                    <a href="http://www.sketchengine.co.uk/" target="_blank">Lexical Computing Ltd</a>                <br>Visiting Research Fellow                 <a href="http://leeds.ac.uk" target="_blank">University of Leeds</a>     <div>

<i><font color="#006600">Corpora for all</font></i> with <a href="http://www.sketchengine.co.uk" target="_blank">the Sketch Engine</a>                 </div><div>                        <i><a href="http://www.webdante.com" target="_blank">DANTE: <font color="#009900">a lexical database for English</font></a><font color="#009900"> </font>                 </i><div>

========================================</div></div>
</div>