<div dir="ltr">Hi, for research purposes there is the METER Corpus: <a href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a>. Let me know if you want a copy. I helped create the corpus to assess methods for detecting text reuse.<div>

<br></div><div style>Paul.</div><div style><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 5 March 2014 10:13, Tony Russell-Rose <span dir="ltr">&lt;<a href="mailto:tgr@russellrose.com" target="_blank">tgr@russellrose.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

  <div text="#000000" bgcolor="#FFFFFF">

    <font face="Calibri">A few years ago Adam Kilgarriff &amp; I wrote a

      paper evaluating various metrics for comparing corpora, and as

      part of that process created a set of 'known similarity corpora'

      which included various newspaper sources.  It's documented here:<br>

      <br>

      <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716" target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716</a><br>

      <br>

      I'm not sure we still have the data, but it shouldn't be too difficult

      to recreate (feel free to contact me offline).<br>

      <br>

      HTH,<br>

      Tony</font><br>

    <font face="Calibri">-- <br>

      -------------------------------

      <br>

      Tony Russell-Rose PhD FBCS CITP

      <br>

      Vice-chair, BCS IRSG

      <br>

      Chair, IEHF HCI Group

      <br>

      <a href="http://uxlabs.co.uk" target="_blank">http://uxlabs.co.uk</a>

      <br>

      <a href="http://isquared.wordpress.com" target="_blank">http://isquared.wordpress.com</a>

      <br>

      <br>

    </font>

    <div>On 04/03/2014 15:48, Ivelina Nikolova

      wrote:<br>

    </div>

    <blockquote type="cite">Dear

      corpora members,

      <br>

      <br>

      I am looking for a gold standard to train/evaluate document

      similarity metrics.

      <br>

      Can anyone suggest a suitable corpus for such purposes? I'm

      especially interested in similarity between newspaper articles.

      <br>

      <br>

      Thanks in advance,

      <br>

      Ivelina

      <br>

      <br>

    </blockquote>

    <br>

  </div>


<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div dir="ltr">-------------------------------------------------------------------------<br>Dr. Paul Clough<div><div>Reader in Information Retrieval<br><br>

Information School<br>University of Sheffield<br>Regent Court<br>Sheffield S1 4DP<br>Tel: +44 (0)114 2222664<br>Fax: +44 (0)114 2780300<br>Email: <a href="mailto:p.d.clough@sheffield.ac.uk" target="_blank">p.d.clough@sheffield.ac.uk</a><br>

Web: <a href="http://ir.shef.ac.uk/cloughie/" target="_blank">http://ir.shef.ac.uk/cloughie/</a><br>-------------------------------------------------------------------------<br><br><br><br></div></div></div>

</div>