<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <div class="moz-cite-prefix">Thanks to everyone who replied to my
      post! <br>
      I've compiled a summary of the answers which you can see below.<br>
      <br>
      General comment: Comparatively few similarity datasets above the
      sentence level exist. <br>
      <br>
      Resources:<br>
      <br>
      1. Lee & Pincombe's dataset:<br>
      Michael D. Lee, Brandon Pincombe, and Matthew<br>
      Welsh. 2005. An empirical evaluation of models of<br>
      text document similarity. In Proceedings of the 27th<br>
      Annual Conference of the Cognitive Science Society,<br>
      pages 1254--1259, Mahwah, NJ. Erlbaum.<br>
      <br>
      These are human-graded similarity ratings for pairs of paragraph-sized texts.
      Contact Michael Lee to get access to the dataset.<br>
      Contact: Michael D. Lee <a class="moz-txt-link-rfc2396E" href="mailto:mdlee@uci.edu">&lt;mdlee@uci.edu&gt;</a><br>
      <br>
      2. Linda Bawcom's observations:<br>
      1) Much of the similarity between newspaper articles arises because
      many newspapers get their news from the same agencies (in the
      United States, mostly Reuters and Associated Press).<br>
      2) She used a free online similarity program (one normally used for
      plagiarism detection) to find that similarity:<br>
<a class="moz-txt-link-freetext" href="http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/">http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/</a>.<br>
      She prepared a corpus on tsunami-related topics.<br>
      <br>
      Contact: Linda Bawcom <a class="moz-txt-link-rfc2396E" href="mailto:linda.bawcom@sbcglobal.net">&lt;linda.bawcom@sbcglobal.net&gt;</a><br>
      <br>
      3. SemEval Text Similarity task 2013<br>
 <a class="moz-txt-link-freetext" href="http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54">http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54</a><br>
      <br>
      - Core task - Given two sentences, s1 and s2, participants will
      quantifiably inform us on how similar s1 and s2 are, resulting in
      a similarity score.<br>
      - Pilot task on typed-similarity between semi-structured records.
      The types of similarity to be studied include location, author,
      people involved, time, events or actions, subject, description.<br>
      Data is available here:
<a class="moz-txt-link-freetext" href="http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56">http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56</a><br>
      <br>
      Contact: "Zesch, Torsten, Dr." <a class="moz-txt-link-rfc2396E" href="mailto:torsten.zesch@uni-due.de">&lt;torsten.zesch@uni-due.de&gt;</a><br>
      <br>
      4. 20 newsgroups<br>
       <a class="moz-txt-link-freetext" href="http://qwone.com/~jason/20Newsgroups/">http://qwone.com/~jason/20Newsgroups/</a><br>
      <br>
      The 20 Newsgroups data set is a collection of approximately 20,000
      newsgroup documents, partitioned (nearly) evenly across 20
      different newsgroups. To the best of my knowledge, it was
      originally collected by Ken Lang, probably for his Newsweeder:
      Learning to filter netnews paper, though he does not explicitly
      mention this collection. The 20 newsgroups collection has become a
      popular data set for experiments in text applications of machine
      learning techniques, such as text classification and text
      clustering.<br>
      <br>
      5. Reuters corpus<br>
<a class="moz-txt-link-freetext" href="http://about.reuters.com/researchandstandards/corpus/statistics/index.asp">http://about.reuters.com/researchandstandards/corpus/statistics/index.asp</a><br>
      <br>
      6. Adam Kilgarriff & Tony Russell-Rose wrote a paper
      evaluating various metrics for comparing corpora, and as part of
      that process created a set of 'known similarity corpora' which
      included various newspaper sources. It's documented here:<br>
      Measures for corpus similarity and homogeneity
      <a class="moz-txt-link-freetext" href="http://aclweb.org/anthology//W/W98/W98-1506.pdf">http://aclweb.org/anthology//W/W98/W98-1506.pdf</a><br>
      The documents are here: <a class="moz-txt-link-freetext" href="ftp://ftp.itri.brighton.ac.uk/KSC">ftp://ftp.itri.brighton.ac.uk/KSC</a><br>
      The METER Corpus is here: <a class="moz-txt-link-freetext" href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a><br>
      <br>
      Contacts: Tony Russell-Rose <a class="moz-txt-link-rfc2396E" href="mailto:tgr@russellrose.com">&lt;tgr@russellrose.com&gt;</a>, Paul D
      Clough <a class="moz-txt-link-rfc2396E" href="mailto:p.d.clough@sheffield.ac.uk">&lt;p.d.clough@sheffield.ac.uk&gt;</a><br>
      <br>
      7. JRC resources<br>
      - JEX corpus, which accompanies the JEX software
      (<a class="moz-txt-link-freetext" href="http://ipsc.jrc.ec.europa.eu/index.php?id=60">http://ipsc.jrc.ec.europa.eu/index.php?id=60</a>)<br>
      - The news clusters downloaded and annotated for multi-document
      summarisation (see at the bottom of the page
      <a class="moz-txt-link-freetext" href="http://ipsc.jrc.ec.europa.eu/?id=61">http://ipsc.jrc.ec.europa.eu/?id=61</a>). <br>
      - NewsExplorer news clusters (e.g.
      <a class="moz-txt-link-freetext" href="http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html">http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html</a>). <br>
      <br>
      Contact: Ralf Steinberger
      <a class="moz-txt-link-rfc2396E" href="mailto:ralf.steinberger@jrc.ec.europa.eu">&lt;ralf.steinberger@jrc.ec.europa.eu&gt;</a><br>
      <br>
      8. Recent publications on the topic<br>
      Daniel Baer's PhD Thesis:
      <a class="moz-txt-link-freetext" href="http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf">http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf</a><br>
      <br>
      <br>
      --Ivelina<br>
      <br>
      <pre class="moz-signature" cols="72">-- 
Ivelina Nikolova
PhD student in Computer Science
Linguistic Modelling Department
Institute of Information and Communication Technologies
Bulgarian Academy of Sciences</pre>
      <br>
      <br>
      <br>
      <br>
      On 03/05/2014 04:23 PM, Paul D Clough wrote:<br>
    </div>
    <blockquote
cite="mid:CAFixc5S-B3x8Q6Sm4OGk0L57HpmwKf4tbOkofDaT-Rc56Fa+xg@mail.gmail.com"
      type="cite">
      <div dir="ltr">Hi, for research purposes there is the METER
        Corpus: <a moz-do-not-send="true"
          href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a>.
        Let me know if you want a copy. I helped create the corpus to
        assess methods for detecting text reuse.
        <div>
          <br>
        </div>
        <div style="">Paul.</div>
        <div style=""><br>
        </div>
      </div>
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On 5 March 2014 10:13, Tony
          Russell-Rose <span dir="ltr">&lt;<a moz-do-not-send="true"
              href="mailto:tgr@russellrose.com" target="_blank">tgr@russellrose.com</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div text="#000000" bgcolor="#FFFFFF"> <font face="Calibri">A
                few years ago Adam Kilgarriff & I wrote a paper
                evaluating various metrics for comparing corpora, and as
                part of that process created a set of 'known similarity
                corpora' which included various newspaper sources.  It's
                documented here:<br>
                <br>
                <a moz-do-not-send="true"
                  href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716"
                  target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716</a><br>
                <br>
                Not sure we still have the data but it shouldn't be too
                difficult to recreate (feel free to contact me offline)<br>
                <br>
                HTH,<br>
                Tony</font><br>
              <font face="Calibri">-- <br>
                ------------------------------- <br>
                Tony Russell-Rose PhD FBCS CITP <br>
                Vice-chair, BCS IRSG <br>
                Chair, IEHF HCI Group <br>
                <a moz-do-not-send="true" href="http://uxlabs.co.uk"
                  target="_blank">http://uxlabs.co.uk</a> <br>
                <a moz-do-not-send="true"
                  href="http://isquared.wordpress.com" target="_blank">http://isquared.wordpress.com</a>
                <br>
                <br>
              </font>
              <div>On 04/03/2014 15:48, Ivelina Nikolova wrote:<br>
              </div>
              <blockquote type="cite">Dear corpora members, <br>
                <br>
                I am looking for a gold standard to train/evaluate
                document similarity metrics. <br>
                Can anyone suggest a suitable corpus for such purposes?
                I'm especially interested in similarity between
                newspaper articles. <br>
                <br>
                Thanks in advance, <br>
                Ivelina <br>
                <br>
              </blockquote>
              <br>
            </div>
            <br>
            _______________________________________________<br>
            UNSUBSCRIBE from this page: <a moz-do-not-send="true"
              href="http://mailman.uib.no/options/corpora"
              target="_blank">http://mailman.uib.no/options/corpora</a><br>
            Corpora mailing list<br>
            <a moz-do-not-send="true" href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
            <a moz-do-not-send="true"
              href="http://mailman.uib.no/listinfo/corpora"
              target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
            <br>
          </blockquote>
        </div>
        <br>
        <br clear="all">
        <div><br>
        </div>
        -- <br>
        <div dir="ltr">-------------------------------------------------------------------------<br>
          Dr. Paul Clough
          <div>
            <div>Reader in Information Retrieval<br>
              <br>
              Information School<br>
              University of Sheffield<br>
              Regent Court<br>
              Sheffield S1 4DP<br>
              Tel: +44 (0)114 2222664<br>
              Fax: +44 (0)114 2780300<br>
              Email: <a moz-do-not-send="true"
                href="mailto:p.d.clough@sheffield.ac.uk" target="_blank">p.d.clough@sheffield.ac.uk</a><br>
              Web: <a moz-do-not-send="true"
                href="http://ir.shef.ac.uk/cloughie/" target="_blank">http://ir.shef.ac.uk/cloughie/</a><br>
-------------------------------------------------------------------------<br>
              <br>
              <br>
              <br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    <br>
  </body>
</html>