<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">Thanks to everyone who replied to my

      post! <br>

      I've compiled a summary of the answers which you can see below.<br>

      <br>

      General comment: Comparatively few similarity datasets above the

      sentence level exist. <br>

      <br>

      Resources:<br>

      <br>

      1. Lee &amp; Pincombe's dataset:<br>

      Michael D. Lee, Brandon Pincombe, and Matthew<br>

      Welsh. 2005. An empirical evaluation of models of<br>

      text document similarity. In Proceedings of the 27th<br>

      Annual Conference of the Cognitive Science Society,<br>

      pages 1254--1259, Mahwah, NJ. Erlbaum.<br>

      <br>

      These are human-graded similarities between paragraph-sized texts.

      You need to contact Michael Lee to get access to the dataset.<br>

      Contact: Michael D. Lee <a class="moz-txt-link-rfc2396E" href="mailto:mdlee@uci.edu">&lt;mdlee@uci.edu&gt;</a><br>

      <br>

      2. Linda Bawcom's observations:<br>

      1) Much of the similarity arises because so many newspapers get

      their news from the same agencies (in the United States, mostly

      Reuters and the Associated Press).<br>

      2) She used a free online similarity program (one normally used

      for plagiarism detection) to find that similarity:<br>

<a class="moz-txt-link-freetext" href="http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/">http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/</a>.<br>

      She prepared a corpus on TSUNAMI-related topics.<br>

      <br>

      Contact: Linda Bawcom <a class="moz-txt-link-rfc2396E" href="mailto:linda.bawcom@sbcglobal.net">&lt;linda.bawcom@sbcglobal.net&gt;</a><br>

      <br>

      3. SemEval Text Similarity task 2013<br>

 <a class="moz-txt-link-freetext" href="http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54">http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54</a><br>

      <br>

      - Core task: given two sentences, s1 and s2, participants return a

      similarity score that quantifies how similar s1 and s2 are.<br>

      - Pilot task on typed-similarity between semi-structured records.

      The types of similarity to be studied include location, author,

      people involved, time, events or actions, subject, description.<br>

      Data is available here:

<a class="moz-txt-link-freetext" href="http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56">http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56</a><br>

      <br>

      Contact: "Zesch, Torsten, Dr." <a class="moz-txt-link-rfc2396E" href="mailto:torsten.zesch@uni-due.de">&lt;torsten.zesch@uni-due.de&gt;</a><br>

      <br>

      4. 20 Newsgroups<br>

       <a class="moz-txt-link-freetext" href="http://qwone.com/~jason/20Newsgroups/">http://qwone.com/~jason/20Newsgroups/</a><br>

      <br>

      The 20 Newsgroups data set is a collection of approximately 20,000

      newsgroup documents, partitioned (nearly) evenly across 20

      different newsgroups. To the best of my knowledge, it was

      originally collected by Ken Lang, probably for his paper

      "Newsweeder: Learning to filter netnews", though he does not

      explicitly mention this collection. The 20 Newsgroups collection has become a

      popular data set for experiments in text applications of machine

      learning techniques, such as text classification and text

      clustering.<br>

      <br>

      5. Reuters corpus<br>

<a class="moz-txt-link-freetext" href="http://about.reuters.com/researchandstandards/corpus/statistics/index.asp">http://about.reuters.com/researchandstandards/corpus/statistics/index.asp</a><br>

      <br>

      6. Adam Kilgarriff &amp; Tony Russell-Rose wrote a paper

      evaluating various metrics for comparing corpora, and as part of

      that process created a set of 'known similarity corpora' which

      included various newspaper sources. It's documented here:<br>

      Measures for corpus similarity and homogeneity

      <a class="moz-txt-link-freetext" href="http://aclweb.org/anthology//W/W98/W98-1506.pdf">http://aclweb.org/anthology//W/W98/W98-1506.pdf</a><br>

      The documents are here: <a class="moz-txt-link-freetext" href="ftp://ftp.itri.brighton.ac.uk/KSC">ftp://ftp.itri.brighton.ac.uk/KSC</a><br>

      The METER Corpus is here: <a class="moz-txt-link-freetext" href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a><br>

      <br>

      Contacts: Tony Russell-Rose <a class="moz-txt-link-rfc2396E" href="mailto:tgr@russellrose.com">&lt;tgr@russellrose.com&gt;</a>, Paul D

      Clough <a class="moz-txt-link-rfc2396E" href="mailto:p.d.clough@sheffield.ac.uk">&lt;p.d.clough@sheffield.ac.uk&gt;</a><br>

      <br>

      7. JRC resources<br>

      - JEX corpus, which accompanies the JEX software

      (<a class="moz-txt-link-freetext" href="http://ipsc.jrc.ec.europa.eu/index.php?id=60">http://ipsc.jrc.ec.europa.eu/index.php?id=60</a>)<br>

      - The news clusters downloaded and annotated for multi-document

      summarisation (see the bottom of the page

      <a class="moz-txt-link-freetext" href="http://ipsc.jrc.ec.europa.eu/?id=61">http://ipsc.jrc.ec.europa.eu/?id=61</a>). <br>

      - NewsExplorer news clusters (e.g.

      <a class="moz-txt-link-freetext" href="http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html">http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html</a>). <br>

      <br>

      Contact: Ralf Steinberger

      <a class="moz-txt-link-rfc2396E" href="mailto:ralf.steinberger@jrc.ec.europa.eu">&lt;ralf.steinberger@jrc.ec.europa.eu&gt;</a><br>

      <br>

      8. Recent publications on the topic<br>

      Daniel Baer's PhD Thesis:

      <a class="moz-txt-link-freetext" href="http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf">http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf</a><br>

      <br>

      <br>

      --Ivelina<br>

      <br>

      <pre class="moz-signature" cols="72">-- 

Ivelina Nikolova

PhD student in Computer Science

Linguistic Modelling Department

Institute of Information and Communication Technologies

Bulgarian Academy of Sciences</pre>

      <br>

      <br>

      <br>

      <br>

      On 03/05/2014 04:23 PM, Paul D Clough wrote:<br>

    </div>

    <blockquote

cite="mid:CAFixc5S-B3x8Q6Sm4OGk0L57HpmwKf4tbOkofDaT-Rc56Fa+xg@mail.gmail.com"

      type="cite">

      <div dir="ltr">Hi, for research purposes there is the METER

        Corpus: <a moz-do-not-send="true"

          href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a>.

        Let me know if you want a copy. I helped create the corpus to

        assess methods for detecting text reuse.

        <div>

          <br>

        </div>

        <div style="">Paul.</div>

        <div style=""><br>

        </div>

      </div>

      <div class="gmail_extra"><br>

        <br>

        <div class="gmail_quote">On 5 March 2014 10:13, Tony

          Russell-Rose <span dir="ltr">&lt;<a moz-do-not-send="true"

              href="mailto:tgr@russellrose.com" target="_blank">tgr@russellrose.com</a>&gt;</span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">

            <div text="#000000" bgcolor="#FFFFFF"> <font face="Calibri">A

                few years ago Adam Kilgarriff &amp; I wrote a paper

                evaluating various metrics for comparing corpora, and as

                part of that process created a set of 'known similarity

                corpora' which included various newspaper sources.  It's

                documented here:<br>

                <br>

                <a moz-do-not-send="true"

                  href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716"

                  target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716</a><br>

                <br>

                Not sure we still have the data but it shouldn't be too

                difficult to recreate (feel free to contact me offline)<br>

                <br>

                HTH,<br>

                Tony</font><br>

              <font face="Calibri">-- <br>

                ------------------------------- <br>

                Tony Russell-Rose PhD FBCS CITP <br>

                Vice-chair, BCS IRSG <br>

                Chair, IEHF HCI Group <br>

                <a moz-do-not-send="true" href="http://uxlabs.co.uk"

                  target="_blank">http://uxlabs.co.uk</a> <br>

                <a moz-do-not-send="true"

                  href="http://isquared.wordpress.com" target="_blank">http://isquared.wordpress.com</a>

                <br>

                <br>

              </font>

              <div>On 04/03/2014 15:48, Ivelina Nikolova wrote:<br>

              </div>

              <blockquote type="cite">Dear corpora members, <br>

                <br>

                I am looking for a gold standard to train/evaluate

                document similarity metrics. <br>

                Can anyone suggest a suitable corpus for such purposes?

                I'm especially interested in similarity between

                newspaper articles. <br>

                <br>

                Thanks in advance, <br>

                Ivelina <br>

                <br>

              </blockquote>

              <br>

            </div>

            <br>

            _______________________________________________<br>

            UNSUBSCRIBE from this page: <a moz-do-not-send="true"

              href="http://mailman.uib.no/options/corpora"

              target="_blank">http://mailman.uib.no/options/corpora</a><br>

            Corpora mailing list<br>

            <a moz-do-not-send="true" href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>

            <a moz-do-not-send="true"

              href="http://mailman.uib.no/listinfo/corpora"

              target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

            <br>

          </blockquote>

        </div>

        <br>

        <br clear="all">

        <div><br>

        </div>

        -- <br>

        <div dir="ltr">-------------------------------------------------------------------------<br>

          Dr. Paul Clough

          <div>

            <div>Reader in Information Retrieval<br>

              <br>

              Information School<br>

              University of Sheffield<br>

              Regent Court<br>

              Sheffield S1 4DP<br>

              Tel: +44 (0)114 2222664<br>

              Fax: +44 (0)114 2780300<br>

              Email: <a moz-do-not-send="true"

                href="mailto:p.d.clough@sheffield.ac.uk" target="_blank">p.d.clough@sheffield.ac.uk</a><br>

              Web: <a moz-do-not-send="true"

                href="http://ir.shef.ac.uk/cloughie/" target="_blank">http://ir.shef.ac.uk/cloughie/</a><br>

-------------------------------------------------------------------------<br>

              <br>

              <br>

              <br>

            </div>

          </div>

        </div>

      </div>

      <br>


    </blockquote>

    <br>

    <br>

  </body>

</html>