<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Thanks to everyone who replied to my
post! <br>
I've compiled a summary of the answers, which you can see below.<br>
<br>
General comment: Comparatively few similarity datasets exist above
the sentence level. <br>
<br>
Resources:<br>
<br>
1. Lee & Pincombe's dataset:<br>
Michael D. Lee, Brandon Pincombe, and Matthew<br>
Welsh. 2005. An empirical evaluation of models of<br>
text document similarity. In Proceedings of the 27th<br>
Annual Conference of the Cognitive Science Society,<br>
pages 1254--1259, Mahwah, NJ. Erlbaum.<br>
<br>
These are human-graded similarities between paragraph-sized texts.
You need to contact Michael Lee to get access to them.<br>
Contact: Michael D. Lee <a class="moz-txt-link-rfc2396E" href="mailto:mdlee@uci.edu">&lt;mdlee@uci.edu&gt;</a><br>
<br>
2. Linda Bawcom's observations:<br>
1) Much of the similarity arises because many newspapers get
their news from the same agencies (mostly Reuters and Associated
Press in the United States).<br>
2) She used a free online similarity program (one that is
normally used for plagiarism detection) to find that similarity:<br>
<a class="moz-txt-link-freetext" href="http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/">http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/</a>.<br>
She prepared a corpus on tsunami-related topics.<br>
<br>
Contact: Linda Bawcom <a class="moz-txt-link-rfc2396E" href="mailto:linda.bawcom@sbcglobal.net">&lt;linda.bawcom@sbcglobal.net&gt;</a><br>
<br>
3. SemEval Text Similarity task 2013<br>
<a class="moz-txt-link-freetext" href="http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54">http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54</a><br>
<br>
- Core task - Given two sentences, s1 and s2, participants will
quantifiably inform us on how similar s1 and s2 are, resulting in
a similarity score.<br>
- Pilot task on typed-similarity between semi-structured records.
The types of similarity to be studied include location, author,
people involved, time, events or actions, subject, description.<br>
Data is available here:
<a class="moz-txt-link-freetext" href="http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56">http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56</a><br>
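<br>
As a rough illustration of the kind of output the core task asks
for, here is a minimal Python sketch that scores two sentences with a
bag-of-words cosine similarity. This is only an assumed baseline for
illustration, not the method used by the task organisers or by any
participant; the function names and the two example sentences are
made up.<br>
<pre wrap="">
```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(s1, s2):
    """Cosine similarity between bag-of-words count vectors (0.0 to 1.0)."""
    v1, v2 = Counter(tokenize(s1)), Counter(tokenize(s2))
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * v2[word] for word, count in v1.items())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a sentence with no tokens is treated as dissimilar
    return dot / (norm1 * norm2)


# Hypothetical example sentences, not taken from the task data.
s1 = "A tsunami warning was issued for the coast."
s2 = "Officials issued a tsunami warning for coastal areas."
print(round(cosine_similarity(s1, s2), 3))
```
</pre>
The score lies between 0 and 1 and only reflects word overlap;
mapping it onto the task's gold-standard scale is left out of this
sketch.<br>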
<br>
Contact: "Zesch, Torsten, Dr." <a class="moz-txt-link-rfc2396E" href="mailto:torsten.zesch@uni-due.de">&lt;torsten.zesch@uni-due.de&gt;</a><br>
<br>
4. 20 newsgroups<br>
<a class="moz-txt-link-freetext" href="http://qwone.com/~jason/20Newsgroups/">http://qwone.com/~jason/20Newsgroups/</a><br>
<br>
From the dataset's home page: The 20 Newsgroups data set is a
collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups. To the best of my
knowledge, it was originally collected by Ken Lang, probably for his
"Newsweeder: Learning to filter netnews" paper, though he does not
explicitly mention this collection. The 20 newsgroups collection has
become a popular data set for experiments in text applications of
machine learning techniques, such as text classification and text
clustering.<br>
<br>
5. Reuters corpus<br>
<a class="moz-txt-link-freetext" href="http://about.reuters.com/researchandstandards/corpus/statistics/index.asp">http://about.reuters.com/researchandstandards/corpus/statistics/index.asp</a><br>
<br>
6. Adam Kilgarriff & Tony Russell-Rose wrote a paper
evaluating various metrics for comparing corpora, and as part of
that process created a set of 'known similarity corpora' which
included various newspaper sources. It's documented here:<br>
Measures for corpus similarity and homogeneity
<a class="moz-txt-link-freetext" href="http://aclweb.org/anthology//W/W98/W98-1506.pdf">http://aclweb.org/anthology//W/W98/W98-1506.pdf</a><br>
The documents are here: <a class="moz-txt-link-freetext" href="ftp://ftp.itri.brighton.ac.uk/KSC">ftp://ftp.itri.brighton.ac.uk/KSC</a><br>
The METER Corpus is here: <a class="moz-txt-link-freetext" href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a><br>
<br>
Contacts: Tony Russell-Rose <a class="moz-txt-link-rfc2396E" href="mailto:tgr@russellrose.com">&lt;tgr@russellrose.com&gt;</a>, Paul D
Clough <a class="moz-txt-link-rfc2396E" href="mailto:p.d.clough@sheffield.ac.uk">&lt;p.d.clough@sheffield.ac.uk&gt;</a><br>
<br>
7. JRC resources<br>
- JEX corpus, which accompanies the JEX software
(<a class="moz-txt-link-freetext" href="http://ipsc.jrc.ec.europa.eu/index.php?id=60">http://ipsc.jrc.ec.europa.eu/index.php?id=60</a>)<br>
- The news clusters downloaded and annotated for multi-document
summarisation (see the bottom of the page
<a class="moz-txt-link-freetext" href="http://ipsc.jrc.ec.europa.eu/?id=61">http://ipsc.jrc.ec.europa.eu/?id=61</a>). <br>
- NewsExplorer news clusters (e.g.
<a class="moz-txt-link-freetext" href="http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html">http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html</a>). <br>
<br>
Contact: Ralf Steinberger
<a class="moz-txt-link-rfc2396E" href="mailto:ralf.steinberger@jrc.ec.europa.eu">&lt;ralf.steinberger@jrc.ec.europa.eu&gt;</a><br>
<br>
8. Recent publications on the topic<br>
Daniel Baer's PhD Thesis:
<a class="moz-txt-link-freetext" href="http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf">http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf</a><br>
<br>
<br>
--Ivelina<br>
<br>
<pre class="moz-signature" cols="72">--
Ivelina Nikolova
PhD student in Computer Science
Linguistic Modelling Department
Institute of Information and Communication Technologies
Bulgarian Academy of Sciences</pre>
<br>
<br>
<br>
<br>
On 03/05/2014 04:23 PM, Paul D Clough wrote:<br>
</div>
<blockquote
cite="mid:CAFixc5S-B3x8Q6Sm4OGk0L57HpmwKf4tbOkofDaT-Rc56Fa+xg@mail.gmail.com"
type="cite">
<div dir="ltr">Hi, for research purposes there is the METER
Corpus: <a moz-do-not-send="true"
href="http://nlp.shef.ac.uk/meter/">http://nlp.shef.ac.uk/meter/</a>.
Let me know if you want a copy. I helped create the corpus to
assess methods for detecting text reuse.
<div>
<br>
</div>
<div style="">Paul.</div>
<div style=""><br>
</div>
</div>
<div class="gmail_extra"><br>
<br>
<div class="gmail_quote">On 5 March 2014 10:13, Tony
Russell-Rose <span dir="ltr">&lt;<a moz-do-not-send="true"
href="mailto:tgr@russellrose.com" target="_blank">tgr@russellrose.com</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div text="#000000" bgcolor="#FFFFFF"> <font face="Calibri">A
few years ago Adam Kilgarriff & I wrote a paper
evaluating various metrics for comparing corpora, and as
part of that process created a set of 'known similarity
corpora' which included various newspaper sources. It's
documented here:<br>
<br>
<a moz-do-not-send="true"
href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716"
target="_blank">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716</a><br>
<br>
Not sure we still have the data but it shouldn't be too
difficult to recreate (feel free to contact me offline)<br>
<br>
HTH,<br>
Tony</font><br>
<font face="Calibri">-- <br>
------------------------------- <br>
Tony Russell-Rose PhD FBCS CITP <br>
Vice-chair, BCS IRSG <br>
Chair, IEHF HCI Group <br>
<a moz-do-not-send="true" href="http://uxlabs.co.uk"
target="_blank">http://uxlabs.co.uk</a> <br>
<a moz-do-not-send="true"
href="http://isquared.wordpress.com" target="_blank">http://isquared.wordpress.com</a>
<br>
<br>
</font>
<div>On 04/03/2014 15:48, Ivelina Nikolova wrote:<br>
</div>
<blockquote type="cite">Dear corpora members, <br>
<br>
I am looking for a gold standard to train/evaluate
document similarity metrics. <br>
Can anyone suggest a suitable corpus for such purposes?
I'm especially interested in similarity between
newspaper articles. <br>
<br>
Thanks in advance, <br>
Ivelina <br>
<br>
</blockquote>
<br>
</div>
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a moz-do-not-send="true"
href="http://mailman.uib.no/options/corpora"
target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br>
<a moz-do-not-send="true" href="mailto:Corpora@uib.no">Corpora@uib.no</a><br>
<a moz-do-not-send="true"
href="http://mailman.uib.no/listinfo/corpora"
target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>
<br>
</blockquote>
</div>
<br>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">-------------------------------------------------------------------------<br>
Dr. Paul Clough
<div>
<div>Reader in Information Retrieval<br>
<br>
Information School<br>
University of Sheffield<br>
Regent Court<br>
Sheffield S1 4DP<br>
Tel: +44 (0)114 2222664<br>
Fax: +44 (0)114 2780300<br>
Email: <a moz-do-not-send="true"
href="mailto:p.d.clough@sheffield.ac.uk" target="_blank">p.d.clough@sheffield.ac.uk</a><br>
Web: <a moz-do-not-send="true"
href="http://ir.shef.ac.uk/cloughie/" target="_blank">http://ir.shef.ac.uk/cloughie/</a><br>
-------------------------------------------------------------------------<br>
<br>
<br>
<br>
</div>
</div>
</div>
</div>
<br>
</blockquote>
<br>
<br>
</body>
</html>