[Corpora-List] Summary - Thanks for the replies! -> Gold standard for document similarity

Fri Mar 7 13:52:56 UTC 2014

Thanks to everyone who replied to my post!
I've compiled a summary of the answers which you can see below.

General comment: Comparatively few similarity datasets above the 
sentence level exist.

Resources:

1. Lee & Pincombe's dataset:
Michael D. Lee, Brandon Pincombe, and Matthew
Welsh. 2005. An empirical evaluation of models of
text document similarity. In Proceedings of the 27th
Annual Conference of the Cognitive Science Society,
pages 1254--1259, Mahwah, NJ. Erlbaum.

These are human-graded similarities between paragraph-sized texts. You 
need to contact Michael Lee to get access to the dataset.
Contact: Michael D. Lee <mdlee at uci.edu>

2. Linda Bawcom's observations:
1) Much of the similarity is caused by so many newspapers using the same 
agencies (mostly Reuters and Associated Press in the United States) to 
get their news.
2) She used a free online similarity program (one normally used for 
plagiarism detection) to find that similarity:
http://plagiarism.bloomfieldmedia.com/z-wordpress/2012/03/05/new-release-wcopyfind-4-1-1/
She prepared a corpus on TSUNAMI-related topics.

Contact: Linda Bawcom <linda.bawcom at sbcglobal.net>

3. SemEval Text Similarity task 2013
  http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=47&Itemid=54

- Core task - Given two sentences, s1 and s2, participants will 
quantifiably inform us on how similar s1 and s2 are, resulting in a 
similarity score.
- Pilot task on typed-similarity between semi-structured records. The 
types of similarity to be studied include location, author, people 
involved, time, events or actions, subject, description.
Data is available here: 
http://ixa2.si.ehu.es/sts/index.php?option=com_content&view=article&id=49&Itemid=56

Contact: "Zesch, Torsten, Dr." <torsten.zesch at uni-due.de>
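
For anyone who wants to try a metric against data of this kind, here is a 
minimal sketch in Python. As far as I know, the STS tasks score systems by 
the Pearson correlation between system scores and the human gold scores 
(given on a 0-5 scale). The bag-of-words cosine baseline, the function 
names, and the toy pairs with their gold scores are all my own 
illustrative assumptions, not part of the task data or its official 
scorer.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine between two token-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy pairs; the gold scores are invented for illustration.
toy_pairs = [
    ("A man is playing a guitar.", "A man plays the guitar.", 4.6),
    ("A woman is slicing an onion.", "A man is riding a horse.", 0.4),
    ("The stock market fell sharply today.",
     "Shares dropped steeply on Monday.", 3.5),
    ("Two dogs run across a field.", "Two dogs are running in a field.", 4.4),
]


def evaluate(pairs):
    """Correlate the baseline's scores with the gold scores."""
    system = [cosine_similarity(bag_of_words(s1), bag_of_words(s2))
              for s1, s2, _ in pairs]
    gold = [g for _, _, g in pairs]
    return pearson(system, gold)


if __name__ == "__main__":
    print("Pearson r on toy pairs:", evaluate(toy_pairs))
```

The third toy pair shares no words, so the baseline gives it a score of 
zero whatever the gold score says. A better metric would be swapped in 
for cosine_similarity while the evaluation step stays the same.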

4. 20 newsgroups
  http://qwone.com/~jason/20Newsgroups/

The 20 Newsgroups data set is a collection of approximately 20,000 
newsgroup documents, partitioned (nearly) evenly across 20 different 
newsgroups. To the best of my knowledge, it was originally collected by 
Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, 
though he does not explicitly mention this collection. The 20 newsgroups 
collection has become a popular data set for experiments in text 
applications of machine learning techniques, such as text classification 
and text clustering.

5. Reuters corpus
http://about.reuters.com/researchandstandards/corpus/statistics/index.asp

6. Adam Kilgarriff & Tony Russell-Rose wrote a paper evaluating various 
metrics for comparing corpora, and as part of that process created a set 
of 'known similarity corpora' which included various newspaper sources. 
It's documented here:
Measures for corpus similarity and homogeneity 
http://aclweb.org/anthology//W/W98/W98-1506.pdf
The documents are here: ftp://ftp.itri.brighton.ac.uk/KSC
The METER Corpus is here: http://nlp.shef.ac.uk/meter/

Contacts: Tony Russell-Rose <tgr at russellrose.com>, Paul D Clough 
<p.d.clough at sheffield.ac.uk>

7. JRC resources
- JEX corpus, which accompanies the JEX software 
(http://ipsc.jrc.ec.europa.eu/index.php?id=60)
- The news clusters downloaded and annotated for multi-document 
summarisation (see at the bottom of the page 
http://ipsc.jrc.ec.europa.eu/?id=61).
- NewsExplorer news clusters (e.g. 
http://emm.newsexplorer.eu/NewsExplorer/home/en/latest.html).

Contact: Ralf Steinberger <ralf.steinberger at jrc.ec.europa.eu>

8. Recent publications on the topic
Daniel Baer's PhD Thesis: 
http://tuprints.ulb.tu-darmstadt.de/3641/1/Thesis_Screen.pdf

--Ivelina

-- 
Ivelina Nikolova
PhD student in Computer Science
Linguistic Modelling Department
Institute of Information and Communication Technologies
Bulgarian Academy of Sciences

On 03/05/2014 04:23 PM, Paul D Clough wrote:
> Hi, for research purposes there is the METER Corpus: 
> http://nlp.shef.ac.uk/meter/. Let me know if you want a copy. I helped 
> create the corpus to assess methods for detecting text reuse.
>
> Paul.
>
>
>
> On 5 March 2014 10:13, Tony Russell-Rose <tgr at russellrose.com> wrote:
>
>     A few years ago Adam Kilgarriff & I wrote a paper evaluating
>     various metrics for comparing corpora, and as part of that process
>     created a set of 'known similarity corpora' which included various
>     newspaper sources.  It's documented here:
>
>     http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.1716
>
>     Not sure we still have the data but it shouldn't be too difficult
>     to recreate (feel free to contact me offline)
>
>     HTH,
>     Tony
>     -- 
>     -------------------------------
>     Tony Russell-Rose PhD FBCS CITP
>     Vice-chair, BCS IRSG
>     Chair, IEHF HCI Group
>     http://uxlabs.co.uk
>     http://isquared.wordpress.com
>
>     On 04/03/2014 15:48, Ivelina Nikolova wrote:
>>     Dear corpora members,
>>
>>     I am looking for a gold standard to train/evaluate document
>>     similarity metrics.
>>     Can anyone suggest a suitable corpus for such purposes? I'm
>>     especially interested in similarity between newspaper articles.
>>
>>     Thanks in advance,
>>     Ivelina
>>
>
>
>     _______________________________________________
>     UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>     Corpora mailing list
>     Corpora at uib.no
>     http://mailman.uib.no/listinfo/corpora
>
>
>
>
> -- 
> -------------------------------------------------------------------------
> Dr. Paul Clough
> Reader in Information Retrieval
>
> Information School
> University of Sheffield
> Regent Court
> Sheffield S1 4DP
> Tel: +44 (0)114 2222664
> Fax: +44 (0)114 2780300
> Email: p.d.clough at sheffield.ac.uk
> Web: http://ir.shef.ac.uk/cloughie/
> -------------------------------------------------------------------------
