[Corpora-List] Corpora for plagiarism and co-derivatives detection

Tue Jan 12 15:52:48 UTC 2010

For those interested in the topic of plagiarism and co-derivatives detection,
below is some information about four corpora useful for evaluating methods
for automatic plagiarism detection, text re-use and co-derivatives
analysis. One of them contains cases of monolingual plagiarism, two
contain cross-lingual cases and the last one is a mix of monolingual
and cross-lingual cases (http://users.dsic.upv.es/grupos/nle/downloads.html).

MONOLINGUAL

Co-derivatives corpus. It is composed of more than 20,000 documents
from Wikipedia in German, English, Hindi and Spanish (around 5,000
documents per language). For each language, 500 of the most
frequently consulted Wikipedia articles were taken as pivots, and
ten revisions of each were downloaded; these revisions make up the
set of co-derivatives. Note that the articles written in the
different languages are unrelated.

CROSS-LINGUAL

CLiPA corpus. This is a toy corpus composed of five original text
fragments (written in English) which have been plagiarised into
Spanish and Italian by several people and machine translators.

CL-PL-09 corpus. The corpus includes texts in Dutch, English,
French, German, Polish, and Spanish. It is divided into two
sections: (i) comparable, with texts on the same topic extracted
from Wikipedia; and (ii) parallel, with texts extracted from the
JRC-Acquis corpus. In both cases, documents in all six languages are
included (either parallel or just on the same topic). The objective
is to cover two of the most common cross-language plagiarism
detection tasks: detection of exact translations and detection of
related documents.

MIXED

PAN-PC-09 corpus. This corpus contains documents in which artificial
plagiarism has been inserted automatically. It includes a low
percentage of cross-lingual plagiarism (from Spanish or German into
English). The corpus can be used to evaluate two kinds of plagiarism
detection tasks: (i) external plagiarism detection; and (ii)
intrinsic plagiarism detection. This corpus was used in the
1st International Competition on Plagiarism Detection
(http://www.webis.de/pan-09) and forms the training set of the
2010 competition (http://pan.webis.de/).

---

Paolo Rosso
Natural Language Engineering Lab.
Dpto. Sistemas Informáticos y Computación
Universidad Politécnica Valencia, Spain
URL: http://users.dsic.upv.es/~prosso
email: prosso [at] dsic.upv.es
fax: +34 963877359
tel: +34 963877007 ext. 73571

