[Corpora-List] Reference corpora for IE available
Alberto Lavelli
lavelli at itc.it
Wed Oct 20 16:31:08 UTC 2004
On the following page
http://nlp.shef.ac.uk/dot.kom/resources.html
two datasets for Information Extraction (IE) are made available:
- a new "corrected" version of the Seminar Announcements dataset
(more details below on the changes with respect to the version
available in the RISE repository [1]);
- the first publicly available version of the Corporate Acquisitions
dataset.
A link to the page above will soon be available in the RISE
repository. This effort is part of an activity on the
evaluation methodology for IE [2] carried out by Mary Elaine Califf
(Illinois State University), Fabio Ciravegna (University of
Sheffield), Dayne Freitag (Fair Isaac Corporation), Nick Kushmerick
(University College Dublin), and the Dot.Kom group at ITC-irst (i.e.,
Claudio Giuliano, Alberto Lavelli and Lorenza Romano). The work
was carried out within the Dot.Kom EU project
(http://www.dot-kom.org).
Seminar Announcements
Main changes with respect to version v1.0 (i.e., the RISE version):
- obvious annotation errors were corrected
- the filenames were made compatible with Windows. Under some
versions of Windows, certain characters (e.g., ":") in filenames
cause problems, so we replaced ":" with "_".
- all <sentence> and <paragraph> tags were stripped from the corpus
- the documents were made XML-compliant
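For readers who want to apply the same filename and tag changes to a local copy of the v1.0 files, the two mechanical steps above can be sketched as follows. This is only an illustration, not the script used to build the corrected corpus: the function names, the sample filename, and the assumption that the tags carry no attributes are all hypothetical.

```python
import re

# Assumption: the tags appear as <sentence>, </sentence>, <paragraph>,
# </paragraph>, without attributes. The actual markup may differ.
STRUCTURE_TAG_RE = re.compile(r"</?(?:sentence|paragraph)>")


def windows_safe_name(name: str) -> str:
    """Replace ':' with '_' so the filename is accepted by Windows."""
    return name.replace(":", "_")


def strip_structure_tags(text: str) -> str:
    """Remove <sentence> and <paragraph> tags, keeping the enclosed text."""
    return STRUCTURE_TAG_RE.sub("", text)


if __name__ == "__main__":
    # Hypothetical filename and document fragment, for illustration only.
    print(windows_safe_name("seminar-announcement-1234:0"))
    print(strip_structure_tags(
        "<paragraph><sentence>Talk at <stime>3:30</stime>.</sentence></paragraph>"
    ))
```

Other tags, such as the slot annotations themselves, are left untouched by the regular expression, so only the structural markup is removed.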
Corporate Acquisitions
The documents are XML-compliant. Please note that this dataset was
not available in the RISE repository.
References
[1] RISE: A Repository of Online Information Sources Used in
Information Extraction Tasks. Information Sciences Institute / USC,
1998.
[2] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne
Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano. IE
evaluation: Criticisms and recommendations. In Proceedings of the
AAAI-04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004),
San Jose, California, 26 July 2004.