[Corpora-List] Reference corpora for IE available

Alberto Lavelli lavelli at itc.it
Wed Oct 20 16:31:08 UTC 2004


On the following page

 http://nlp.shef.ac.uk/dot.kom/resources.html

two datasets for Information Extraction (IE) are made available:

 - a new "corrected" version of the Seminar Announcements dataset
   (see below for details of the changes with respect to the version
   available in the RISE repository [1]);

 - the first publicly available version of the Corporate Acquisitions
   dataset.

A link to the page above will soon be available in the RISE
repository.  This effort is part of ongoing work on the evaluation
methodology for IE [2] carried out by Mary Elaine Califf (Illinois
State University), Fabio Ciravegna (University of Sheffield), Dayne
Freitag (Fair Isaac Corporation), Nick Kushmerick (University College
Dublin), and the Dot.Kom group at ITC-irst (i.e., Claudio Giuliano,
Alberto Lavelli and Lorenza Romano).  The work has been carried out
within the Dot.Kom EU project (http://www.dot-kom.org).


Seminar Announcements

Main changes with respect to version v1.0 (i.e., the RISE version):

 - obvious annotation errors were corrected

 - file names were made compatible with Windows.  Some versions of
   Windows have problems with certain characters (e.g., ":") in
   filenames, so we replaced ":" with "_".

 - all <sentence> and <paragraph> tags were stripped from the corpus

 - the documents were made XML-compliant
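
For readers who still hold the RISE version, the filename and tag
changes listed above can be approximated along the following lines.
This is only a sketch, not the script used to build the corrected
release: the function names are hypothetical, and it assumes that the
<sentence> and <paragraph> tags carry no attributes.

```python
import re

# Assumption: the structural tags appear as plain <sentence>, </sentence>,
# <paragraph> and </paragraph>, without attributes.
_STRUCTURE_TAG_RE = re.compile(r"</?(?:sentence|paragraph)>")


def windows_safe_name(filename: str) -> str:
    """Replace ':' with '_' so the name is accepted by Windows."""
    return filename.replace(":", "_")


def strip_structure_tags(text: str) -> str:
    """Remove <sentence> and <paragraph> tags, keeping all other markup."""
    return _STRUCTURE_TAG_RE.sub("", text)
```

Only the two structural tags are removed; any other annotation in the
text is left untouched.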


Corporate Acquisitions

The documents are XML-compliant.  Please note that this dataset was
not available in the RISE repository.
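
A quick way to check that a document parses as XML is sketched below
with the Python standard library.  It assumes that each file is a
standalone XML document with a single root element, which is not
stated above; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```

This checks well-formedness only; it does not validate the documents
against any schema or DTD.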


References

[1] RISE.  A Repository of Online Information Sources Used in
Information Extraction Tasks.  Information Sciences Institute / USC,
1998.

[2] Alberto Lavelli, Mary Elaine Califf, Fabio Ciravegna, Dayne
Freitag, Claudio Giuliano, Nick Kushmerick, Lorenza Romano.  IE
evaluation: Criticisms and recommendations.  In Proceedings of the
AAAI-04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004),
San Jose, California, 26 July 2004.


