[Corpora-List] New from LDC
Linguistic Data Consortium
ldc at ldc.upenn.edu
Tue Feb 1 21:39:54 UTC 2011
New publications:

- ACE 2005 English SpatialML Annotations Version 2
- SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in
  Multiple Languages
------------------------------------------------------------------------
(1) ACE 2005 English SpatialML Annotations Version 2
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T02>
was developed by researchers at The MITRE Corporation
<http://www.mitre.org/> and applies SpatialML tags to the English
newswire and broadcast training data annotated for entities, relations
and events in ACE 2005 Multilingual Training Corpus LDC2006T06
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T06>. This
second version eliminates a number of annotation inconsistencies and
errors identified in ACE 2005 English SpatialML Annotations LDC2008T03
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T03>. In
addition, the SpatialML annotation schema has been updated from version
2.0 to version 3.0.1; the revised annotation guidelines are included in
this release.
The ACE (Automatic Content Extraction) program focused on developing
automatic content extraction technology to support automatic processing
of human language in text form, specifically the extraction of entities,
values, temporal expressions, relations and events. SpatialML is a mark-up
language for representing spatial expressions in natural language
documents. It is intended to emulate earlier progress on time expressions,
such as TIMEX2 <http://fofoca.mitre.org/>, TimeML
<http://timeml.org/site/index.html>, and the 2005 ACE guidelines
<http://www.itl.nist.gov/iad/mig/tests/ace/2005/doc/ace05eval_official_results_20060110.html>.
SpatialML includes syntax for marking up PLACEs mentioned in text and
for linking them to data from gazetteers and other databases. LINKs are
used to express relations between places, and RLINKs to capture
trajectories for relative locations. To the extent possible, SpatialML
leverages ISO and other standards with the goal of making the scheme
compatible with existing and future corpora. SpatialML goes beyond these
schemes, however, in terms of providing a richer markup for natural
language that includes semantic features and relationships that allow
mapping to existing resources such as gazetteers. Such markup can be
useful for disambiguation, integration with mapping services and spatial
reasoning.
This corpus contains 210,065 total words and 17,821 unique words. Counts
of unique words can be found in doc/ldc_wordcount.csv, which includes all
words that are not part of XML markup (i.e., excluding tag names,
attribute names and attribute values). Unique words are counted by
comparing case-insensitive forms with leading and trailing punctuation
stripped off. "Words" consisting solely of punctuation are discarded.
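The counting rule described above can be sketched in a few lines of
Python. This is only an illustration, not the script LDC used: it assumes
the text has already been separated from its XML markup, that words are
split on whitespace, and that "punctuation" means the ASCII set in
Python's string.punctuation. The function names and the sample sentence
are made up for the example.

```python
import string
from collections import Counter

def normalize(token):
    """Lowercase a token and strip leading and trailing ASCII punctuation."""
    return token.lower().strip(string.punctuation)

def count_words(text):
    """Count normalized words in plain text (assumed to be free of XML markup)."""
    counts = Counter()
    for token in text.split():  # assumption: whitespace tokenization
        word = normalize(token)
        if word:  # tokens made only of punctuation normalize to "" and are dropped
            counts[word] += 1
    return counts

# Hypothetical sample text, not taken from the corpus.
counts = count_words('The cat, the "Cat" -- and THE dog.')
total_words = sum(counts.values())
unique_words = len(counts)
```

Here "The", "the" and "THE" collapse into one unique word, and the bare
"--" is discarded before counting.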
The principal change in the annotation schema is that "PATH" has been
generalized to "RLINK" for relative link. At the top level, there is now
a version attribute on the root SpatialML tag to capture which version
of SpatialML was used. A number of smaller changes have been made to the
annotation specification; these are listed in Section 2 of the updated
guidelines.
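As a rough illustration of the new version attribute, the Python sketch
below reads it from the root element of a document. The sample markup is
invented for this example: it assumes the root element is named SpatialML
and that the attribute is spelled "version", as the description above
suggests, and it omits every other attribute. Consult the guidelines
included in the release for the actual syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified document; real files carry more attributes.
sample = '<SpatialML version="3.0.1"><PLACE>Philadelphia</PLACE></SpatialML>'

def schema_version(xml_text):
    """Return the root element's version attribute, or None if it is absent."""
    root = ET.fromstring(xml_text)
    return root.get("version")

version = schema_version(sample)
# A document without the attribute (for example, one written before it existed).
legacy = schema_version("<SpatialML><PLACE>Philadelphia</PLACE></SpatialML>")
```

A caller could use a None result to flag files that predate the attribute.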
(2) SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in
Multiple Languages
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T01>
is a subset of OntoNotes Release 2.0 LDC2008T04
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04>
used in SemEval-2010 Task 1 <http://stel.ub.edu/semeval2010-coref/home>,
Coreference Resolution in Multiple Languages. OntoNotes Release 2.0
consists of roughly 500,000 words of English broadcast and newswire data
annotated with structural information (syntax and predicate argument
structure) and shallow semantics (word sense linked to an ontology and
coreference). This SemEval-2010 Task 1 release contains approximately
120,000 words extracted from the OntoNotes corpus and formatted for the
SemEval task.
SemEval (Semantic Evaluation) is an ongoing series of evaluations of
computational semantic analysis systems. The goal of SemEval-2010 Task 1
was to evaluate and compare automatic coreference resolution systems for
six languages (Catalan, Dutch, English, German, Italian and Spanish) in
four evaluation settings using four metrics. Further information about
Task 1 can be found on the task description website
<http://stel.ub.edu/semeval2010-coref/node/7>.
The data is divided into three sets: the development set, which contains
39 documents, 741 sentences and 17,044 tokens; the training set, which
contains 229 documents, 3,648 sentences and 79,060 tokens; and the test
set, which contains 85 documents, 1,141 sentences and 24,206 tokens. The
complete material for training systems is the sum of the development and
training sets.
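The figures above can be combined with a short Python sketch. The counts
are copied from this announcement; the dictionary layout and the helper
name are only illustrative.

```python
# Counts as stated in the announcement.
splits = {
    "development": {"documents": 39, "sentences": 741, "tokens": 17044},
    "training": {"documents": 229, "sentences": 3648, "tokens": 79060},
    "test": {"documents": 85, "sentences": 1141, "tokens": 24206},
}

def combine(names):
    """Sum the document, sentence and token counts over the named sets."""
    keys = ("documents", "sentences", "tokens")
    return {key: sum(splits[name][key] for name in names) for key in keys}

# Complete training material: development plus training sets.
train_material = combine(["development", "training"])
# All three sets together.
all_data = combine(splits)
```

Development plus training gives 268 documents, 4,389 sentences and 96,104
tokens. All three sets together hold 120,310 tokens, which matches the
"approximately 120,000 words" quoted for the release.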
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in
Multiple Languages is distributed via web download.
This data is available at no charge. Non-members may request this data
by completing a copy of the LDC User Agreement for Non-Members
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to
this address.
------------------------------------------------------------------------
Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium      Phone: 1 (215) 573-1275
University of Pennsylvania      Fax: 1 (215) 573-2175
3600 Market St., Suite 810      ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA      http://www.ldc.upenn.edu