[Corpora-List] New from LDC

Linguistic Data Consortium ldc at ldc.upenn.edu
Tue Feb 1 21:39:54 UTC 2011


New publications:

- ACE 2005 English SpatialML Annotations Version 2

- SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in 
Multiple Languages

------------------------------------------------------------------------

(1) ACE 2005 English SpatialML Annotations Version 2 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T02> 
was developed by researchers at The MITRE Corporation 
<http://www.mitre.org/> and applies SpatialML tags to the English 
newswire and broadcast training data annotated for entities, relations 
and events in ACE 2005 Multilingual Training Corpus LDC2006T06 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T06>. This 
second version eliminates a number of annotation inconsistencies and 
errors identified in ACE 2005 English SpatialML Annotations LDC2008T03 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T03>. In 
addition, the SpatialML annotation schema has been updated from version 
2.0 to version 3.0.1; the revised annotation guidelines are included in 
this release.

The ACE (Automatic Content Extraction) program focused on developing 
automatic content extraction technology to support automatic processing 
of human language in text form, specifically entities, values, 
temporal expressions, relations and events. SpatialML is a mark-up 
language for representing spatial expressions in natural language 
documents. It is intended to emulate earlier progress on time 
expressions, such as TIMEX2 <http://fofoca.mitre.org/>, TimeML 
<http://timeml.org/site/index.html>, and the 2005 ACE guidelines 
<http://www.itl.nist.gov/iad/mig/tests/ace/2005/doc/ace05eval_official_results_20060110.html>.

SpatialML includes syntax for marking up PLACEs mentioned in text and 
for linking them to data from gazetteers and other databases. LINKs are 
used to express relations between places, and RLINKs to capture 
trajectories for relative locations. To the extent possible, SpatialML 
leverages ISO and other standards with the goal of making the scheme 
compatible with existing and future corpora. SpatialML goes beyond these 
schemes, however, in terms of providing a richer markup for natural 
language that includes semantic features and relationships that allow 
mapping to existing resources such as gazetteers. Such markup can be 
useful for disambiguation, integration with mapping services and spatial 
reasoning.

This corpus contains 210,065 total words and 17,821 unique words. Counts 
of unique words can be found in doc/ldc_wordcount.csv, which includes all 
words that are not part of XML markup (e.g., without tag names, 
attribute names or values). Unique words are counted by comparing 
case-insensitive forms with leading and trailing punctuation 
stripped off. "Words" consisting solely of punctuation are discarded.
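
The counting rule above can be sketched in a few lines of Python. This 
is only an illustration, not the script used to produce 
doc/ldc_wordcount.csv: it assumes that the XML markup has already been 
removed, that tokens are separated by whitespace, and that 
"punctuation" means the ASCII set in string.punctuation. The function 
names are hypothetical.

```python
import string
from collections import Counter


def normalize(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.strip(string.punctuation).lower()


def count_words(text):
    """Return (total word count, Counter of unique normalized words)."""
    counts = Counter()
    for token in text.split():  # assumption: whitespace tokenization
        word = normalize(token)
        if word:  # tokens made only of punctuation become "" and are dropped
            counts[word] += 1
    return sum(counts.values()), counts


total, counts = count_words('The cat saw "the" dog -- twice.')
print(total, len(counts))  # 6 total words, 5 unique words
```

Here "The" and the quoted "the" collapse into a single entry, and the 
bare "--" is discarded.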

The principal change in the annotation schema is that "PATH" has been 
generalized to "RLINK" for relative link. At the top level, there is now 
a version attribute on the root SpatialML tag to capture which version 
of SpatialML was used. A number of smaller changes have been made to the 
annotation specification; these are listed in Section 2 of the updated 
guidelines.
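
As a rough illustration of the two schema changes, the sketch below 
reads the version attribute from the root tag and counts "PATH" and 
"RLINK" elements. The sample document is invented for this example: the 
root tag name "SpatialML" and the id attributes are assumptions, and 
the files in the release may be structured differently, so the 
annotation guidelines remain the reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; not taken from the corpus.
sample = (
    '<SpatialML version="3.0.1">'
    '<PLACE id="p1">Philadelphia</PLACE>'
    '<RLINK id="r1"/>'
    '</SpatialML>'
)


def describe(xml_text):
    """Return (version, number of PATH tags, number of RLINK tags)."""
    root = ET.fromstring(xml_text)
    version = root.get("version")  # None if the attribute is absent
    n_path = sum(1 for _ in root.iter("PATH"))
    n_rlink = sum(1 for _ in root.iter("RLINK"))
    return version, n_path, n_rlink


print(describe(sample))  # ('3.0.1', 0, 1)
```

A document with no version attribute and with "PATH" elements would 
suggest annotations made under the older schema.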



(2) SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in 
Multiple Languages 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2011T01> 
is a subset of OntoNotes Release 2.0 LDC2008T04 
<http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2008T04> 
used in SemEval-2010 Task 1 <http://stel.ub.edu/semeval2010-coref/home>, 
Coreference Resolution in Multiple Languages. OntoNotes Release 2.0 
consists of roughly 500,000 words of English broadcast and newswire data 
annotated with structural information (syntax and predicate argument 
structure) and shallow semantics (word sense linked to an ontology and 
coreference). This SemEval-2010 Task 1 release contains approximately 
120,000 words extracted from the OntoNotes corpus and formatted for the 
SemEval task.

SemEval (Semantic Evaluation) is an ongoing series of evaluations of 
computational semantic analysis systems. The goal of SemEval-2010 Task 1 
was to evaluate and compare automatic coreference resolution systems for 
six languages (Catalan, Dutch, English, German, Italian and Spanish) in 
four evaluation settings using four metrics. Further information about 
Task 1 can be found on the task description website 
<http://stel.ub.edu/semeval2010-coref/node/7>.

The data is divided into three sets: the development set, which contains 
39 documents, 741 sentences and 17,044 tokens; the training set, which 
contains 229 documents, 3,648 sentences and 79,060 tokens; and the test 
set, which contains 85 documents, 1,141 sentences and 24,206 tokens. The 
complete material for training systems is the sum of the development and 
training sets.
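
The figures above can be combined to check the size of the full 
training material and of the release as a whole. The snippet below is a 
small worked calculation using only the numbers stated in this 
announcement; the dictionary layout and function name are illustrative.

```python
splits = {
    "development": {"documents": 39, "sentences": 741, "tokens": 17044},
    "training": {"documents": 229, "sentences": 3648, "tokens": 79060},
    "test": {"documents": 85, "sentences": 1141, "tokens": 24206},
}


def combine(*names):
    """Sum the document, sentence and token counts of the named sets."""
    keys = ("documents", "sentences", "tokens")
    return {k: sum(splits[n][k] for n in names) for k in keys}


train_material = combine("development", "training")
everything = combine("development", "training", "test")
print(train_material)  # {'documents': 268, 'sentences': 4389, 'tokens': 96104}
print(everything)      # {'documents': 353, 'sentences': 5530, 'tokens': 120310}
```

The overall total of 120,310 tokens agrees with the approximately 
120,000 words stated for this release.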

SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in 
Multiple Languages is distributed via web download.

This data is available at no charge.  Non-members may request this data 
by completing a copy of the LDC User Agreement for Non-Members 
<http://www.ldc.upenn.edu/Membership/Agreements/licenses/genericlicense.pdf>.  
The agreement can be faxed to +1 215 573 2175 or scanned and emailed to 
this address.
------------------------------------------------------------------------

Ilya Ahtaridis
Membership Coordinator
--------------------------------------------------------------------
Linguistic Data Consortium                  Phone: 1 (215) 573-1275
University of Pennsylvania                    Fax: 1 (215) 573-2175
3600 Market St., Suite 810                     ldc at ldc.upenn.edu
Philadelphia, PA 19104 USA                 http://www.ldc.upenn.edu
