[Corpora-List] Syntactic annotations and co-reference annotations now available for the Open American National Corpus (OANC)

Nancy Ide ide at cs.vassar.edu
Wed Nov 10 15:12:29 UTC 2010


              *******************************************************************
              Three syntactic annotations of 11 million words of the Open ANC
              *******************************************************************

The American National Corpus (ANC) project has received a contribution of three syntactic parses
for 11 million of the 15 million words of the Open American National Corpus, which are now freely
available for download from the ANC website. The annotations were automatically produced
using the Charniak & Johnson (2005) parser, the MaltParser (Nivre et al., 2007), and the LTH dependency
converter (Johansson & Nugues, 2007). The annotations were contributed by Rasul Kalajahi.

The download contains the input to and output from each parser, in Penn Treebank and CoNLL formats.
The ANC project is in the process of generating a version of these annotations in standoff GrAF
format so that they may be combined with other OANC annotations using the ANC2Go web
application (http://www.anc.org:8080/ANC2Go) or the stand-alone ANCTool.

              ********************************************************************************
              Manually-generated coreference annotations of 128K words of the Open ANC
              ********************************************************************************

Shane Bergsma of the University of Alberta has annotated a subset of the Slate journal data for coreference
(anaphora). The annotations consist of pronoun-antecedent pairs in 118 documents (128,717 words) from
the Slate data of the ANC/OANC. The data include a test set and a training set; there are 1398 labeled 
pronouns in 78 documents in the training set and 1381 labeled pronouns in 40 documents in the test set.

At present these annotations are provided as a separate corpus in the standoff XCES format used for the ANC First
and Second releases and the current version of the OANC. (A release of the OANC in GrAF format, which will supersede the
current XCES format, will be available at the end of this month.) A GrAF version of the coreference annotations
is also being produced.

All annotations of the OANC are available at http://www.anc.org/annotations.html

------------------------------------------------------------------------------------------------------------------------------
The ANC welcomes contributions of annotations, texts, and derived data, which we release for
free download by the community from our website. ANC, OANC, and MASC data and annotations are,
or will be, also available through the Linguistic Data Consortium. To contribute, send email to
anc at anc.org or consult http://www.anc.org/contribute.html.

==============================================================================
THE ANC PROJECT IS COMMITTED TO OPEN DATA FOR LANGUAGE RESEARCH, DEVELOPMENT,
AND EDUCATION. ALL CONTRIBUTIONS OF BOTH DATA AND ANNOTATIONS SHOULD BE
UNENCUMBERED BY LICENSING RESTRICTIONS. ALL CONTRIBUTIONS ARE MADE FREELY AVAILABLE
FOR USE BY THE COMMUNITY.
===============================================================================

NOTE: The reference link for the BBN Named Entity tagger given in the Nov. 9 notice concerning the release
of OANC annotations was incorrect. The correct link is http://www.aclweb.org/anthology-new/N/N04/N04-1043.pdf.