[Corpora-List] MASC data and annotations available for download
Nancy Ide
ide at cs.vassar.edu
Sat Jul 24 17:27:55 UTC 2010
MANUALLY ANNOTATED SUB-CORPUS (MASC)
-------------------------------------------------------------
http://www.anc.org/MASC
Version 1.02 (July 2010) available for download
-----------------------------------------------------------
MASC 1.02 contains 82K words of contemporary written and spoken American English
across a broad range of genres. The entire corpus is annotated for logical structure,
tokens (3 versions) and part of speech (2 versions), sentence boundaries, noun chunks,
verb chunks, Penn Treebank syntax, and named entities. Other annotations include FrameNet frames
and frame elements and Opinion; annotations for TimeBank, PropBank, HPSG, co-reference, event,
and Discourse are in progress. All MASC annotations are manually produced or hand-validated.
MASC 1.02 also includes a separate "sentence corpus" of 1000 sentences for each of 50 words,
manually annotated for WordNet 3.1* senses by several taggers, with inter-annotator agreement
statistics. One hundred of the 1000 sentences for each word are currently being annotated for FrameNet
frames and frame elements. WordNet and FrameNet annotations for an additional 50 words are forthcoming.
All MASC annotations are distributed in the ISO TC37 SC4 GrAF standoff format. The ANC2Go web
application can be used to obtain the annotations in a number of other formats, including
in-line XML (XCES), token/pos, simple NLTK, and CoNLL formats. Tools to import and export
GrAF annotations into and out of GATE and UIMA are also available for download from the MASC and
ANC websites.
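
For readers unfamiliar with standoff annotation, the general idea can be sketched as
follows. This is a minimal illustration only, not the GrAF schema: the Span class, its
fields, the sample sentence, and the tags are all hypothetical. The point is simply that
annotations live apart from the primary text and refer to it by character offsets, so the
text itself is never modified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical standoff annotation; NOT the actual GrAF data model."""
    start: int  # inclusive character offset into the primary text
    end: int    # exclusive character offset into the primary text
    label: str  # illustrative annotation value, e.g. a part-of-speech tag


def resolve(text, spans):
    """Pair each annotation with the stretch of primary text it points at."""
    return [(text[s.start:s.end], s.label) for s in spans]


# The primary text is stored untouched; the annotations are kept separately.
primary_text = "MASC is free."
annotations = [
    Span(0, 4, "NNP"),
    Span(5, 7, "VBZ"),
    Span(8, 12, "JJ"),
    Span(12, 13, "."),
]

pairs = resolve(primary_text, annotations)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Because each layer only points into the text, several annotation layers (for example,
alternative tokenizations) can coexist over the same data without conflicting.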
ALL MASC DATA AND ANNOTATIONS ARE FREELY DISTRIBUTED FOR RESEARCH AND
COMMERCIAL USE.
The full MASC, to be released in fall 2011, will contain 500K words of data with annotations. MASC 2,
available in December 2010, will contain an additional 140K words with annotations.
MASC 1 and 2 texts are available for separate download to enable others to annotate the data and
contribute the annotations to this community-developed resource. MASC 3 texts will be available this fall.
==============================================================================
We invite contributions of linguistic annotations of any portion of MASC data, in any format.
We also invite contributions of unencumbered texts for inclusion in MASC and/or the Open American
National Corpus.
==============================================================================
Please consult the MASC website (http://www.anc.org/MASC) or contact anc at anc.org for additional
information. See also:
Ide, Nancy; Baker, Collin; Fellbaum, Christiane; and Passonneau, Rebecca (2010).
MASC: A Community Resource For and By the People. Proceedings of the 48th Annual
Conference of the Association for Computational Linguistics, Uppsala, Sweden.
http://aclweb.org/anthology-new/P/P10/P10-2013.pdf
References to additional MASC publications, including inter-annotator agreement studies,
are available on the MASC website at http://www.anc.org/MASC/Publications.html.
----------------------------------------------------------------------------------------------------------------------------
* Yes, we mean 3.1. Please see the MASC website or the ACL 2010 paper cited above.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora