[Corpora-List] MASC data and annotations available for download

Nancy Ide ide at cs.vassar.edu
Sat Jul 24 17:27:55 UTC 2010


                      MANUALLY ANNOTATED SUB-CORPUS (MASC)
                      -------------------------------------------------------------
                                    http://www.anc.org/MASC

                     Version 1.02 (July 2010) available for download
                     -----------------------------------------------------------

MASC 1.02 contains 82K words of contemporary written and spoken American English
across a broad range of genres. The entire corpus is annotated for logical structure, 
tokens (3 versions) and part of speech (2 versions), sentence boundaries, noun chunks, 
verb chunks, Penn Treebank syntax, and named entities. Other annotations include FrameNet frames
and frame elements and Opinion; annotations for TimeBank, PropBank, HPSG, co-reference, event,
and Discourse are in process. All MASC annotations are manually produced or hand-validated.

MASC 1.02 also includes a separate "sentence corpus" of 1000 sentences for each of 50 words,
manually annotated for WordNet 3.1* senses by several taggers, together with inter-annotator agreement
statistics. One hundred of the 1000 sentences for each word are currently being annotated for FrameNet
frames and frame elements. WordNet and FrameNet annotations for an additional 50 words are forthcoming.

All MASC annotations are distributed in the ISO TC37 SC4 GrAF standoff format. The ANC2Go web
application can be used to obtain the annotations in a number of other formats, including 
in-line XML (XCES), token/pos, simple NLTK, and CoNLL formats. Tools to import and export
GrAF annotations into and out of GATE and UIMA are also available for download from the MASC and
ANC websites.

ALL MASC DATA AND ANNOTATIONS ARE FREELY DISTRIBUTED FOR RESEARCH AND 
COMMERCIAL USE.

The full MASC, to be released in fall 2011, will contain 500K words of data with annotations. MASC 2,
available in December 2010, will contain an additional 140K words with annotations.

MASC 1 and 2 texts are available for separate download to enable others to annotate the data and 
contribute the annotations to this community-developed resource. MASC 3 texts will be available this fall.

==============================================================================
We invite contributions of linguistic annotations of any portion of MASC data, in any format.
We also invite contributions of unencumbered texts for inclusion in MASC and/or the Open American
National Corpus.
==============================================================================

Please consult the MASC website (http://www.anc.org/MASC) or contact anc at anc.org for additional 
information. See also:

Ide, Nancy; Baker, Collin; Fellbaum, Christiane; and Passonneau, Rebecca (2010). 
MASC: A Community Resource For and By the People. Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics, Uppsala, Sweden.
http://aclweb.org/anthology-new/P/P10/P10-2013.pdf

References to additional MASC publications, including inter-annotator agreement studies, 
are available on the MASC website at http://www.anc.org/MASC/Publications.html.

----------------------------------------------------------------------------------------------------------------------------
* Yes, we mean 3.1. Please see the MASC website or the ACL 2010 paper cited above.
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


