[Corpora-List] 500K MASC Release Candidate Available for download

Nancy Ide ide at cs.vassar.edu
Mon Oct 15 17:40:45 UTC 2012


****************************************************
        Manually Annotated Sub-Corpus (MASC)
             Release Candidate Version
www.anc.org/MASC/download/MASC-3.0.0-RC1.tgz (.zip)
****************************************************

All Open ANC and MASC data and annotations are freely downloadable for any use
(including commercial).

The American National Corpus project has produced a "release candidate" of the full 500K 
Manually Annotated Sub-Corpus (MASC), which is available for download from the ANC site
(www.anc.org/download/MASC-3.0.0-RC1.tgz or .zip). The final release, which will include
full documentation and enhanced tool support, will be available by mid-November. The final 
release will also be freely distributed through the Linguistic Data Consortium. 
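Once the archive has been downloaded, a quick inventory of its contents can be a useful first step. The following is only a minimal sketch using Python's standard tarfile module; the function name count_by_extension is hypothetical, and nothing here assumes a particular directory layout inside the release.

```python
import os
import tarfile
from collections import Counter


def count_by_extension(archive_path):
    """Count the regular files in a gzipped tar archive, grouped by extension.

    Hypothetical helper for illustration only; it makes no assumption
    about how the MASC release is organized internally.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile():
                # splitext returns ("name", ".ext"); files without an
                # extension are counted under the empty string.
                extension = os.path.splitext(member.name)[1].lower()
                counts[extension] += 1
    return counts
```

For a locally saved copy, a call such as count_by_extension("MASC-3.0.0-RC1.tgz") would return a Counter mapping each file extension to the number of files that carry it.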

The release candidate includes the 82K MASC I, released in 2010, which is fully documented at 
www.anc.org/MASC. The full MASC includes a 500K balanced set of nineteen genres of written 
and spoken American English data annotated for logical structure (paragraphs, headings, etc.), token 
and sentence boundaries, part of speech and lemma, shallow parse (noun and verb chunks), and 
named entities (person, organization, location, date). Portions of the corpus are also annotated for 
FrameNet frames (40K full text), Penn Treebank syntax (82K), and Opinion (50K). All annotations 
are either manually produced or hand-validated, and represented in ISO-GrAF standoff format.
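For readers unfamiliar with standoff annotation, the general idea is that annotations are stored separately from the primary text and point into it by character offsets, so several annotation layers can coexist without altering the text. The sketch below illustrates only that idea; it is not the GrAF XML serialization, and the Span class, the layer names, and the sample sentence are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One standoff annotation: offsets into the text plus a label."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    layer: str   # hypothetical layer name, e.g. "named-entity"
    label: str


def annotated_strings(text, annotations, layer):
    """Return (substring, label) pairs for one layer, in text order."""
    selected = sorted(
        (a for a in annotations if a.layer == layer),
        key=lambda a: a.start,
    )
    return [(text[a.start:a.end], a.label) for a in selected]


# The primary text is never modified; annotations only refer to it.
TEXT = "Vassar College is in New York."

ANNOTATIONS = [
    Span(21, 29, "named-entity", "location"),
    Span(0, 14, "named-entity", "organization"),
    Span(15, 17, "part-of-speech", "VBZ"),
]

if __name__ == "__main__":
    for layer in ("named-entity", "part-of-speech"):
        print(layer, annotated_strings(TEXT, ANNOTATIONS, layer))
```

Because each layer is independent of the text and of the other layers, a new annotation type can be added by appending spans, with no change to existing data.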

The MASC I Sentence Corpus, containing WordNet 3.1 sense annotations of 1000 occurrences of each of 50 
words and accompanied by inter-annotator agreement measures, is available for download from the MASC 
site. The complete Sentence Corpus, including annotations of 1000 occurrences of each of 114 words and 
complementary FrameNet frame annotation of 100 sentences per word, will be available by 
the end of the year.

Co-reference annotation of the full MASC will also be added by the end of the year. Penn Treebank 
syntax for the remaining 418K of the corpus will be available in late spring 2013. Currently, PropBank 
annotations of 50K of the corpus are available in their original format. TimeML annotations of the same 
50K are nearing completion. Both PropBank and TimeML annotations will be made available in ISO-GrAF
format. 

MultiMASC
************
We are currently seeking community members who will develop open corpora in their own languages 
that are comparable to MASC in composition and, ultimately, annotations. Please see Ide, N. (2012). 
MultiMASC: An Open Linguistic Infrastructure for Language Research. Proceedings of the Fifth Workshop 
on Building and Using Comparable Corpora. Contact anc at anc.org if you are interested in contributing to 
MultiMASC. 

******************************************************************************************************
MASC is a collaborative community effort, and we welcome contributions of data and/or annotations
in any format, as well as feedback on the resource.
****************************************************************************************************** 

The American National Corpus Project
Department of Computer Science, Vassar College, New York, USA
email: anc at anc.org • web: www.anc.org


