[Corpora-List] 500K MASC Release Candidate Available for download
Nancy Ide
ide at cs.vassar.edu
Mon Oct 15 17:40:45 UTC 2012
****************************************************
Manually Annotated Sub-Corpus (MASC)
Release Candidate Version
www.anc.org/MASC/download/MASC-3.0.0-RC1.tgz (.zip)
*****************************************************
All Open ANC and MASC data and annotations are freely downloadable for any use
(including commercial).
The American National Corpus project has produced a "release candidate" of the full 500K
Manually Annotated Sub-Corpus (MASC), which is available for download from the ANC site
(www.anc.org/download/MASC-3.0.0-RC1.tgz or .zip). The final release, which will include
full documentation and enhanced tool support, will be available by mid-November. The final
release will also be freely distributed through the Linguistic Data Consortium.
The release candidate includes the 82K MASC I, released in 2010, which is fully documented at
www.anc.org/MASC. The full MASC includes a 500K balanced set of nineteen genres of written
and spoken American English data annotated for logical structure (paragraphs, headings, etc.), token
and sentence boundaries, part of speech and lemma, shallow parse (noun and verb chunks), and
named entities (person, organization, location, date). Portions of the corpus are also annotated for
FrameNet frames (40K full text), Penn Treebank syntax (82K), and Opinion (50K). All annotations
are either manually produced or hand-validated, and are represented in ISO-GrAF standoff format.
The MASC I Sentence Corpus, containing WordNet 3.1 sense annotations of 1000 occurrences each for 50
words and accompanied by inter-annotator agreement measures, is available for download from the MASC
site. The complete Sentence Corpus, including annotations of 1000 occurrences each for 114 words and
complementary annotation of 100 sentences per word for FrameNet frames, will be available by
the end of the year.
Co-reference annotation of the full MASC will also be added by the end of the year. Penn Treebank
syntax for the remaining 418K of the corpus will be available in late spring 2013. Currently, PropBank
annotations of 50K of the corpus are available in their original format. TimeML annotations of the same
50K are near completion. Both PropBank and TimeML annotations will be made available in ISO-GrAF
format.
MultiMASC
************
We are currently seeking community members who will develop open corpora in their own languages
that are comparable to MASC in composition and, ultimately, annotations. Please see Ide, N. (2012).
MultiMASC: An Open Linguistic Infrastructure for Language Research. Proceedings of the Fifth Workshop
on Building and Using Comparable Corpora. Contact anc at anc.org if you are interested in contributing to
MultiMASC.
******************************************************************************************************
MASC is a collaborative community effort, and we welcome contributions of annotations (in any
format) and/or data, as well as feedback on the resource.
******************************************************************************************************
The American National Corpus Project
Department of Computer Science, Vassar College, New York, USA
email: anc at anc.org • web: www.anc.org