[Corpora-List] 500K MASC Release Candidate Available for download

Alexander Osherenko osherenko at gmx.de
Thu Oct 18 16:25:23 UTC 2012


Nancy, I've read your paper about MultiMASC. Very interesting!

I wonder whether MASC also includes a dialogue genre among its genres.

I've also seen in the FAQ that the ANC contains demographic information such
as age, gender, national origin, and race. Can you point to any studies in
this field?

2012/10/15 Nancy Ide <ide at cs.vassar.edu>

>
> ****************************************************
> Manually Annotated Sub-Corpus (MASC)
> Release Candidate Version
> www.anc.org/MASC/download/MASC-3.0.0-RC1.tgz (.zip)
> ****************************************************
>
> All Open ANC and MASC data and annotations are freely downloadable for
> any use (including commercial).
>
> The American National Corpus project has produced a "release candidate" of
> the full 500K Manually Annotated Sub-Corpus (MASC), which is available for
> download from the ANC site (www.anc.org/download/MASC-3.0.0-RC1.tgz or .zip).
> The final release, which will include full documentation and enhanced tool
> support, will be available by mid-November. The final release will also be
> freely distributed through the Linguistic Data Consortium.
>
> The release candidate includes the 82K MASC I, released in 2010, which is
> fully documented at www.anc.org/MASC. The full MASC includes a 500K balanced
> set of nineteen genres of written and spoken American English data annotated
> for logical structure (paragraph, headings, etc.), token and sentence
> boundaries, part of speech and lemma, shallow parse (noun and verb chunks),
> and named entities (person, organization, location, date). Portions of the
> corpus are also annotated for FrameNet frames (40K full text), Penn Treebank
> syntax (82K), and Opinion (50K). All annotations are either manually produced
> or hand-validated, and represented in ISO-GrAF standoff format.
>
> The MASC I Sentence Corpus containing WordNet 3.1 sense annotations of 1000
> occurrences for 50 words, accompanied by inter-annotator agreement measures,
> is available for download from the MASC site. The complete Sentence Corpus,
> including annotations of 1000 occurrences for 114 words and complementary
> annotation of 100 sentences per word for FrameNet frames, will be available
> by the end of the year.
>
> Co-reference annotation of the full MASC will also be added by the end of
> the year. Penn Treebank syntax for the remaining 418K of the corpus will be
> available in late spring 2013. Currently, PropBank annotations of 50K of the
> corpus are available in their original format. TimeML annotations of the
> same 50K are near completion. Both PropBank and TimeML annotations will be
> made available in ISO-GrAF format.
>
> MultiMASC
> ************
> We are currently seeking community members who will develop open corpora
> in their own languages that are comparable to MASC in composition and,
> ultimately, annotations. Please see Ide, N. (2012). MultiMASC: An Open
> Linguistic Infrastructure for Language Research
> <http://www.cs.vassar.edu/~ide/papers/comparative.pdf>. Proceedings of the
> Fifth Workshop on Building and Using Comparable Corpora. Contact
> anc at anc.org if you are interested in contributing to MultiMASC.
>
>
> ******************************************************************************************************
> MASC is a *collaborative community effort* and we welcome contributions
> of annotations in any format and/or data, as well as feedback on the
> resource.
>
> ******************************************************************************************************
>
> The American National Corpus Project
> Department of Computer Science, Vassar College, New York, USA
> email: anc at anc.org • web: www.anc.org
>
>
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>