[Corpora-List] New MASC data and annotations available
Nancy Ide
ide at cs.vassar.edu
Thu Mar 24 20:34:17 UTC 2011
Manually Annotated Sub-Corpus
http://www.anc.org/MASC
*** All downloads available at http://www.anc.org/MASC/Download.html ***
MASC1 (82K words with multiple layers of annotation) is also available from the Linguistic Data Consortium
MASC texts
--------------
The full 500K words of MASC spoken and written texts are now available for download from the MASC website.
The corpus comprises roughly 25K words from each of the 19 genres listed below:
Genre                  No. files   No. words   Pct corpus
--------------------------------------------------------
Court transcript               2       30052           6%
Debate transcript              2       32325           6%
Email                         78       27642           6%
Essay                          7       25590           5%
Fiction                        5       31518           6%
Gov't documents                5       24578           5%
Journal                       10       25635           5%
Letters                       40       23325           5%
Newspaper/newswire            41       23545           5%
Non-fiction                    4       25182           5%
Spoken                        11       25783           5%
Technical                      7       25426           5%
Travel guides                  7       26708           5%
Twitter                        2       24180           5%
Blog                          21       28199           6%
Ficlets                        5       26299           5%
Movie script                   2       28240           6%
Spam                         110       23490           5%
Jokes                         16       26582           5%
TOTAL                        375      504299
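
For anyone who wants to check the figures above or recompute each genre's share, here is a minimal Python sketch. The file and word counts are copied from the table; the names GENRES and corpus_summary are made up for this illustration and are not part of any MASC tool. Shares are computed against the exact word total, so they may differ slightly from the rounded percentages listed above.

```python
# Per-genre (number of files, number of words), copied from the table above.
GENRES = {
    "Court transcript": (2, 30052),
    "Debate transcript": (2, 32325),
    "Email": (78, 27642),
    "Essay": (7, 25590),
    "Fiction": (5, 31518),
    "Gov't documents": (5, 24578),
    "Journal": (10, 25635),
    "Letters": (40, 23325),
    "Newspaper/newswire": (41, 23545),
    "Non-fiction": (4, 25182),
    "Spoken": (11, 25783),
    "Technical": (7, 25426),
    "Travel guides": (7, 26708),
    "Twitter": (2, 24180),
    "Blog": (21, 28199),
    "ficlets": (5, 26299),
    "movie script": (2, 28240),
    "spam": (110, 23490),
    "jokes": (16, 26582),
}


def corpus_summary(genres):
    """Return (total files, total words, {genre: percent of all words})."""
    total_files = sum(files for files, _ in genres.values())
    total_words = sum(words for _, words in genres.values())
    shares = {
        name: 100.0 * words / total_words
        for name, (_, words) in genres.items()
    }
    return total_files, total_words, shares


total_files, total_words, shares = corpus_summary(GENRES)
print(f"{len(GENRES)} genres, {total_files} files, {total_words} words")
for name, pct in shares.items():
    print(f"{name:<20} {pct:5.1f}%")
```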
***************************************************************************************************************
We invite contributions of linguistic annotations of any kind, in any format, for any portion of the data.
Contributed annotations will be made available to the community both in their original format and in GrAF
format, compatible with other annotations of the data.
***************************************************************************************************************
New Annotations
---------------------
We have also made available Propbank annotations of a 40K subset of MASC that has been heavily
annotated by multiple groups for many different linguistic phenomena. These are currently distributed in the
original Propbank format (together with the Penn Treebank annotations on which they rely). The GrAF version
of the Propbank annotations will be made available this summer.
+-----------------------------------------------------------------------------------------+
| MASC IS DEVELOPED AND DISTRIBUTED BY THE AMERICAN NATIONAL CORPUS PROJECT, WHICH IS |
| COMMITTED TO PROVIDING OPEN DATA. ALL MASC DATA AND ANNOTATIONS ARE FREELY DISTRIBUTED |
| AND MAY BE USED AND REDISTRIBUTED FOR ANY PURPOSE, INCLUDING COMMERCIAL. |
+-----------------------------------------------------------------------------------------+