[Corpora-List] New MASC data and annotations available

Nancy Ide ide at cs.vassar.edu
Thu Mar 24 20:34:17 UTC 2011


                                   Manually Annotated Sub-Corpus
                                 http://www.anc.org/MASC

              *** All downloads available at http://www.anc.org/MASC/Download.html ***
MASC1 (82K words with multiple layers of annotation) is also available from the Linguistic Data Consortium

MASC texts
--------------
The full 500K words of MASC spoken and written texts are now available for download from the MASC website.
The corpus comprises roughly 25K words from each of the 19 genres listed below:

Genre                 No. files   No. words   Pct. of corpus
Court transcript              2       30052               6%
Debate transcript             2       32325               6%
Email                        78       27642               6%
Essay                         7       25590               5%
Fiction                       5       31518               6%
Gov't documents               5       24578               5%
Journal                      10       25635               5%
Letters                      40       23325               5%
Newspaper/newswire           41       23545               5%
Non-fiction                   4       25182               5%
Spoken                       11       25783               5%
Technical                     7       25426               5%
Travel guides                 7       26708               5%
Twitter                       2       24180               5%
Blog                         21       28199               6%
Ficlets                       5       26299               5%
Movie script                  2       28240               6%
Spam                        110       23490               5%
Jokes                        16       26582               5%
TOTAL                       375      504299

***************************************************************************************************************
We invite contributions of linguistic annotations of any kind, in any format, for any portion of the data.
Contributed annotations will be made available to the community both in their original format and in GrAF
format, which is compatible with the other annotations of the data.
***************************************************************************************************************

New Annotations
---------------------
We have also made available Propbank annotations of a 40K subset of MASC that has been heavily
annotated by multiple groups for many different linguistic phenomena. These are currently distributed in the
original Propbank format, together with the Penn Treebank annotations on which they rely. The GrAF version
of the Propbank annotations will be made available this summer.

+-----------------------------------------------------------------------------------------+
|  MASC IS DEVELOPED AND DISTRIBUTED BY THE AMERICAN NATIONAL CORPUS PROJECT, WHICH IS    |
|  COMMITTED TO PROVIDING OPEN DATA. ALL MASC DATA AND ANNOTATIONS ARE FREELY DISTRIBUTED |
|  AND MAY BE USED AND REDISTRIBUTED FOR ANY PURPOSE, INCLUDING COMMERCIAL.               |
+-----------------------------------------------------------------------------------------+

