[Corpora-List] Comprehensive Multiword Expressions (CMWE) Corpus
Nathan Schneider
nschneid at cs.cmu.edu
Wed May 28 08:14:03 UTC 2014
Greetings,
The Comprehensive Multiword Expressions (CMWE) Corpus consists of 55,000
words of English web reviews that have been manually annotated for
heterogeneous multiword expressions.
Annotations are shallow but comprehensive: proceeding sentence by sentence,
our annotators grouped tokens into MWEs according to guidelines that cover
a broad range of multiword phenomena, including (but not limited to)
compound nominals, light verb constructions, verb-particle constructions,
prepositional verbs, and multiword named entities. In all, 3,500 MWE
instances are marked, 500 of which are discontinuous (contain a gap). The
annotation
scheme makes a qualitative distinction between "strong" (highly
idiosyncratic) and "weak" (loosely collocational) expressions.
For example,
    I will sum_ it _up~with , it was worth_every_penny !
is annotated as containing 2 strong MWEs (sum_up, worth_every_penny) and 1
weak MWE (sum_up~with). Every sentence was reviewed by at least two
annotators.
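To make the notation in the example concrete, here is a minimal Python sketch that reads a sentence written in this inline markup and recovers the strong and weak groupings. It assumes only what the example shows: "_" joins tokens into a strong MWE, "~" joins them into a weak MWE, and a joiner left dangling at the end of one token is picked up by the next token that begins with a joiner (the gap). The function names are hypothetical, and the files distributed with the corpus may well use a different format, so treat this as an illustration of the scheme rather than a loader for the release.

```python
import re


def _groups(n, links):
    """Union-find over word indices; return groups with more than one member."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in links:
        parent[find(i)] = find(j)
    grouped = {}
    for i in range(n):
        grouped.setdefault(find(i), []).append(i)
    return [g for g in grouped.values() if len(g) > 1]


def parse_mwe_markup(sentence):
    """Return (words, strong_groups, weak_groups) for the inline notation.

    Assumption: at most one gap is open at a time, and "_" / "~" never
    occur as ordinary characters inside a word.
    """
    words = []
    links = []       # (i, j, joiner) with joiner "_" (strong) or "~" (weak)
    open_gap = None  # index of the word that left a joiner dangling

    for chunk in sentence.split():
        # "sum_" -> ["sum", "_", ""];  "_up~with" -> ["", "_", "up", "~", "with"]
        parts = re.split(r"([_~])", chunk)
        prev, joiner = None, None
        for k, part in enumerate(parts):
            if k % 2 == 1:
                joiner = part
                continue
            if part == "":
                continue
            idx = len(words)
            words.append(part)
            if prev is not None:
                links.append((prev, idx, joiner))
            elif k > 0 and open_gap is not None:
                # Leading joiner: resume the expression interrupted by a gap.
                links.append((open_gap, idx, joiner))
                open_gap = None
            prev = idx
        if len(parts) > 1 and parts[-1] == "" and prev is not None:
            open_gap = prev  # trailing joiner: expression continues later

    # Strong groups use only "_" links.
    strong = _groups(len(words), [(i, j) for i, j, kind in links if kind == "_"])
    # Weak groups use all links, keeping only groups that involve a "~".
    weak_members = {i for i, j, kind in links if kind == "~"}
    weak = [g for g in _groups(len(words), [(i, j) for i, j, _ in links])
            if weak_members & set(g)]

    def to_words(group):
        return [words[i] for i in group]

    return words, [to_words(g) for g in strong], [to_words(g) for g in weak]


words, strong, weak = parse_mwe_markup(
    "I will sum_ it _up~with , it was worth_every_penny !")
print(strong)  # [['sum', 'up'], ['worth', 'every', 'penny']]
print(weak)    # [['sum', 'up', 'with']]
```

The output agrees with the description above: two strong MWEs and one weak MWE, with "it" left outside sum_up as the gap.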
This resource is described in the paper:
Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T.
Mordowanec, Henrietta Conrad, and Noah A. Smith. Comprehensive annotation
of multiword expressions in a social web corpus. LREC 2014.
<http://www.cs.cmu.edu/%7Enschneid/mwecorpus.pdf>
The annotations can be downloaded from:
http://www.ark.cs.cmu.edu/LexSem/
That page also links to the annotation guidelines, as well as an open-source
MWE identification tool trained and evaluated on the corpus.
Cheers,
Nathan & collaborators at CMU