[Corpora-List] Incidence of MWEs
Serge Sharoff
S.Sharoff at leeds.ac.uk
Tue Mar 14 13:39:49 UTC 2006
After Adam's response, it looks pointless to produce any figure below 30%, but I'll take the risk. Adam is right that the question is bottomless and offers many answers. However, we can operationalise the question by restricting the set of options to what is measurable:
> * are you counting types or tokens? (Exercise: what is the proportion
> of multiwords in the mini-corpus comprising the single sentence, "Apple
> pie is apple pie.")
Tokens: the coverage of original single-word tokens in a corpus by a list of MWEs.
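To make this measure concrete, here is a minimal sketch in Python applied to Adam's mini-corpus. It assumes whitespace tokenisation, case-insensitive matching, and greedy longest-first matching without overlaps. The function name `mwe_token_coverage` is hypothetical and is not the procedure used in any of the studies mentioned in this thread.

```python
def mwe_token_coverage(tokens, mwes):
    """Fraction of single-word tokens that fall inside a match of a listed MWE."""
    # Assumption: matching is case-insensitive, so "Apple pie" matches "apple pie".
    tokens = [t.lower() for t in tokens]
    # Try longer expressions first so that a long MWE wins over a shorter one.
    patterns = sorted((tuple(m.lower().split()) for m in mwes), key=len, reverse=True)
    covered = 0
    i = 0
    while i < len(tokens):
        for p in patterns:
            if tuple(tokens[i:i + len(p)]) == p:
                covered += len(p)  # every token of the match counts as covered
                i += len(p)        # skip past the match: no overlapping matches
                break
        else:
            i += 1                 # no MWE starts at this position
    return covered / len(tokens) if tokens else 0.0


# Adam's mini-corpus, with the final full stop dropped.
tokens = "Apple pie is apple pie".split()
coverage = mwe_token_coverage(tokens, ["apple pie"])
print(coverage)  # 4 of the 5 tokens are covered
```

Under these assumptions the token-based answer to the exercise is 4 out of 5; a type-based count would give a different figure.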
> * what sublanguages do you include - all, some, none? ("mid off" is a
> MWE for anyone who knows cricket but not for anyone who doesn't)
The count will indeed be corpus-specific, but I suspect that for written corpora the figure will be more or less stable (like the Zipfian value for a variety of sublanguages).
> * how much variation (morphological, syntactic, lexical, modifiers)
> can there be, with it still being the same MWE (or, an MWE at all)
> (Rosamund Moon's example, are "shake in one's shoes", "quake in one's
> boots" and "quake in one's Doc Marten's" all the same MWE?)
My hunch is that on a BNC-sized corpus the contribution of this variation will be relatively small.
> * are frequencies or statistics part of the definition? (Theorists
Yes; after all, we're answering the question about their frequency. It's unlikely that infrequent MWEs will affect the final count (again the Zipfian theme).
The trickiest question is about the definition of a (frequent) MWE.
> * is non-compositionality a part of the definition?
I attempted to count the number of MWEs of a specific type: a preposition followed by a noun phrase, where the whole expression is non-compositional (so frequent compositional constructions, like 'by the sea', were discarded); see
Serge Sharoff (2004) What is at stake: a case study of Russian expressions starting with a preposition. In Proc. of ACL04 Workshop Multiword Expressions: Integrating Processing. Barcelona, Spain, July, 2004. 17-23.
http://acl.ldc.upenn.edu/acl2004/mwe/pdf/sharoff.pdf
It's possible to produce a more or less precise list of expressions of this sort and detect their occurrences in a larger corpus. The coverage (consistent across several large Russian corpora) is about 2%. That's well below 30%, but again this is one subtype from a much wider range: the figure is the coverage for 720 expressions. There are still open questions for the chosen operationalisation:
1. do we count obligatory elements around chosen MWEs? ('at stake' doesn't cover 'to be at stake', even if the latter can be an MWE of its own)
2. do we insist on non-compositionality? (that study excluded 'by the window' from the list of candidates on the grounds that it's compositional; this expression will be outside the list of MWEs in any reasonable count, but how about 'by the sea'? to what extent is it compositional?) Is there any agreed list of MWEs?
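The effect of the first question on the figure can be illustrated with a small Python sketch. The sentence is made up, the function name `covered_fraction` is hypothetical, and matching is on surface forms only, so the inflected 'is at stake' stands in for 'to be at stake' (a real count would need lemmatisation).

```python
def covered_fraction(tokens, expressions):
    """Share of tokens lying inside at least one occurrence of a listed expression."""
    tokens = [t.lower() for t in tokens]
    covered = [False] * len(tokens)
    for expr in expressions:
        words = expr.lower().split()
        n = len(words)
        # Slide a window of the expression's length over the token sequence.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                covered[i:i + n] = [True] * n  # mark every token of the match
    return sum(covered) / len(tokens) if tokens else 0.0


# Made-up example sentence (an assumption, not corpus data).
sentence = "a lot is at stake here".split()
narrow = covered_fraction(sentence, ["at stake"])     # bare expression only
wide = covered_fraction(sentence, ["is at stake"])    # with the obligatory verb
print(narrow, wide)
```

On this toy sentence the bare expression covers 2 of 6 tokens and the extended one 3 of 6, so the decision about obligatory elements changes the coverage figure directly.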
Serge