[Corpora-List] Incidence of MWEs

Tue Mar 14 13:39:00 UTC 2006

Adam Kilgarriff wrote:
>>I was wondering if anyone has estimated the incidence of multi-word 
>>expressions in language. 
> 
> Wonderful, enormous, bottomless question!

In my ignorance I had hoped simply to gather whatever information the 
query turned up. However, I should certainly be more specific.

I'm interested in the effect of MWEs on parser evaluation. Specifically, 
I want to describe the problems they pose for grammar induction.

I presume that idiosyncratic MWEs are somehow treated differently to 
compositional MWEs, in that the latter could easily be incorporated into 
a treebank. Phrasal verbs also seem to be tagged in treebanks, but I'm 
intrigued as to the treatment of phrases like "kick the bucket", and 
perhaps more importantly: "at first", "of course", and other MWEs that 
almost represent "stop-phrases" (as opposed to stop-words). Some of 
these are syntactically valid, but does that mean they would be 
annotated (in phrase-structure terms) in a compositional manner, or 
would the idiomatic reading be preferred?

> *	are you counting types or tokens?  (Exercise: what is the proportion
> of multiwords in the mini-corpus comprising the single sentence, "Apple pie
> is apple pie." )
> *	what sublanguages do you include - all, some, none? ("mid off" is a
> MWE for anyone who knows cricket but not for anyone who doesn't) 
> *	how much variation (morphological, syntactic, lexical, modifiers)
> can there be, with it still being the same MWE (or, an MWE at all)
> (Rosamund Moon's example, are "shake in one's shoes", "quake in one's boots"
> and "quake in one's Doc Marten's" all the same MWE?)
> *	is non-compositionality a part of the definition?
> *	are frequencies or statistics part of the definition? (Theorists
> might not want them to be, but without statistics and thresholds, you won't
> be able to compute a useful answer, and if you do use them, the answer you
> get will depend critically on which statistics and which thresholds you use
> so you had better make principled decisions about them)

In answer to those questions:
1) I'd count tokens;
2) I'd include all sublanguages (since they will presumably be annotated 
correctly);
3) the notion of variation is presumably intrinsically linked with 
non-compositionality;
4) non-compositionality is a requirement in my definition;
5) from an inductive standpoint, I assume that statistics are necessary 
to identify these phrases in a corpus. I further assume that statistics 
are used in parsing, so they should also be used in MWE identification.
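
To make the token/type distinction in (1) concrete, here is a minimal 
sketch of how the counts differ on the "Apple pie is apple pie." exercise. 
It assumes a hand-written MWE lexicon and greedy longest-match lookup; the 
lexicon, the tokeniser and the function names are all hypothetical and only 
for illustration, not a proposal for how identification should really be 
done.

```python
import re

# Hypothetical lexicon of known MWEs (an assumption for illustration only).
MWE_LEXICON = {("apple", "pie"), ("of", "course"), ("at", "first")}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def mwe_counts(text, lexicon=MWE_LEXICON):
    """Count MWEs by token and by type, using greedy longest match.

    Each matched MWE is treated as a single lexical unit; every other
    word is a unit on its own.
    """
    words = tokenize(text)
    max_len = max((len(m) for m in lexicon), default=1)
    units = []  # list of (words_in_unit, is_mwe)
    i = 0
    while i < len(words):
        # Try the longest candidate first, down to two words.
        for n in range(max_len, 1, -1):
            cand = tuple(words[i:i + n])
            if len(cand) == n and cand in lexicon:
                units.append((cand, True))
                i += n
                break
        else:
            units.append(((words[i],), False))
            i += 1
    types = set(units)
    return {
        "mwe_tokens": sum(1 for _, is_mwe in units if is_mwe),
        "total_tokens": len(units),
        "mwe_types": sum(1 for _, is_mwe in types if is_mwe),
        "total_types": len(types),
    }


print(mwe_counts("Apple pie is apple pie."))
```

Under these assumptions the sentence has three lexical units, two of which 
are MWE tokens, but only two types, one of which is an MWE. Counting over 
orthographic words instead (four of the five words fall inside an MWE) 
would give yet another figure, so the choice of unit matters as much as 
the choice between types and tokens.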

Cheers,
D
-- 
David Brooks
http://www.cs.bham.ac.uk/~djb