[Corpora-List] Incidence of MWEs
David Brooks
D.J.Brooks at cs.bham.ac.uk
Tue Mar 14 13:39:00 UTC 2006
Adam Kilgarriff wrote:
>>I was wondering if anyone has estimated the incidence of multi-word
>>expressions in language.
>
> Wonderful, enormous, bottomless question!
In a fit of ignorance I figured on reaping whatever information was
retrievable from the query. However, I should certainly be more specific.
I'm interested in the effect of MWEs on parser evaluation. Specifically,
I want to describe the problems they pose for grammar induction.
I presume that idiosyncratic MWEs are somehow treated differently to
compositional MWEs, in that the latter could easily be incorporated into
a treebank. Phrasal verbs also seem to be tagged in treebanks, but I'm
intrigued as to the treatment of phrases like "kick the bucket", and
perhaps more importantly: "at first", "of course", and other MWEs that
almost represent "stop-phrases" (as opposed to stop-words). Some of
these are syntactically valid, but does that mean they would be
annotated (in phrase-structure terms) in a compositional manner, or
would the idiomatic reading be preferred?
> * are you counting types or tokens? (Exercise: what is the proportion
> of multiwords in the mini-corpus comprising the single sentence, "Apple pie
> is apple pie." )
> * what sublanguages do you include - all, some, none? ("mid off" is a
> MWE for anyone who knows cricket but not for anyone who doesn't)
> * how much variation (morphological, syntactic, lexical, modifiers)
> can there be, with it still being the same MWE (or, an MWE at all)
> (Rosamund Moon's example, are "shake in one's shoes", "quake in one's boots"
> and "quake in one's Doc Marten's" all the same MWE?)
> * is non-compositionality a part of the definition?
> * are frequencies or statistics part of the definition? (Theorists
> might not want them to be, but without statistics and thresholds, you won't
> be able to compute a useful answer, and if you do use them, the answer you
> get will depend critically on which statistics and which thresholds you use
> so you had better make principled decisions about them)
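As a rough illustration of the type/token exercise above, here is a minimal
Python sketch. It assumes that "apple pie" is the only MWE in the lexicon,
that case and punctuation are ignored, and that the proportion is measured
over lexical units (an MWE counts as one unit) rather than over orthographic
words. The names segment, mwe_proportions and MWES are made up for the
example.

```python
import re

def segment(sentence, mwes):
    """Greedy longest-match segmentation of a sentence into lexical units.

    Each unit is a tuple of words; units longer than one word are MWEs
    taken from the (assumed) lexicon `mwes`.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    max_len = max((len(m) for m in mwes), default=1)
    units = []
    i = 0
    while i < len(words):
        for n in range(max_len, 1, -1):
            cand = tuple(words[i:i + n])
            if len(cand) == n and cand in mwes:
                units.append(cand)
                i += n
                break
        else:
            # No multiword match starts here: emit a single-word unit.
            units.append((words[i],))
            i += 1
    return units

def mwe_proportions(sentence, mwes):
    """Return (token_share, type_share) of multiword units."""
    units = segment(sentence, mwes)
    token_share = sum(len(u) > 1 for u in units) / len(units)
    types = set(units)
    type_share = sum(len(t) > 1 for t in types) / len(types)
    return token_share, type_share

# Hypothetical one-entry MWE lexicon for the mini-corpus.
MWES = {("apple", "pie")}
token_share, type_share = mwe_proportions("Apple pie is apple pie.", MWES)
print(token_share, type_share)
```

Under these assumptions the sentence has three units (apple pie, is, apple
pie), so the token share is 2/3 while the type share is 1/2. Counting over
orthographic words instead would give a different figure again, which is
the point of the exercise.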
In answer to those questions:
1) I'd count tokens;
2) I'd include all sublanguages (since they will presumably be annotated
correctly);
3) the notion of variation is presumably intrinsically linked with
non-compositionality;
4) non-compositionality is a requirement in my definition;
5) from an inductive standpoint, I assume that statistics are necessary
to identify these phrases in a corpus. I further assume that statistics
are used in parsing, so should also be used in MWE identification.
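
To make point 5 concrete, here is a minimal sketch of one possible
statistical filter: pointwise mutual information (PMI) over adjacent word
pairs, with a frequency cut-off and a PMI threshold. PMI is only one
candidate measure, chosen here for illustration; the function name
pmi_bigrams, the toy token sequence and both threshold values are
assumptions, not anything established in this thread.

```python
import math
from collections import Counter

def pmi_bigrams(tokens, min_count=2, threshold=1.0):
    """Score adjacent word pairs by PMI and keep those above a threshold.

    min_count and threshold are arbitrary illustrative defaults.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare for the estimate to mean much
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= threshold:
            scored[(w1, w2)] = pmi
    return scored

# Toy token sequence, invented for the example.
tokens = ("of course the parser failed at first but "
          "of course the parser recovered at first glance").split()

candidates = pmi_bigrams(tokens)
for pair, score in sorted(candidates.items()):
    print(pair, round(score, 2))
```

On this toy input the filter keeps "of course" and "at first", but also
"course the" and "the parser", and raising min_count to 3 discards
everything. So the candidate list depends heavily on the chosen
thresholds, and association scores alone do not establish
non-compositionality.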
Cheers,
D
--
David Brooks
http://www.cs.bham.ac.uk/~djb