[Corpora-List] Incidence of MWEs

Tue Mar 14 21:49:40 UTC 2006

Dear David,

If you are interested in MWEs in parsing, you can read the work done by
J. Nivre and J. Nilsson:

J. Nivre and J. Nilsson (2004) Multiword Units in Syntactic Parsing.
Workshop on Methodologies and Evaluation of Multiword Units in 
Real-world Applications (MEMURA Workshop) associated with the 4th
International Conference on Language Resources and Evaluation. Dias,
G., Lopes, J.G.L. & Vintar, S. (eds), Lisbon, Portugal. May 25. pp. 
17-24. ISBN: 2951740816. EAN: 0782951740815.

You can get it at 
http://memura2004.di.ubi.pt/main-memura-proceedings-vInternet.pdf

If you know French, a very good book on MWEs is:

G. Gross (1996) Les expressions figées en français. Ophrys. Paris.

Best,

Gaël.

David Brooks wrote:

> Adam Kilgarriff wrote:
> 
>>> I was wondering if anyone has estimated the incidence of multi-word 
>>> expressions in language. 
>>
>>
>> Wonderful, enormous, bottomless question!
> 
> 
> In a fit of ignorance I figured on reaping whatever information was 
> retrievable from the query. However, I should certainly be more specific.
> 
> I'm interested in the effect of MWEs on parser evaluation. Specifically, 
> I want to describe the problems they pose for grammar induction.
> 
> I presume that idiosyncratic MWEs are somehow treated differently to 
> compositional MWEs, in that the latter could easily be incorporated into 
> a treebank. Phrasal verbs also seem to be tagged in treebanks, but I'm 
> intrigued as to the treatment of phrases like "kick the bucket", and 
> perhaps more importantly: "at first", "of course", and other MWEs that 
> almost represent "stop-phrases" (as opposed to stop-words). Some of 
> these are syntactically valid, but does that mean they would be 
> annotated (in phrase-structure terms) in a compositional manner, or 
> would the idiomatic reading be preferred?
> 
>> *    are you counting types or tokens?  (Exercise: what is the
>>      proportion of multiwords in the mini-corpus comprising the single
>>      sentence, "Apple pie is apple pie."?)
>> *    what sublanguages do you include - all, some, none? ("mid off" is
>>      an MWE for anyone who knows cricket but not for anyone who doesn't)
>> *    how much variation (morphological, syntactic, lexical, modifiers)
>>      can there be, with it still being the same MWE (or, an MWE at all)?
>>      (Rosamund Moon's example: are "shake in one's shoes", "quake in
>>      one's boots" and "quake in one's Doc Marten's" all the same MWE?)
>> *    is non-compositionality a part of the definition?
>> *    are frequencies or statistics part of the definition? (Theorists
>>      might not want them to be, but without statistics and thresholds,
>>      you won't be able to compute a useful answer, and if you do use
>>      them, the answer you get will depend critically on which
>>      statistics and which thresholds you use, so you had better make
>>      principled decisions about them)
> 
> 
> In answer to those questions:
> 1) I'd count tokens;
> 2) I'd include all sublanguages (since they will presumably be annotated 
> correctly);
> 3) the notion of variation is presumably intrinsically linked with 
> non-compositionality;
> 4) non-compositionality is a requirement in my definition;
> 5) from an inductive standpoint, I assume that statistics are necessary 
> to identify these phrases in a corpus. I further assume that statistics 
> are used in parsing, so should also be used in MWE identification.
> 
> Cheers,
> D
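
The types-versus-tokens exercise quoted above ("Apple pie is apple pie.")
can be worked through with a small sketch. Everything here is an
assumption made for illustration: the one-entry lexicon, the case folding
(so that "Apple pie" and "apple pie" count as one type), the greedy
longest-match merging, and the helper names tokenize and mwe_proportions.
It is not a method proposed by anyone in this thread.

```python
import re

def tokenize(text, mwes):
    """Lowercase the text, split it into words, and greedily merge any
    known MWE (given as a tuple of words) into a single token."""
    words = re.findall(r"[a-z']+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        for mwe in mwes:  # expected longest first
            if tuple(words[i:i + len(mwe)]) == mwe:
                tokens.append(" ".join(mwe))
                i += len(mwe)
                break
        else:  # no MWE starts at position i
            tokens.append(words[i])
            i += 1
    return tokens

def mwe_proportions(text, lexicon):
    """Return (token share, type share) of MWEs in a non-empty text."""
    mwes = sorted((tuple(m.split()) for m in lexicon), key=len, reverse=True)
    tokens = tokenize(text, mwes)
    types = set(tokens)
    # A merged MWE token is the only kind that contains a space.
    token_share = sum(" " in t for t in tokens) / len(tokens)
    type_share = sum(" " in t for t in types) / len(types)
    return token_share, type_share

# Hypothetical lexicon containing the single MWE from the exercise.
token_share, type_share = mwe_proportions("Apple pie is apple pie.", {"apple pie"})
print(f"tokens: {token_share:.2f}, types: {type_share:.2f}")
```

Under these assumptions the sentence has three tokens ("apple pie", "is",
"apple pie"), two of which are MWEs, and two types, one of which is an
MWE. Without case folding there would be two MWE types out of three, which
is one way the answer depends on decisions made before counting.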

-- 
---------------------------------------------------------
Gaël Harry Dias, PhD            | Assistant Professor
Human Language Technology Group | Vice Chair of the Dept.
Computer Science Department     | [www.di.ubi.pt/~ddg]
Beira Interior University       | [ddg at di.ubi.pt]
6201-001 - Covilhã              | [Tel: +351 275 319 891]
PORTUGAL                        | [Fax: +351 275 319 899]
---------------------------------------------------------