[Corpora-List] Incidence of MWEs
Gaël Dias
ddg at di.ubi.pt
Tue Mar 14 21:49:40 UTC 2006
Dear David,
If you are interested in MWEs in parsing, you can read the work done by
J. Nivre and J. Nilsson:
J. Nivre and J. Nilsson (2004) Multiword Units in Syntactic Parsing.
Workshop on Methodologies and Evaluation of Multiword Units in
Real-world Applications (MEMURA Workshop) associated with the 4th
International Conference on Language Resources and Evaluation. Dias,
G., Lopes, J.G.L. & Vintar, S. (eds), Lisbon, Portugal. May 25. pp.
17-24. ISBN: 2951740816. EAN: 0782951740815.
You can get it at
http://memura2004.di.ubi.pt/main-memura-proceedings-vInternet.pdf
If you know French, a very good book on MWEs is:
G. Gross (1996) Les expressions figées en Français. Ophrys. Paris.
Best,
Gaël.
David Brooks wrote:
> Adam Kilgarriff wrote:
>
>>> I was wondering if anyone has estimated the incidence of multi-word
>>> expressions in language.
>>
>>
>> Wonderful, enormous, bottomless question!
>
>
> In a fit of ignorance I figured on reaping whatever information was
> retrievable from the query. However, I should certainly be more specific.
>
> I'm interested in the effect of MWEs on parser evaluation. Specifically,
> I want to describe the problems it poses for grammar induction.
>
> I presume that idiosyncratic MWEs are somehow treated differently to
> compositional MWEs, in that the latter could easily be incorporated into
> a treebank. Phrasal verbs also seem to be tagged in treebanks, but I'm
> intrigued as to the treatment of phrases like "kick the bucket", and
> perhaps more importantly: "at first", "of course", and other MWEs that
> almost represent "stop-phrases" (as opposed to stop-words). Some of
> these are syntactically valid, but does that mean they would be
> annotated (in phrase-structure terms) in a compositional manner, or
> would the idiomatic reading be preferred?
>
>> * are you counting types or tokens? (Exercise: what is the proportion
>>   of multiwords in the mini-corpus comprising the single sentence,
>>   "Apple pie is apple pie.")
>> * what sublanguages do you include - all, some, none? ("mid off" is a
>>   MWE for anyone who knows cricket but not for anyone who doesn't)
>> * how much variation (morphological, syntactic, lexical, modifiers)
>>   can there be, with it still being the same MWE (or, an MWE at all)
>>   (Rosamund Moon's example, are "shake in one's shoes", "quake in one's
>>   boots" and "quake in one's Doc Marten's" all the same MWE?)
>> * is non-compositionality a part of the definition?
>> * are frequencies or statistics part of the definition? (Theorists
>>   might not want them to be, but without statistics and thresholds, you
>>   won't be able to compute a useful answer, and if you do use them, the
>>   answer you get will depend critically on which statistics and which
>>   thresholds you use, so you had better make principled decisions
>>   about them)
>
>
> In answer to those questions:
> 1) I'd count tokens;
> 2) I'd include all sublanguages (since they will presumably be annotated
> correctly);
> 3) the notion of variation is presumably intrinsically linked with
> non-compositionality;
> 4) non-compositionality is a requirement in my definition;
> 5) from an inductive standpoint, I assume that statistics are necessary
> to identify these phrases in a corpus. I further assume that statistics
> are used in parsing, so should also be used in MWE identification.
>
> Cheers,
> D
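
To make the types-versus-tokens exercise quoted above concrete, here is a
minimal Python sketch. It is only an illustration under stated assumptions:
the MWE lexicon is supplied by hand, matching is case-insensitive and
longest-match-first, and the function name mwe_incidence and its result keys
are hypothetical. It shows that the one-sentence corpus "Apple pie is apple
pie." already gives three different answers depending on what is counted.

```python
import re
from fractions import Fraction


def mwe_incidence(text, mwes):
    """Return the proportion of MWEs in `text` under three counting schemes.

    Assumptions (not from the original discussion): `mwes` is a hand-made
    lexicon of space-separated expressions, matching ignores case, and the
    longest lexicon entry wins when several could match.
    """
    words = re.findall(r"[a-z']+", text.lower())
    lexicon = sorted(
        (tuple(m.lower().split()) for m in mwes), key=len, reverse=True
    )

    units = []  # (unit string, is_mwe) after grouping words into units
    i = 0
    while i < len(words):
        for entry in lexicon:
            if tuple(words[i:i + len(entry)]) == entry:
                units.append((" ".join(entry), True))
                i += len(entry)
                break
        else:
            # No lexicon entry starts here: the word is a unit on its own.
            units.append((words[i], False))
            i += 1

    if not units:
        zero = Fraction(0)
        return {"by_token": zero, "by_type": zero, "by_word": zero}

    mwe_units = [u for u, is_mwe in units if is_mwe]
    return {
        # MWE occurrences over all unit occurrences
        "by_token": Fraction(len(mwe_units), len(units)),
        # distinct MWEs over distinct units
        "by_type": Fraction(len(set(mwe_units)), len({u for u, _ in units})),
        # orthographic words covered by an MWE over all words
        "by_word": Fraction(sum(len(u.split()) for u in mwe_units), len(words)),
    }


result = mwe_incidence("Apple pie is apple pie.", ["apple pie"])
for scheme, value in result.items():
    print(scheme, value)
```

With this lexicon the sentence groups into three units ("apple pie", "is",
"apple pie"), so the proportion is 2/3 by token, 1/2 by type, and 4/5 if one
counts the orthographic words covered by an MWE.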
--
---------------------------------------------------------
Gaël Harry Dias, PhD | Assistant Professor
Human Language Technology Group | Vice Chair of the Dept.
Computer Science Department | [www.di.ubi.pt/~ddg]
Beira Interior University | [ddg at di.ubi.pt]
6201-001 - Covilhã | [Tel: +351 275 319 891]
PORTUGAL | [Fax: +351 275 319 899]
---------------------------------------------------------