[Corpora-List] Incidence of MWEs

Ken Litkowski ken at clres.com
Tue Mar 14 19:09:46 UTC 2006


This discussion has the earmarks of the theoretical, speculative, and 
the hyperstatistical.  I think a practical and newly available method 
can be used.

Lexicographers, and most notably, OUP, have compiled their lists of what 
they think constitute MWEs.  In the electronic XML version of the Oxford 
Dictionary of English (ODE), there is an NLP element that incorporates a 
quite thorough list of all the variants, including placeholders in 
phrases like "give [someone] a hard time".  James McCracken, in a fit of 
genius, has created an Online ODE and has a rudimentary disambiguation 
of all content (non-boring) words in the dictionary.  To do this, James 
first created an index of all the variants, "squeezing" the phrases 
together (e.g., "byandlarge").  As the first step in disambiguating, he 
searches for longest phrases, starting from 5 words and continuing down 
to 2 words, under the assumption that a phrasal reading is preferred to 
a compositional reading.  Upon walking through the Perl script that does 
this (in about a half hour for the entire dictionary), my first reaction 
after "wow" was to ask what proportion of the definitions consist of these 
phrases.  I haven't counted this yet, but it is simple, just requiring a 
couple of modifications in the script to make the necessary counts.  The 
Perl script also is written in such a way that the same subroutines can 
be applied to free text.  This is all available to interested 
researchers who would like to investigate these issues.  (And also, it's 
important to say that Adam Kilgarriff was a guiding spirit to James' 
initial forays.)
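
The longest-phrase-first lookup described above can be sketched roughly 
as follows.  This is a minimal illustration in Python, not James's Perl 
script: the function names, the tokenizer, and the sample phrases are 
all assumptions, the multi-pass reading of "5 words down to 2" is only 
one interpretation, and placeholder slots such as [someone] are not 
handled.

```python
import re

MAX_LEN = 5  # longest phrase length tried first, per the description
MIN_LEN = 2  # shortest phrase length tried


def squeeze(words):
    """Join words into one lowercase key: ['by', 'and', 'large'] -> 'byandlarge'."""
    return "".join(w.lower() for w in words)


def build_index(phrases):
    """Build a set of squeezed keys from a list of phrase variants (hypothetical input)."""
    return {squeeze(p.split()) for p in phrases}


def find_mwes(tokens, index):
    """Return sorted (start, length) pairs for phrases found in tokens.

    Assumed strategy: one pass per length, from MAX_LEN down to MIN_LEN,
    so a longer phrase claims its tokens before any shorter one can.
    """
    claimed = [False] * len(tokens)
    matches = []
    for n in range(MAX_LEN, MIN_LEN - 1, -1):
        i = 0
        while i + n <= len(tokens):
            free = not any(claimed[i:i + n])
            if free and squeeze(tokens[i:i + n]) in index:
                matches.append((i, n))
                for k in range(i, i + n):
                    claimed[k] = True
                i += n
            else:
                i += 1
    return sorted(matches)


def mwe_token_proportion(text, index):
    """Fraction of tokens in text that fall inside a matched phrase."""
    tokens = re.findall(r"[A-Za-z']+", text)  # crude tokenizer, an assumption
    if not tokens:
        return 0.0
    covered = sum(n for _, n in find_mwes(tokens, index))
    return covered / len(tokens)


if __name__ == "__main__":
    idx = build_index(["by and large", "give up", "hard time", "a hard time"])
    print(find_mwes("By and large he did not give up".split(), idx))
    print(mwe_token_proportion("By and large he did not give up", idx))
```

Running the counting function over each definition (or over free text) 
and summing would give the kind of proportion asked about.  Note that 
squeezed keys can in principle collide (two different word sequences 
that concatenate to the same string), which a real index would need to 
guard against.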

Based strictly on a casual perusal of the resulting Online ODE, I would 
say that a 2% figure is much more likely than a 30% (or even 70%) figure.

	Ken

David Brooks wrote:

> Dear Corpora-folk,
> 
> I was wondering if anyone has estimated the incidence of multi-word 
> expressions in language. I know that empirical estimates are tied to 
> particular corpora, but does anyone have an account of MWEs for 
> particular corpora, so that "ball-park" figures of the proportion of 
> MWEs can be estimated?
> 
> Better yet, can anyone give me a good reference for the incidence of MWEs?
> 
> Regards,
> David


-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA       Home Page: http://www.clres.com


