[Corpora-List] Incidence of MWEs
Ken Litkowski
ken at clres.com
Tue Mar 14 19:09:46 UTC 2006
This discussion has the earmarks of the theoretical, the speculative, and
the hyperstatistical. I think a practical and newly available method
can be used.
Lexicographers, and most notably, OUP, have compiled their lists of what
they think constitute MWEs. In the electronic XML version of the Oxford
Dictionary of English (ODE), there is an NLP element that incorporates a
quite thorough list of all the variants, including placeholders in
phrases like "give [someone] a hard time". James McCracken, in a fit of
genius, has created an Online ODE and has a rudimentary disambiguation
of all content (non-boring) words in the dictionary. To do this, James
first created an index of all the variants, "squeezing" the phrases
together (e.g., "byandlarge"). As the first step in disambiguating, he
searches for longest phrases, starting from 5 words and continuing down
to 2 words, under the assumption that a phrasal reading is preferred to
a compositional reading. Upon walking through the Perl script that does
this (in about a half hour for the entire dictionary), my first reaction
after "wow" was to ask what proportion of the definitions consists of these
phrases. I haven't counted this yet, but it is simple, requiring only a
couple of modifications to the script to make the necessary counts. The
Perl script also is written in such a way that the same subroutines can
be applied to free text. This is all available to interested
researchers who would like to investigate these issues. (And also, it's
important to say that Adam Kilgarriff was a guiding spirit to James'
initial forays.)
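
To make the procedure concrete, here is a minimal sketch in Python of the
longest-match lookup and the count I have in mind. It is not James' Perl
script. The function names, the tokenizer, and the three-phrase index are
made up for illustration, and placeholders such as "[someone]" are not
handled at all.

```python
import re

MAX_LEN = 5  # try the longest phrases first, as described above
MIN_LEN = 2  # stop at two-word phrases

def squeeze(words):
    """Join words into one lowercase key without spaces, e.g. 'byandlarge'."""
    return "".join(re.sub(r"[^a-z0-9]", "", w.lower()) for w in words)

def build_index(phrases):
    """Build the set of squeezed keys from a list of phrase variants."""
    return {squeeze(p.split()) for p in phrases}

def find_mwes(text, index):
    """Greedy left-to-right scan preferring the longest phrase at each position."""
    tokens = re.findall(r"[A-Za-z0-9'-]+", text)  # naive tokenizer (assumption)
    matches = []
    i = 0
    while i < len(tokens):
        # Window sizes run from 5 (or fewer near the end) down to 2.
        for n in range(min(MAX_LEN, len(tokens) - i), MIN_LEN - 1, -1):
            if squeeze(tokens[i:i + n]) in index:
                matches.append((i, tokens[i:i + n]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts here; treat the word compositionally
    return tokens, matches

def mwe_proportion(text, index):
    """Fraction of tokens that fall inside a matched phrase."""
    tokens, matches = find_mwes(text, index)
    covered = sum(len(words) for _, words in matches)
    return covered / len(tokens) if tokens else 0.0

# Hypothetical toy index; the real list would come from the ODE variants.
index = build_index(["by and large", "hard time", "give up"])
tokens, matches = find_mwes("By and large, they did not give up.", index)
```

Counting covered tokens is only one way to define the proportion. Counting
phrases against total lexical items would give a different figure, so
whichever definition is used should be stated along with the result.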
Based strictly on a casual perusal of the resulting Online ODE, I would
say that a 2% figure is much more likely than a 30% (or even 70%) figure.
Ken
David Brooks wrote:
> Dear Corpora-folk,
>
> I was wondering if anyone has estimated the incidence of multi-word
> expressions in language. I know that empirical estimates are tied to
> particular corpora, but does anyone have an account of MWEs for
> particular corpora, so that "ball-park" figures of the proportion of
> MWEs can be estimated?
>
> Better yet, can anyone give me a good reference for the incidence of MWEs?
>
> Regards,
> David
--
Ken Litkowski TEL.: 301-482-0237
CL Research EMAIL: ken at clres.com
9208 Gue Road
Damascus, MD 20872-1025 USA Home Page: http://www.clres.com