[Corpora-List] Incidence of MWEs

Wed Mar 15 15:11:21 UTC 2006

I have found published dictionaries' judgments as to what constitutes an
MWE to be both dated and biased against declaring MWEs to exist. Until I
actually went through a number of texts to extract MWEs by hand and
compared the MWEs I found against those listed in dictionaries, I used
to think that lexicographic coverage was adequate and that the rule "if
you can predict its meaning from its constituent parts, it doesn't need
a separate entry" was correct. What I found was that not only did the
rule not seem to be applied consistently, but MWEs appeared to be a much
neglected area of lexicography, with many more undocumented MWEs being
used in text than were in the dictionaries. It was as though
dictionaries reviewed their MWE entries far less often and less
diligently than they did their isolated word entries.
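
The comparison described above amounts to a set difference between the
MWEs extracted by hand and the MWEs a dictionary lists. The following is
only a minimal Python sketch of that check; the function names and the
sample lists are hypothetical and are not the texts or dictionaries
discussed in this post.

```python
def normalize(expression):
    """Lowercase and collapse whitespace so spelling variants compare equal."""
    return " ".join(expression.lower().split())


def undocumented_mwes(found_in_text, dictionary_entries):
    """Return the hand-extracted MWEs that have no dictionary entry."""
    listed = {normalize(entry) for entry in dictionary_entries}
    found = {normalize(mwe) for mwe in found_in_text}
    return sorted(found - listed)


def coverage(found_in_text, dictionary_entries):
    """Fraction of distinct hand-extracted MWEs that the dictionary lists."""
    found = {normalize(mwe) for mwe in found_in_text}
    if not found:
        return 0.0
    missing = undocumented_mwes(found_in_text, dictionary_entries)
    return 1 - len(missing) / len(found)


# Made-up sample data, for illustration only.
hand_extracted = ["pencil sharpener", "kick the bucket", "Pencil  Sharpener"]
dictionary = ["kick the bucket", "red herring"]

print(undocumented_mwes(hand_extracted, dictionary))  # ['pencil sharpener']
print(coverage(hand_extracted, dictionary))           # 0.5
```

A low coverage figure on real text would be the symptom described
above: MWEs in use that the dictionary never lists.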

There are probably good reasons against dictionary publishers declaring
MWEs to exist. Namely, MWEs greatly increase the size of a dictionary
for a small gain in clarity, perhaps only useful to Speakers of English
as a Foreign Language (and practitioners of computational linguistics,
information retrieval and artificial intelligence). The "prediction"
rule used to discount MWEs needing entries seems to beg the question of
what algorithm can make these predictions and what that algorithm
actually predicts.
There is a big difference between believing you are excluding MWEs
because they are understandable without definitions and having an
algorithm that can generate the definition you would have written from
the separate dictionary entries for the component words.

Take an MWE such as "pencil sharpener". Most dictionaries don't define
this since, according to the prediction rule, it could be assumed to be
just "a sharpener for pencils". However, that denies the fact that we
all know pencil sharpeners are a specific category of manufactured
product, and if you look for a photo of a pencil sharpener it will show
one of several distinct models. We also know details about how pencil
sharpeners work. In contrast, things like a "stick sharpener" or a
"crayon sharpener" are novel creations without long-standing precedent
(I just checked the web, and, sigh, they both exist, but a "stick
sharpener" isn't a tool for sharpening sticks, it is a knife sharpener
whose shape resembles a stick, i.e., a thin cylindrical file).

A definition of "pencil sharpener" would be something like "an
electrical, mechanical or manual device with sharpened blades into which
pencils can be inserted and which, when operated, creates a tapered,
conical tip on the pencil, which initializes or renews its ability to be
used as a writing implement".

Here is where I would say computational linguistics has to take its
leave of lexicography (or at least published lexicographic practice) and
declare "pencil sharpener" to be a useful and necessary MWE. I would
even go so far as to say that every MWE for which an explicit definition
can be written should have an explicit definition, and that ONLY when
the explicit definition adds nothing to what the component words already
say should it be eliminated in favor of entries for the separate word
elements. That is, REVERSE the "prediction" rule: assume you cannot
predict the meaning of an MWE until you fail to find anything to say in
its definition that is not formulaic.
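
The reversed rule can be sketched in Python under some loud assumptions:
the MWE is a two-word noun compound, the "formulaic" definition is the
template "a HEAD for MODIFIERs", and "not formulaic" is approximated by a
crude string comparison. None of this is an established algorithm; the
function names are hypothetical and the template is just the "a
sharpener for pencils" pattern from above.

```python
import string


def formulaic_definition(mwe):
    """Build the definition the 'prediction' rule implicitly assumes.

    Assumption: the MWE is exactly two words, modifier then head.
    """
    modifier, head = mwe.lower().split()
    return f"a {head} for {modifier}s"


def _normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def needs_own_entry(mwe, explicit_definition):
    """Reversed rule: keep the entry unless the explicit definition
    says nothing beyond the formulaic one (crude equality check)."""
    return _normalize(explicit_definition) != _normalize(formulaic_definition(mwe))


print(needs_own_entry("pencil sharpener", "A sharpener for pencils."))  # False
print(needs_own_entry(
    "pencil sharpener",
    "a device with sharpened blades into which pencils can be inserted",
))  # True
```

The point of the sketch is the default: an MWE keeps its entry unless the
written definition collapses to the template, rather than losing its
entry because someone assumes it would.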

I don't believe published dictionaries contain sufficient information to
correctly understand the MWEs they fail to explicitly list. I don't
believe published dictionaries actually think about MWEs consistently or
conscientiously.