[Corpora-List] Incidence of MWEs
Adam Kilgarriff
adam at lexmasterclass.com
Thu Mar 16 17:30:11 UTC 2006
Bob Amsler says:
> I have found published dictionaries' judgments as to what constitute MWEs
> to be both dated and biased against declaring MWEs to exist.
> ...
> Take an MWE such as "pencil sharpener". Most dictionaries don't ...
UK dictionaries on my shelf do list "pencil sharpener" (Oxford D of E 98,
LDOCE 95, Macmillan E D 02). US ones (Random House 1987, M-W online) don't.
Moral is clear.
US dictionaries are ***way, way*** behind UK dictionaries in corpus use. UK
dictionary publishers lead the world in corpus development and use (with NLP
lagging behind). OUP and Longman were prime movers in developing the BNC,
and OUP is now on the point of launching its billion-word corpus of English.
Collins-COBUILD was the great pioneer in the 1980s. Macmillan was the first
user of my very own word sketches (corpus analysis software).
That's all English: for German, Langenscheidt have been working with Uli
Heid's group at Univ Stuttgart to improve MWE coverage in their
dictionaries.
There are theoretical limitations to paper dictionaries - they cannot
usefully convey complex rules to their users. (To do so requires a
sophisticated metalanguage. Dictionary-user research is conclusive:
ordinary dictionary users don't read the manual. So there is no point
offering a sophisticated metalanguage. Worse, it confuses or scares.)
> I don't believe published dictionaries actually think about MWEs
> consistently or conscientiously.
Bob, I hope you don't believe it any longer!
Adam
PS - I have just been pointed to a recent and excellent thesis-length
treatment of the original question:
Begoña Villada Moirón, "Data-driven identification of fixed expressions and
their modifiability" http://odur.let.rug.nl/~begona/
-----Original Message-----
From: owner-corpora at lists.uib.no [mailto:owner-corpora at lists.uib.no] On
Behalf Of Amsler, Robert
Sent: 15 March 2006 15:11
To: Corpora List
Subject: RE: [Corpora-List] Incidence of MWEs
I have found published dictionaries' judgments as to what constitute MWEs
to be both dated and biased against declaring MWEs to exist. Until I
actually went through a number of texts to extract MWEs by hand and
compared the MWEs I found against those listed in dictionaries, I used
to think that lexicographic coverage was adequate and that the rule
"if you can predict its meaning from its constituent parts, it
doesn't need a separate entry" was correct. What I found was that not
only was the rule not applied consistently, but MWEs appeared to be a
much-neglected area of lexicography, with many more undocumented MWEs
being used in text than were in the dictionaries. It was as though
dictionaries reviewed their MWE entries far less often and less
diligently than they did their isolated word entries.
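The comparison described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function name, the normalization (lowercasing and collapsing
whitespace) and the sample lists are all invented here and are not the
data or procedure actually used.

```python
def dictionary_coverage(extracted_mwes, dictionary_headwords):
    """Return (coverage fraction, sorted MWEs missing from the dictionary)."""
    def normalize(s):
        # Assumption: case and spacing differences should not count as misses.
        return " ".join(s.lower().split())

    extracted = {normalize(m) for m in extracted_mwes}
    headwords = {normalize(h) for h in dictionary_headwords}
    missing = sorted(extracted - headwords)
    covered = len(extracted) - len(missing)
    coverage = covered / len(extracted) if extracted else 0.0
    return coverage, missing


# Hypothetical toy data, not real extraction results.
hand_extracted = ["pencil sharpener", "Stick sharpener",
                  "crayon  sharpener", "knife sharpener"]
headwords = ["pencil sharpener", "knife sharpener", "pencil"]

coverage, missing = dictionary_coverage(hand_extracted, headwords)
print(coverage, missing)
```

Exact string matching is the simplest possible choice; inflected or
hyphenated variants would need extra normalization.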
There are probably good reasons against dictionary publishers declaring
MWEs to exist. Namely, MWEs greatly increase the size of a dictionary
for a small gain in clarity, perhaps only useful to speakers of English
as a foreign language (and practitioners of computational linguistics,
information retrieval and artificial intelligence). The "prediction"
rule used to discount MWEs needing entries seems to beg the question of
what algorithm can predict these meanings and what that algorithm
actually predicts.
There is a big difference between believing you are excluding MWEs
because they are understandable without definitions and having an
algorithm that can generate the definition you would have written from
the separate dictionary entries for the component words.
Take an MWE such as "pencil sharpener". Most dictionaries don't define
this since according to the prediction rule, it could be assumed to be
just "a sharpener for pencils". However, that denies the fact that we
all know pencil sharpeners are a specific category of manufactured
product and if you look for a photo of a pencil sharpener it will have
one of several distinct models. We also know details about how pencil
sharpeners work. In contrast, things like a "stick sharpener" or a
"crayon sharpener" are novel creations without long-standing precedent.
(I just checked the web, and, sigh, they both exist, but a "stick
sharpener" isn't a tool for sharpening sticks; it is a knife sharpener
whose shape resembles a stick, i.e., a thin cylindrical file.)
A pencil sharpener would be something like "an electrical, mechanical or
manual device with sharpened blades into which pencils can be inserted
and which when operated creates a tapered conical pointed tip on the
pencil which initializes or renews its ability to be used as a writing
implement".
Here is where I would say computational linguistics has to take its
leave of lexicography (or at least published lexicographic practice) and
declare "pencil sharpener" to be a useful and necessary MWE. I would
even go so far as to say that every MWE for which an explicit definition
can be written, should have an explicit definition and that ONLY when
the explicit definitions show no differentiation should they be
eliminated in favor of entries for the separate word elements. That is,
REVERSE the "prediction" rule to assume you cannot predict the meaning
of an MWE until you fail to find anything to say in its definition that
is not formulaic.
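A toy sketch of the reversed rule follows, in Python. It is not a
worked-out method: the "a HEAD for MODIFIERs" gloss template, the
function-word list and all function names are assumptions made for
illustration, and it handles only two-word noun compounds. An MWE
keeps its own entry as long as its explicit definition contains words
that the formulaic gloss does not account for.

```python
import string

# Assumed small function-word list; a real system would need a fuller one.
FUNCTION_WORDS = {"a", "an", "the", "for", "of", "or", "and", "to", "with",
                  "which", "when", "into", "on", "its", "be", "can", "as"}


def formulaic_gloss(modifier, head):
    # Hypothetical template standing in for the "prediction" rule.
    return f"a {head} for {modifier}s"


def tokens(text):
    # Lowercase each word and strip surrounding punctuation.
    return [w.strip(string.punctuation).lower() for w in text.split()]


def non_formulaic_words(mwe, definition):
    """Words in the definition not predictable from the component words."""
    modifier, head = mwe.split()
    predictable = (set(tokens(formulaic_gloss(modifier, head)))
                   | {modifier, head} | FUNCTION_WORDS)
    return [w for w in tokens(definition) if w and w not in predictable]


def keep_as_entry(mwe, definition):
    # Reversed rule: drop the entry only when nothing non-formulaic remains.
    return bool(non_formulaic_words(mwe, definition))


print(keep_as_entry(
    "pencil sharpener",
    "a device with sharpened blades into which pencils can be inserted"))
print(keep_as_entry("crayon sharpener", "a sharpener for crayons"))
```

Word overlap is a crude proxy for "nothing to say that is not
formulaic"; the point is only that the default is to keep the entry.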
I don't believe published dictionaries contain sufficient information to
correctly understand the MWEs they fail to explicitly list. I don't
believe published dictionaries actually think about MWEs consistently or
conscientiously.