[Lexicog] Percentage of idioms vs single words

Wed Feb 4 14:00:35 UTC 2004

Dear Ron,

Very interesting observations! I hadn’t looked at it that way yet. You 
probably already know, but in case you don't, John Sinclair has very 
interesting considerations on collocations, multi-word items and the like 
in his ‘Corpus, Concordance, Collocation’. Not a very recent publication, 
but still very inspiring. I don't have the book here, but I think it's 
Chapter 8. In the dictionary I have been working on intermittently for the 
last 14 years (Portuguese-Spanish; one way; only translated examples) I 
reached the same conclusion, to the point that the headwords were actually 
reduced to 5000. Because it is also an electronic dictionary, I am spared 
having to think about how I would classify the multi-word items in a paper 
dictionary. (Here also 
Sinclair’s brief remarks are seminal.) The never-sufficiently-praised (as 
Don Quixote would say) Oxford English-Spanish Dictionary was a milestone in 
that respect. (There could be an earlier one...) But in this dictionary many 
of the words in one language are simply not translated but ‘used’ in a 
multi-word item, which is then translated (heedless: heedless OF sth: 
heedless of the danger, the regiment ... haciendo caso omiso del peligro, 
el regimiento ...). Maybe multi-word items are predominant because that’s where 
common words acquire their meaning.

As for Patrick Hanks’ remarks on corpora, I totally agree, but I have a few 
observations. Everything depends on what you aim at with the examples of 
your dictionary (since this is what the corpus is used for). I was a 
critical observer at Cobuild 2 for a year. I started out accepting what they 
called their ‘orthodoxy’ of accepting all corpus material as evidence. I 
ended up convinced that producing evidence on language is one thing, 
and writing a dictionary another. It’s difficult to make a good 
product if you’re at the same time trying to make a theoretical point.
I do think that, even allowing for some bias due to the almost exclusively 
written input of most corpora, corpora reflect the real usage of 
words, on the condition that you know how to evaluate the data, i.e., 
introspectively. A great number of assertions, which I thought were beyond 
attack, crumbled when I started revising my own Portuguese-Spanish 
dictionary, and when I started learning Japanese a few years ago. In other 
words, when I started to confront my theoretical convictions with my 
practice as a language teacher and learner. If you want to teach a learner 
a word using not-meddled-with corpus examples, you need a lot of them: say, 
twenty, depending on the case. And when you deal with a language pair from 
very different families -- Indo-European/Japanese -- you must forget about 
using ‘natural (corpus) examples’ and resort to ‘grammatical examples’, 
excluding exactly multi-word items.
The main question is ‘what do I want my dictionary to help with?’ If the 
answer is ‘describe the language’, it seems beyond any doubt that corpora 
can help, or even ‘are’ the dictionary. If the answer is ‘teach a 
language’, which is what learner’s dictionaries presumably aim at, then the 
answer is not so clear. Monolingual dictionaries were invented, très 
tardivement, with a mixture of political and scientific intentions. They 
are the treasure room of knowledge on language. Dictionaries with a 
concrete aim, namely foreign language dictionaries, have wholly different 
goals and have to take into account the wishes of the audience, including 
the unconscious ones. A vast topic.

Philippe Humblé
Universidade Federal de Santa Catarina (Brasil)

At 21:18 3/02/2004, you wrote:

One discovery (that has implications for us) was when I was trying to think
of English example words for each domain in my list of semantic domains. I
found that a high percentage were multi-word lexical items. In some domains
I quickly ran out of single word entries, but could think of lots of
phrases. This phenomenon was repeated in a couple of workshops for Bantu
languages. The speakers were generating about 25% phrases.

I presume (without a lot of data to back me up) that our dictionaries should
have a goodly percentage of multi-word entries. A quick scan of Longman's
Language Activator shows about 50% multi-word entries. Can anyone give
figures for their dictionaries? Has anyone worked at identifying/generating
multi-word lexical items in such a way that they can estimate the percentage
of idioms vs single words in a language? I realize that there is a gradation
from collocation to idiom, so that it may be difficult to draw a line.

Ron Moe
SIL, Uganda

-----Original Message-----
From: List Facilitator [mailto:lexicography2004 at yahoo.com]
Sent: Monday, February 02, 2004 10:11 PM
To: lexicographylist at yahoogroups.com
Subject: [Lexicog] Interesting lexical discoveries

What are one or two of the most interesting discoveries that stand out for
you (plural) in any of the lexical research that you have done?

Wayne Leman
Cheyenne dictionary project
