[Lexicog] Digest Number 66

Mery Martinelli mmyit at YAHOO.IT
Thu Mar 11 15:54:07 UTC 2004




Right. That is the answer I expected. However, the question becomes more complex.

1) To create a big general bilingual dictionary, should we start from the most frequent lemmas in the wordlist extracted from our corpus and then increase or decrease their number according to the space available in the dictionary, or should we start from the number of entries generally needed in a dictionary of that size and then extract that many of the most frequent lemmas from the corpus?

2) If we start from the most frequent corpus lemmas, how should we establish the minimum frequency a lemma needs in order to be selected?
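To make the two starting points concrete, here is a minimal Python sketch. It assumes we already have a table of lemma frequencies; the function names and the toy counts are invented for illustration and do not come from any real corpus.

```python
from collections import Counter

def select_top_n(freqs, n):
    """Size-first approach: keep the n most frequent lemmas."""
    return [lemma for lemma, _ in freqs.most_common(n)]

def select_by_threshold(freqs, min_freq):
    """Corpus-first approach: keep every lemma seen at least min_freq times."""
    return [lemma for lemma, count in freqs.most_common() if count >= min_freq]

# Toy lemma frequencies, invented for illustration only.
freqs = Counter({"casa": 120, "bello": 85, "andare": 60, "gatto": 19, "zaffiro": 2})

print(select_top_n(freqs, 3))          # fixed number of entries
print(select_by_threshold(freqs, 20))  # fixed minimum frequency
```

With the first function the size of the dictionary decides the cut-off; with the second the cut-off decides the size, which is why question 2 matters.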

In "Corpus and Text - Basic Principles" Sinclair writes that a frequency of two is the minimum for considering a linguistic item an independent unit of the language, but two occurrences are not sufficient to describe how a lemma is used. To describe the behaviour of a word we need at least 20 instances. But a lemma may represent more than one word form. For example, in Italian some adjectives have 4 different forms, and to describe them the corpus should provide at least 80 instances, 20 occurrences for each form. Moreover, if a lemma is polysemous, we should have 20 instances of each form of the lemma in each meaning.
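The arithmetic in the paragraph above can be written as a small Python sketch. The function name and its parameters are my own; the only figure taken from the text is the minimum of 20 instances per form and per meaning.

```python
MIN_INSTANCES = 20  # instances needed per form and per meaning, as stated above

def instances_needed(forms, senses=1, per_unit=MIN_INSTANCES):
    """Instances needed to describe one lemma: forms x senses x 20."""
    return forms * senses * per_unit

print(instances_needed(forms=4))            # Italian adjective with 4 forms: 80
print(instances_needed(forms=4, senses=3))  # same adjective with 3 meanings: 240
```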

If I am not mistaken, one section of a big general bilingual dictionary may contain around 75 000 entries. As homonyms take up separate entries, the number of distinct lemmas may be lower than 75 000, say 60 000.

If all lemmas were monosemous and had a single form, to extract and describe 60 000 lemmas we would need a corpus of at least 1 200 000 words (60 000 lemmas x 20 instances each).

In fact, to be representative, a corpus should contain more than 1 200 000 words, because languages (at least those I know) have many lemmas with more than one morphological form and more than one meaning.
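As a rough sketch of this lower bound in Python: the function name is hypothetical, and the average numbers of forms and meanings per lemma in the second call are assumptions chosen only to show the effect, not measured values.

```python
def min_corpus_tokens(lemmas, avg_forms=1.0, avg_senses=1.0, per_unit=20):
    """Lower bound on corpus size: lemmas x forms x senses x instances."""
    return round(lemmas * avg_forms * avg_senses * per_unit)

# Every lemma monosemous with a single form:
print(min_corpus_tokens(60_000))  # 1200000

# Assumed averages of 2 forms and 1.5 meanings per lemma (illustrative only):
print(min_corpus_tokens(60_000, avg_forms=2.0, avg_senses=1.5))  # 3600000
```

This is only a floor. It supposes that the instances are spread evenly over the lemmas, whereas in real text a few words are very frequent and most are rare, so a corpus would have to be much larger before the rarest of the 60 000 lemmas reaches 20 instances.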

Hence, how is it possible to establish the minimum size of a corpus in order to be sure that it is really representative?

Regards

Mery

Message: 4
Date: Wed, 10 Mar 2004 12:18:30 -0700
From: "Wayne Leman"
Subject: How to select words for a bilingual dictionary

Mery, I would try to practice corpus linguistics, using a computer to search
large corpora of natural text (newspapers, conversations, etc.) and then do
word counts (with the computer) to find the most commonly used words.
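A minimal Python sketch of such a word count, under simple assumptions: the text is already loaded as a string, a "word" is any run of letters, and the sample sentence is invented. It counts word forms, not lemmas; grouping forms under lemmas would need a separate lemmatisation step.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase word forms; a word is any run of letters."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens)

# Invented sample text standing in for a real corpus.
sample = "The cat sat. The cat ran, and the dog sat."
counts = word_counts(sample)

# Most frequent word forms first.
print(counts.most_common(3))
```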

Wayne Leman
Cheyenne dictionary project

> Dear all,
> in my MA thesis on bilingual lexicography I am describing the ways in
> which dictionary words can be selected. I know that it depends on the
> variety of language treated in the dictionary. Imagine that you had to
> select from your own language the words to treat in a big general language
> bilingual dictionary and those for a pocket one: how would you do it?
> Regards,
> Mery Martinelli
> SSLMIT, Bologna (Italy)







