[Corpora-List] Most common non-Romance, non-Germanic words in English

Thu Apr 10 00:24:11 UTC 2014

Hi, Erin,

Sorry about the brevity. I was referring to Harald Baayen's book _Word
Frequency Distributions_ (Dordrecht, Netherlands: Kluwer, 2001). I am
copying this to the list, in case others were similarly mystified by the
reference.

Jim

PS: btw, the book was quite understandable (esp. if you have a mathematical
background) and, I found, also quite provocative. (I have made earlier
comments [within a few years of its being published] on this list.)

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Tue, Apr 8, 2014 at 7:15 PM, Erin McKean <erin at wordnik.com> wrote:

> Dear Jim,
>
> I found this email fascinating, but I wonder if you might have a more
> explicit citation for Baayen 2001? Did you mean
>
> Krott, A., Schreuder, R. and Baayen, R.H. (2001) Analogy in morphology:
> modeling the choice of linking morphemes in Dutch, Linguistics 39, 51-93.
>
> It's the only paper of that date on his publications page:
> http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html
>
> Any direction gratefully received!
>
> Yours,
>
> Erin
>
>
> On 4/8/14 2:09 PM, Jim Fidelholtz wrote:
>
>> Hi, Tristan,
>>
>> A couple of suggestions and comments. First of all, the figures I have
>> seen, though not totally consistent, suggest that slightly over 60% (the
>> mode seems to be about 63 percent, which I usually round up to 'about
>> two-thirds') of English words (not counting Proper Nouns) are of Romance
>> origin. One assumes that the bulk of the remainder (about 37%) would
>> then be Germanic in origin. I can't remember ever seeing a discussion of
>> this part of English vocabulary, but we know that very many words come
>> from non-Romance languages, so I would guess that maybe less than 10% of
>> English words fit your criteria, at first blush. Nevertheless, you
>> specify that they do not have 'AN origin in any Romance or Germanic
>> language' (my emphasis), which is vague, to say the least. An example:
>> 'chocolate', which in my view surely would (or should) find its way into
>> your list, comes from Nahuatl via Spanish and other European languages,
>> according to the online etymological dictionary. In English, by
>> default we assume any borrowed word (especially food-related ones) would
>> come from French, which I believe is true in this case. So 'chocolate'
>> clearly has 'an' origin in French (Romance), although its ultimate
>> origin is Nahuatl and it should therefore be in your list.
>>
>> Likewise, you specify that you wish to find 'the most frequent [such]
>> words in English' which is also worse than vague, since theoretically
>> (George Bedell, MIT PhD thesis ca. 1969:
>> nationalizationalizationalize... -- and apparently practically as well:
>> see Baayen 2001) there are an infinite number of words in English; thus,
>> unless you specify a specific number of the most frequent non-Germanic
>> non-Romance English words, there will be an infinite number of them as
>> well (if you think I'm wrong in my count, just wait a few millennia!).
>>
>> The main point is, you need to specify your parameters more clearly
>> (even then, you will surely have a number of unclear or indeterminate
>> cases).
>>
>> If you are going to put an upper limit (say N) on the number of such
>> words, as a practical matter your quest should not be so difficult. Find
>> any huge list of English words. (Alternatively, take the largest English
>> corpus you can, e.g., by combining all of the English corpora [COCA,
>> etc.] on Mark Davies's BYU site, and make up a list in frequency order,
>> most frequent first.) Then eliminate all the obviously Romance words
>> (any word ending in -tion, -nce, etc.), and eliminate the obviously
>> Germanic ones in a similar way. Using prefixes from medical
>> dictionaries, eliminate all the words using them (almost all are from
>> Greek or Latin; Latin is Romance; the Greek ones almost all entered via
>> Latin, whether medieval university Latin or 'modern' Latin). This will
>> at least shorten your list a great deal. You should get an original
>> corpus from Davies of somewhere between 5 and 10 billion [American
>> sense] words. This might give you as many as 175,000,000 distinct word
>> forms, based on figures from a 5 million-word corpus (The American
>> Heritage corpus, 1971),
>> before your winnowing. Of course, the figure will be much less, since I
>> haven't taken into consideration the geometric decrease in the number of
>> different word forms in larger corpora. In any case, you can see that
>> you will have some work left to do. I don't want to minimize how much
>> work you would have to do, but I think you have already thought up a
>> number of ways to cut down on it and others will surely occur to you,
>> even if you don't find just the kind of corpora to help you that you are
>> looking for. In any case, good luck.
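
[Editorial sketch] The winnowing step described above can be illustrated in a few lines of Python. This is only a rough sketch under stated assumptions: the two suffixes are the ones named in the message, while the prefix list and the sample words are hypothetical placeholders rather than entries from any real medical dictionary or corpus.

```python
# Endings the message names as "obviously Romance".
ROMANCE_SUFFIXES = ("tion", "nce")

# Hypothetical stand-in for prefixes harvested from a medical dictionary.
MEDICAL_PREFIXES = ("cardio", "hemo", "neuro")


def winnow(words_by_freq, suffixes=ROMANCE_SUFFIXES, prefixes=MEDICAL_PREFIXES):
    """Drop words matching any listed suffix or prefix, preserving order."""
    return [
        w for w in words_by_freq
        if not w.endswith(suffixes) and not w.startswith(prefixes)
    ]


# Illustrative input, assumed to be already sorted by descending frequency.
sample = ["nation", "chocolate", "cardiology", "difference", "yoga"]
print(winnow(sample))
```

Such tests are purely orthographic, so they only shorten the list. For instance, 'once' ends in -nce although it is Germanic, so the surviving words would still need checking by hand or against a dictionary.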
>>
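
[Editorial sketch] The arithmetic behind the 175,000,000 figure, and the reason it overestimates, can be made concrete. The message's own numbers imply a per-5-million-word type count, which the sketch below recovers by division. The sublinear curve is Heaps' law, V(N) = K * N**beta, which is a technique swapped in here and not one the message names; the exponent 0.5 is an arbitrary illustrative assumption, not a measured value.

```python
BASE_TOKENS = 5_000_000          # size of the 1971 corpus cited in the message
TARGET_TOKENS = 10_000_000_000   # upper end of the 5-10 billion range
LINEAR_TYPES = 175_000_000       # the message's unadjusted figure

# Distinct word forms per 5 million tokens implied by the message's numbers.
base_types = LINEAR_TYPES / (TARGET_TOKENS / BASE_TOKENS)


def heaps_estimate(tokens, beta):
    """Estimate distinct word forms as K * tokens**beta.

    K is calibrated so the curve passes through the base corpus.
    beta = 1.0 reproduces the straight-line extrapolation.
    """
    k = base_types / BASE_TOKENS ** beta
    return k * tokens ** beta


print(base_types)
print(heaps_estimate(TARGET_TOKENS, 1.0))   # linear extrapolation
print(heaps_estimate(TARGET_TOKENS, 0.5))   # assumed sublinear growth
```

Any exponent below 1 gives a smaller estimate than the linear one, which is the qualitative point made in the message; the actual value would have to be fitted to the corpus in question.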
>> Jim
>>
>> James L. Fidelholtz
>> Posgrado en Ciencias del Lenguaje
>> Instituto de Ciencias Sociales y Humanidades
>> Benemérita Universidad Autónoma de Puebla, MÉXICO
>>
>>
>>
>> On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller
>> <miller at ukp.informatik.tu-darmstadt.de
>> <mailto:miller at ukp.informatik.tu-darmstadt.de>> wrote:
>>
>>     Dear all,
>>
>>     I'm interested in finding the most frequent words in English which do
>>     not have an origin in any Romance or Germanic language. Does anyone
>>     know if such a list is available anywhere?
>>
>>     If not, I suppose I could produce one myself easily enough by taking a
>>     raw frequency list (such as Adam Kilgarriff's BNC lemma counts),
>>     querying each entry in a machine-readable dictionary which provides
>>     etymological information, and filtering appropriately. But that
>>     presupposes that such a dictionary exists. Does anyone know of a
>>     suitable freely available dictionary for this task? Since I'd need to
>>     automatically query many thousands of words, I'd want something that I
>>     can download for offline use and access through an API. I could try
>>     accessing an offline dump of Wiktionary using the JWKTL API, though I
>>     suspect Wiktionary's etymological coverage is too spotty.
>>
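
[Editorial sketch] The pipeline proposed above (frequency list, etymology lookup, filter) might look roughly like this in Python. Everything here is a hypothetical illustration: the counts and the tiny etymology table are made-up toy data, not drawn from the BNC, Wiktionary, or JWKTL. Each word maps to a chain of language families with the ultimate origin last, which makes the ambiguity of 'an origin' explicit.

```python
# Toy frequency counts (hypothetical, for illustration only).
FREQ = {"the": 1000, "house": 55, "nation": 40,
        "chocolate": 12, "tomato": 9, "yoga": 3}

# Toy etymology table: borrowing chain, ultimate origin last (hypothetical).
ETYMOLOGY = {
    "the": ["Germanic"],
    "house": ["Germanic"],
    "nation": ["Romance"],
    "chocolate": ["Romance", "Uto-Aztecan"],  # Nahuatl via Spanish
    "tomato": ["Romance", "Uto-Aztecan"],     # Nahuatl via Spanish
    "yoga": ["Indo-Aryan"],
}

EXCLUDED = {"Romance", "Germanic"}


def top_words(freq, etym, n, ultimate_only=True):
    """Return the n most frequent words whose origin is not excluded.

    ultimate_only=True tests only the last link of the chain;
    ultimate_only=False rejects a word if any link is excluded.
    """
    kept = []
    for word, count in freq.items():
        chain = etym.get(word)
        if chain is None:
            continue  # no etymology recorded: skip rather than guess
        if ultimate_only:
            excluded = chain[-1] in EXCLUDED
        else:
            excluded = any(family in EXCLUDED for family in chain)
        if not excluded:
            kept.append((word, count))
    kept.sort(key=lambda wc: (-wc[1], wc[0]))  # by frequency, then alphabetically
    return [word for word, _ in kept[:n]]


print(top_words(FREQ, ETYMOLOGY, 3))
print(top_words(FREQ, ETYMOLOGY, 3, ultimate_only=False))
```

The two calls give different lists for the same data, which is why the choice between 'ultimate origin' and 'any origin' has to be settled before the filtering is run.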
>>     Regards,
>>     Tristan
>>
>>     --
>>     Tristan Miller, Research Scientist
>>     Ubiquitous Knowledge Processing Lab (UKP-TUDA)
>>     Department of Computer Science, Technische Universität Darmstadt
>>
>>     Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>>
>>
>>     _______________________________________________
>>     UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>     Corpora mailing list
>>     Corpora at uib.no <mailto:Corpora at uib.no>
>>     http://mailman.uib.no/listinfo/corpora
>>
>>
>>
>>
>>