<div dir="ltr">Hi, Erin,<div><br></div><div>Sorry about the brevity. I was referring to Harald Baayen's book _Word frequency distributions_ (2001), published by Kluwer in Dordrecht, Netherlands. I am copying this to the list in case others were similarly mystified by the reference.</div>

<div><br></div><div>Jim</div><div><br></div><div>PS: btw, the book was quite understandable (esp. if you have a mathematical background) and, I found, also quite provocative. (I made earlier comments on it on this list, within a few years of its being published.)</div>

</div><div class="gmail_extra"><br clear="all"><div>James L. Fidelholtz<br>Posgrado en Ciencias del Lenguaje<br>Instituto de Ciencias Sociales y Humanidades<br>Benemérita Universidad Autónoma de Puebla, MÉXICO</div>

<br><br><div class="gmail_quote">On Tue, Apr 8, 2014 at 7:15 PM, Erin McKean <span dir="ltr"><<a href="mailto:erin@wordnik.com" target="_blank">erin@wordnik.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Dear Jim,<br>

<br>

I found this email fascinating, but I wonder if you might have a more explicit citation for Baayen 2001? Did you mean<br>

<br>

Krott, A., Schreuder, R. and Baayen, R.H. (2001) Analogy in morphology: modeling the choice of linking morphemes in Dutch, Linguistics 39, 51-93.<br>

<br>

It's the only paper of that date on his publications page: <a href="http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html" target="_blank">http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html</a><br>


<br>

Any direction gratefully received!<br>

<br>

Yours,<br>

<br>

Erin<div><div class="h5"><br>

<br>

On 4/8/14 2:09 PM, Jim Fidelholtz wrote:<br>

</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">

Hi, Tristan,<br>

<br>

A couple of suggestions and comments. First of all, the figures I have<br>

seen, though not totally consistent, suggest that slightly over 60% (the<br>

mode seems to be about 63 percent, which I usually round up to 'about<br>

two-thirds') of English words (not counting Proper Nouns) are of Romance<br>

origin. One assumes that the bulk of the remainder (about 37%) would<br>

then be Germanic in origin. I can't remember ever seeing a discussion of<br>

this part of English vocabulary, but we know that very many words come<br>

from non-Romance languages, so I would guess that maybe less than 10% of<br>

English words fit your criteria, at first blush. Nevertheless, you<br>

specify that they do not have 'AN origin in any Romance or Germanic<br>

language' (my emphasis), which is vague, to say the least. An example:<br>

'chocolate', which according to me surely would find its way into your<br>

list, or should, comes from Nauatl, according to the online etymological<br>

dictionary via Spanish and other European languages. In English, by<br>

default we assume any borrowed word (especially food-related ones) would<br>

come from French, which I believe is true in this case. So 'chocolate'<br>

clearly has 'an' origin in French (Romance), although its ultimate<br>

origin is Nauatl and it should therefore be in your list.<br>

<br>

Likewise, you specify that you wish to find 'the most frequent [such]<br>

words in English' which is also worse than vague, since theoretically<br>

(George Bedell, MIT PhD thesis ca. 1969:<br>

nationalizationalizationalize... -- and apparently practically as well:<br>

see Baayen 2001) there are an infinite number of words in English; thus,<br>

unless you specify a specific number of the most frequent non-Germanic<br>

non-Romance English words, there will be an infinite number of them as<br>

well (if you think I'm wrong in my count, just wait a few millennia!).<br>

<br>

The main point is, you need to specify your parameters more clearly<br>

(even then, you will surely have a number of unclear or indeterminate<br>

cases).<br>

<br>

If you are going to put an upper limit (say N) on the number of such<br>

words, as a practical matter your quest should not be so difficult. Find<br>

any huge list of English words (alternative: take the largest English<br>

corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU<br>

site of Mark Davies), make up a list in frequency order, most frequent<br>

first, and then eliminate all the obviously Romance words [any word<br>

ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones<br>

in a similar way. Using prefixes from medical dictionaries, eliminate<br>

all the words using them (almost all are from Greek or Latin; Latin is<br>

Romance; the Greek ones almost all entered via Latin (medieval<br>

university Latin or 'modern' Latin)). This will at least shorten your<br>

list a great deal. You should get an original corpus from Davies of<br>

somewhere between 5 and 10 billion [American sense] words. This might<br>

give you as much as 175,000,000 distinct word forms, based on figures<br>

from a 5 million-word corpus (The American Heritage corpus, 1971),<br>

before your winnowing. Of course, the figure will be much less, since I<br>

haven't taken into consideration the geometric decrease in the number of<br>

different word forms in larger corpora. In any case, you can see that<br>

you will have some work left to do. I don't want to minimize how much<br>

work you would have to do, but I think you have already thought up a<br>

number of ways to cut down on it and others will surely occur to you,<br>

even if you don't find just the kind of corpora to help you that you are<br>

looking for. In any case, good luck.<br>
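<br>
The winnowing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested method: only the endings -tion and -nce come from the message, while the Germanic endings, the medical prefixes, and every function name are hypothetical placeholders. Surface tests of this kind will misclassify many words.<br>
<pre>
```python
from collections import Counter

# Deliberately short lists for illustration; real lists would be far longer.
ROMANCE_SUFFIXES = ("tion", "nce")                # the two endings named in the message
GERMANIC_SUFFIXES = ("ness", "ship", "dom")       # assumed examples, not from the message
CLASSICAL_PREFIXES = ("cardio", "hemo", "neuro")  # assumed medical prefixes


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(token.lower() for token in tokens)
    return counts.most_common()


def looks_romance_or_germanic(word):
    """Crude surface test based only on endings and prefixes."""
    return (
        word.endswith(ROMANCE_SUFFIXES)
        or word.endswith(GERMANIC_SUFFIXES)
        or word.startswith(CLASSICAL_PREFIXES)
    )


def winnow(tokens, n):
    """Keep the n most frequent words that survive the surface filters."""
    survivors = [
        (word, count)
        for word, count in frequency_list(tokens)
        if not looks_romance_or_germanic(word)
    ]
    return survivors[:n]
```
</pre>
Calling winnow on a tokenized corpus with an upper limit n returns at most n candidates in frequency order. Words such as 'the' still pass the filter, which shows how much manual checking would remain.<br>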

<br>

Jim<br>

<br>

James L. Fidelholtz<br>

Posgrado en Ciencias del Lenguaje<br>

Instituto de Ciencias Sociales y Humanidades<br></div></div>

Benemérita Universidad Autónoma de Puebla, MÉXICO<div class=""><br>

<br>

<br>

On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller<br>

<<a href="mailto:miller@ukp.informatik.tu-darmstadt.de" target="_blank">miller@ukp.informatik.tu-<u></u>darmstadt.de</a><br></div><div class="">

<mailto:<a href="mailto:miller@ukp.informatik.tu-darmstadt.de" target="_blank">miller@ukp.informatik.<u></u>tu-darmstadt.de</a>>> wrote:<br>

<br>

    Dear all,<br>

<br>

    I'm interested in finding the most frequent words in English which do<br>

    not have an origin in any Romance or Germanic language. Does anyone<br>

    know if such a list is available anywhere?<br>

<br>

    If not, I suppose I could produce one myself easily enough by taking a<br>

    raw frequency list (such as Adam Kilgarriff's BNC lemma counts),<br>

    querying each entry in a machine-readable dictionary which provides<br>

    etymological information, and filtering appropriately. But that<br>

    presupposes that such a dictionary exists. Does anyone know of a<br>

    suitable freely available dictionary for this task? Since I'd need to<br>

    automatically query many thousands of words, I'd want something that I<br>

    can download for offline use and access through an API. I could try<br>

    accessing an offline dump of Wiktionary using the JWKTL API, though I<br>

    suspect Wiktionary's etymological coverage is too spotty.<br>
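<br>
The pipeline described above (frequency list, etymology lookup, filter) might look roughly like this. It is only a sketch: the ETYMOLOGY and FAMILY tables are hypothetical toy stand-ins for a real machine-readable dictionary, their entries are simplified, and the function names are invented for illustration. The ultimate_only switch is an assumption added to show two possible readings of 'an origin'.<br>
<pre>
```python
# Hypothetical in-memory stand-in for an etymological dictionary.
# Each lemma maps to a simplified borrowing chain, ultimate source first.
ETYMOLOGY = {
    "the": ("Old English",),
    "nation": ("Latin", "Old French"),
    "chocolate": ("Nauatl", "Spanish"),
    "tea": ("Chinese", "Dutch"),
}

# Assumed language-to-family table; Latin is grouped with Romance for this task.
FAMILY = {
    "Old English": "Germanic",
    "Dutch": "Germanic",
    "Latin": "Romance",
    "Old French": "Romance",
    "Spanish": "Romance",
    "Nauatl": "Uto-Aztecan",
    "Chinese": "Sino-Tibetan",
}

EXCLUDED = {"Romance", "Germanic"}


def qualifies(lemma, ultimate_only=True):
    """True if the lemma counts as neither Romance nor Germanic.

    ultimate_only=True checks only the ultimate source language;
    ultimate_only=False rejects the lemma if any step in its chain is excluded.
    Languages missing from FAMILY are treated as not excluded.
    """
    chain = ETYMOLOGY.get(lemma)
    if chain is None:
        return False  # no etymology recorded: leave the word out
    steps = chain[:1] if ultimate_only else chain
    return all(FAMILY.get(language) not in EXCLUDED for language in steps)


def filter_frequency_list(frequency_list, ultimate_only=True):
    """Keep (lemma, count) pairs that qualify, preserving frequency order."""
    return [
        (lemma, count)
        for lemma, count in frequency_list
        if qualifies(lemma, ultimate_only)
    ]
```
</pre>
With these toy tables, a list containing 'the', 'nation', 'tea' and 'chocolate' keeps 'tea' and 'chocolate' when only the ultimate source is checked, and keeps nothing when every intermediate language is checked as well.<br>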

<br>

    Regards,<br>

    Tristan<br>

<br>

    --<br>

    Tristan Miller, Research Scientist<br>

    Ubiquitous Knowledge Processing Lab (UKP-TUDA)<br></div>

    Department of Computer Science, Technische Universität Darmstadt<div class=""><br>

    Tel: +49 6151 16 6166 | Web: <a href="http://www.ukp.tu-darmstadt.de/" target="_blank">http://www.ukp.tu-darmstadt.de/</a><br>

<br>

<br>

    _______________________________________________<br>

    UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>

    Corpora mailing list<br></div>

    <a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a> <mailto:<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a>><br>

    <a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><div class=""><br>

<br>

<br>

<br>

<br>


<br>

</div></blockquote>

</blockquote></div><br></div>