<div dir="ltr">Hi, Erin,<div><br></div><div>Sorry about the brevity. I was referring to Harald Baayen's book _Word frequency distributions_ (Dordrecht, Netherlands: Kluwer, 2001). I am copying this to the list in case others were similarly mystified by the reference.</div>
<div><br></div><div>Jim</div><div><br></div><div>PS: btw, the book was quite understandable (esp. if you have a mathematical background) and, I found, also quite provocative. (I have made earlier comments [within a few years of its being published] on this list.)</div>
</div><div class="gmail_extra"><br clear="all"><div>James L. Fidelholtz<br>Posgrado en Ciencias del Lenguaje<br>Instituto de Ciencias Sociales y Humanidades<br>Benemérita Universidad Autónoma de Puebla, MÉXICO</div>
<br><br><div class="gmail_quote">On Tue, Apr 8, 2014 at 7:15 PM, Erin McKean <span dir="ltr"><<a href="mailto:erin@wordnik.com" target="_blank">erin@wordnik.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Dear Jim,<br>
<br>
I found this email fascinating, but I wonder if you might have a more explicit citation for Baayen 2001? Did you mean<br>
<br>
Krott, A., Schreuder, R. and Baayen, R.H. (2001) Analogy in morphology: modeling the choice of linking morphemes in Dutch, Linguistics 39, 51-93.<br>
<br>
It's the only paper of that date on his publications page: <a href="http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html" target="_blank">http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html</a><br>
<br>
Any direction gratefully received!<br>
<br>
Yours,<br>
<br>
Erin<div><div class="h5"><br>
<br>
On 4/8/14 2:09 PM, Jim Fidelholtz wrote:<br>
</div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5">
Hi, Tristan,<br>
<br>
A couple of suggestions and comments. First of all, the figures I have<br>
seen, though not totally consistent, suggest that slightly over 60% (the<br>
mode seems to be about 63 percent, which I usually round up to 'about<br>
two-thirds') of English words (not counting Proper Nouns) are of Romance<br>
origin. One assumes that the bulk of the remainder (about 37%) would<br>
then be Germanic in origin. I can't remember ever seeing a discussion of<br>
this part of English vocabulary, but we know that very many words come<br>
from non-Romance languages, so I would guess that maybe less than 10% of<br>
English words fit your criteria, at first blush. Nevertheless, you<br>
specify that they do not have 'AN origin in any Romance or Germanic<br>
language' (my emphasis), which is vague, to say the least. An example:<br>
'chocolate', which in my view surely would find its way into your<br>
list, or should, comes from Nauatl via Spanish and other European<br>
languages, according to the online etymological dictionary. In English, by<br>
default we assume any borrowed word (especially food-related ones) would<br>
come from French, which I believe is true in this case. So 'chocolate'<br>
clearly has 'an' origin in French (Romance), although its ultimate<br>
origin is Nauatl and it should therefore be in your list.<br>
<br>
Likewise, you specify that you wish to find 'the most frequent [such]<br>
words in English' which is also worse than vague, since theoretically<br>
(George Bedell, MIT PhD thesis ca. 1969:<br>
nationalizationalizationalize... -- and apparently practically as well:<br>
see Baayen 2001) there are an infinite number of words in English; thus,<br>
unless you fix a specific number of the most frequent non-Germanic<br>
non-Romance English words, there will be an infinite number of them as<br>
well (if you think I'm wrong in my count, just wait a few millennia!).<br>
<br>
The main point is, you need to specify your parameters more clearly<br>
(even then, you will surely have a number of unclear or indeterminate<br>
cases).<br>
<br>
If you are going to put an upper limit (say N) on the number of such<br>
words, as a practical matter your quest should not be so difficult. Find<br>
any huge list of English words (alternative: take the largest English<br>
corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU<br>
site of Mark Davies, and make up a list in frequency order, most frequent<br>
first), and then eliminate all the obviously Romance words [any word<br>
ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones<br>
in a similar way. Using prefixes from medical dictionaries, eliminate<br>
all the words using them (almost all are from Greek or Latin; Latin is<br>
Romance; the Greek ones almost all entered via Latin [medieval<br>
university Latin or 'modern' Latin]). This will at least shorten your<br>
list a great deal. You should get an original corpus from Davies of<br>
somewhere between 5 and 10 billion [American sense] words. This might<br>
give you as many as 175,000,000 distinct word forms, based on figures<br>
from a 5 million-word corpus (The American Heritage corpus, 1971),<br>
before your winnowing. Of course, the figure will be much lower, since I<br>
haven't taken into consideration the geometric decrease in the number of<br>
different word forms in larger corpora. In any case, you can see that<br>
you will have some work left to do. I don't want to minimize how much<br>
work you would have to do, but I think you have already thought up a<br>
number of ways to cut down on it and others will surely occur to you,<br>
even if you don't find just the kind of corpora to help you that you are<br>
looking for. In any case, good luck.<br>
<br>
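The winnowing step described above could be sketched roughly as follows.<br>
This is only an illustration: the function names and the toy token list are<br>
invented here, and the suffix list contains just the two endings named in<br>
the message (-tion, -nce). It covers only the Romance-suffix pass; the<br>
Germanic and medical-prefix passes would need their own lists.<br>
<br>

```python
from collections import Counter

# Hypothetical suffix list: only the two endings named in the message.
ROMANCE_SUFFIXES = ("tion", "nce")


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(t.lower() for t in tokens)
    return counts.most_common()


def winnow(freq_list, suffixes=ROMANCE_SUFFIXES, n=None):
    """Drop words ending in any listed suffix; keep at most n survivors."""
    kept = [(w, c) for w, c in freq_list if not w.endswith(suffixes)]
    return kept if n is None else kept[:n]


# Toy input standing in for a real corpus.
tokens = "the nation had chocolate and the distance and chocolate".split()
candidates = winnow(frequency_list(tokens), n=3)
print(candidates)
```

<br>
A suffix test of this kind only shortens the list; the survivors would<br>
still need to be checked by hand or against a dictionary.<br>
<br>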
Jim<br>
<br>
James L. Fidelholtz<br>
Posgrado en Ciencias del Lenguaje<br>
Instituto de Ciencias Sociales y Humanidades<br></div></div>
Benemérita Universidad Autónoma de Puebla, MÉXICO<div class=""><br>
<br>
<br>
On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller<br>
<<a href="mailto:miller@ukp.informatik.tu-darmstadt.de" target="_blank">miller@ukp.informatik.tu-darmstadt.de</a>> wrote:<br></div><div class="">
<br>
Dear all,<br>
<br>
I'm interested in finding the most frequent words in English which do<br>
not have an origin in any Romance or Germanic language. Does anyone<br>
know if such a list is available anywhere?<br>
<br>
If not, I suppose I could produce one myself easily enough by taking a<br>
raw frequency list (such as Adam Kilgarriff's BNC lemma counts),<br>
querying each entry in a machine-readable dictionary which provides<br>
etymological information, and filtering appropriately. But that<br>
presupposes that such a dictionary exists. Does anyone know of a<br>
suitable freely available dictionary for this task? Since I'd need to<br>
automatically query many thousands of words, I'd want something that I<br>
can download for offline use and access through an API. I could try<br>
accessing an offline dump of Wiktionary using the JWKTL API, though I<br>
suspect Wiktionary's etymological coverage is too spotty.<br>
<br>
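A minimal sketch of the lookup-and-filter pipeline described above, under<br>
loud assumptions: the dictionary is a hypothetical in-memory stand-in (not<br>
JWKTL or any real API), the frequency counts are invented, and the<br>
language-to-family table is a toy. The ultimate_only flag shows how the<br>
result depends on whether 'origin' means the ultimate source or any<br>
language the word passed through.<br>
<br>

```python
# Toy frequency counts, invented for illustration only.
FREQUENCIES = {"house": 500, "nation": 300, "chocolate": 40}

# Hypothetical etymological dictionary: for each word, the chain of
# languages it passed through, ultimate source first.
ETYMOLOGY = {
    "house": ["Old English"],
    "nation": ["Latin", "Old French"],
    "chocolate": ["Nahuatl", "Spanish"],
}

# Toy mapping from language to family.
FAMILY = {
    "Old English": "Germanic",
    "Latin": "Romance",  # grouped with Romance here for filtering purposes
    "Old French": "Romance",
    "Spanish": "Romance",
    "Nahuatl": "Uto-Aztecan",
}

EXCLUDED = {"Romance", "Germanic"}


def qualifies(word, ultimate_only=True):
    """True if the word has no excluded-family stage in the part checked."""
    chain = ETYMOLOGY.get(word)
    if not chain:
        return False  # no etymology recorded: cannot decide, so leave out
    stages = chain[:1] if ultimate_only else chain
    return not any(FAMILY.get(lang) in EXCLUDED for lang in stages)


def ranked(ultimate_only=True):
    """Qualifying words, most frequent first."""
    words = [w for w in FREQUENCIES if qualifies(w, ultimate_only)]
    return sorted(words, key=FREQUENCIES.get, reverse=True)


print(ranked())                     # checks the ultimate source only
print(ranked(ultimate_only=False))  # checks every stage in the chain
```

<br>
With this toy data, 'chocolate' survives when only the ultimate source is<br>
checked and drops out when intermediate languages count as well.<br>
<br>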
Regards,<br>
Tristan<br>
<br>
--<br>
Tristan Miller, Research Scientist<br>
Ubiquitous Knowledge Processing Lab (UKP-TUDA)<br></div>
Department of Computer Science, Technische Universität Darmstadt<div class=""><br>
Tel: +49 6151 16 6166 | Web: <a href="http://www.ukp.tu-darmstadt.de/" target="_blank">http://www.ukp.tu-darmstadt.de/</a><br>
<br>
<br>
_______________________________________________<br>
UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank">http://mailman.uib.no/options/corpora</a><br>
Corpora mailing list<br></div>
<a href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>
<a href="http://mailman.uib.no/listinfo/corpora" target="_blank">http://mailman.uib.no/listinfo/corpora</a><div class=""><br>
<br>
<br>
</div></blockquote>
</blockquote></div><br></div>