[Corpora-List] Most common non-Romance, non-Germanic words in English

Thu Apr 10 14:09:22 UTC 2014

Dear Jim,

Thank you VERY much!

Yours,

Erin

On 4/9/14 5:24 PM, Jim Fidelholtz wrote:
> Hi, Erin,
>
> Sorry about the brevity. I was referring to Harald Baayen's book _Word
> Frequency Distributions_ (2001), published in Dordrecht, Netherlands, by
> Kluwer. I am copying this to the list, in case others were similarly
> mystified by the reference.
>
> Jim
>
> PS: btw, the book was quite understandable (esp. if you have a
> mathematical background) and, I found, also quite provocative. (I have
> made earlier comments [within a few years of its being published] on this
> list.)
>
> James L. Fidelholtz
> Posgrado en Ciencias del Lenguaje
> Instituto de Ciencias Sociales y Humanidades
> Benemérita Universidad Autónoma de Puebla, MÉXICO
>
>
> On Tue, Apr 8, 2014 at 7:15 PM, Erin McKean <erin at wordnik.com> wrote:
>
>     Dear Jim,
>
>     I found this email fascinating, but I wonder if you might have a
>     more explicit citation for Baayen 2001? Did you mean
>
>     Krott, A., Schreuder, R. and Baayen, R.H. (2001) Analogy in
>     morphology: modeling the choice of linking morphemes in Dutch,
>     Linguistics 39, 51-93.
>
>     It's the only paper of that date on his publications page:
>     http://www.sfs.uni-tuebingen.de/~hbaayen/publications.html
>
>     Any direction gratefully received!
>
>     Yours,
>
>     Erin
>
>
>     On 4/8/14 2:09 PM, Jim Fidelholtz wrote:
>
>         Hi, Tristan,
>
>         A couple of suggestions and comments. First of all, the figures I
>         have seen, though not totally consistent, suggest that slightly
>         over 60% (the mode seems to be about 63 percent, which I usually
>         round up to 'about two-thirds') of English words (not counting
>         Proper Nouns) are of Romance origin. One assumes that the bulk of
>         the remainder (about 37%) would then be Germanic in origin. I
>         can't remember ever seeing a discussion of this part of English
>         vocabulary, but we know that very many words come from
>         non-Romance languages, so I would guess that maybe less than 10%
>         of English words fit your criteria, at first blush. Nevertheless,
>         you specify that they do not have 'AN origin in any Romance or
>         Germanic language' (my emphasis), which is vague, to say the
>         least. An example: 'chocolate', which in my view surely would
>         find its way into your list, or should, comes from Nauatl,
>         according to the online etymological dictionary, via Spanish and
>         other European languages. In English, by default we assume any
>         borrowed word (especially food-related ones) would come from
>         French, which I believe is true in this case. So 'chocolate'
>         clearly has 'an' origin in French (Romance), although its
>         ultimate origin is Nauatl and it should therefore be in your
>         list.
>
>         Likewise, you specify that you wish to find 'the most frequent
>         [such] words in English', which is also worse than vague, since
>         theoretically (George Bedell, MIT PhD thesis ca. 1969:
>         nationalizationalizationalize...) and apparently practically as
>         well (see Baayen 2001) there are an infinite number of words in
>         English; thus, unless you specify a specific number of the most
>         frequent non-Germanic non-Romance English words, there will be an
>         infinite number of them as well (if you think I'm wrong in my
>         count, just wait a few millennia!).
>
>         The main point is, you need to specify your parameters more
>         clearly (even then, you will surely have a number of unclear or
>         indeterminate cases).
>
>         If you are going to put an upper limit (say N) on the number of
>         such words, as a practical matter your quest should not be so
>         difficult. Find any huge list of English words (alternative: take
>         the largest English corpus, e.g., combine all of the English
>         corpora [COCA, etc.] on the BYU site of Mark Davies), make up a
>         list in frequency order, most frequent first, and then eliminate
>         all the obviously Romance words (any word ending in -tion, -nce,
>         etc.); then eliminate the obviously Germanic ones in a similar
>         way. Using prefixes from medical dictionaries, eliminate all the
>         words using them (almost all are from Greek or Latin; Latin is
>         Romance; the Greek ones almost all entered via Latin [medieval
>         university Latin or 'modern' Latin]). This will at least shorten
>         your list a great deal. You should get an original corpus from
>         Davies of somewhere between 5 and 10 billion [American sense]
>         words. This might give you as many as 175,000,000 distinct word
>         forms, based on figures from a 5 million-word corpus (The
>         American Heritage corpus, 1971), before your winnowing. Of
>         course, the figure will be much less, since I haven't taken into
>         consideration the geometric decrease in the number of different
>         word forms in larger corpora. In any case, you can see that you
>         will have some work left to do. I don't want to minimize how much
>         work you would have to do, but I think you have already thought
>         up a number of ways to cut down on it and others will surely
>         occur to you, even if you don't find just the kind of corpora to
>         help you that you are looking for. In any case, good luck.
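
The winnowing procedure described above (frequency-ordered list, then removal of words with obviously Romance or classical affixes, then a cap of N) might be sketched roughly as follows in Python. This is only an illustration: the affix tuples, the function names, and the sample sentence are assumptions invented here, not anything taken from the BYU corpora or from a real affix inventory, and an affix test this crude will misclassify many words.

```python
from collections import Counter

# Hypothetical, deliberately tiny affix lists; a real filter would need many more.
ROMANCE_SUFFIXES = ("tion", "sion", "nce", "ment")
CLASSICAL_PREFIXES = ("cardio", "hyper", "hypo", "neuro", "osteo")


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common()


def looks_romance_or_classical(word):
    """Crude affix test; it misfires on Germanic words such as 'once' or 'since'."""
    return word.endswith(ROMANCE_SUFFIXES) or word.startswith(CLASSICAL_PREFIXES)


def candidates(tokens, n):
    """Top-n words by frequency that survive the affix filter."""
    kept = [(w, c) for w, c in frequency_list(tokens)
            if not looks_romance_or_classical(w)]
    return kept[:n]


if __name__ == "__main__":
    # Made-up sample text standing in for a real corpus.
    sample = "the nation ate chocolate and the station sold chocolate in silence".split()
    print(candidates(sample, 3))
```

The survivors of such a pass still include all the Germanic core vocabulary, so a second, Germanic-oriented pass and a good deal of manual checking would remain.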
>
>         Jim
>
>         James L. Fidelholtz
>         Posgrado en Ciencias del Lenguaje
>         Instituto de Ciencias Sociales y Humanidades
>         Benemérita Universidad Autónoma de Puebla, MÉXICO
>
>
>
>         On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller
>         <miller at ukp.informatik.tu-darmstadt.de> wrote:
>
>              Dear all,
>
>              I'm interested in finding the most frequent words in English
>              which do not have an origin in any Romance or Germanic
>              language. Does anyone know if such a list is available
>              anywhere?
>
>              If not, I suppose I could produce one myself easily enough by
>              taking a raw frequency list (such as Adam Kilgarriff's BNC
>              lemma counts), querying each entry in a machine-readable
>              dictionary which provides etymological information, and
>              filtering appropriately. But that presupposes that such a
>              dictionary exists. Does anyone know of a suitable freely
>              available dictionary for this task? Since I'd need to
>              automatically query many thousands of words, I'd want
>              something that I can download for offline use and access
>              through an API. I could try accessing an offline dump of
>              Wiktionary using the JWKTL API, though I suspect Wiktionary's
>              etymological coverage is too spotty.
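
A minimal sketch of the lookup-and-filter pipeline proposed above, in Python. It assumes an offline etymology table already exists as a plain dictionary; the table contents, the language-to-family mapping, the frequency counts, and all function names are hypothetical illustrations, not the output of JWKTL or any real dictionary dump. The `strict` flag shows how much the result depends on whether an intermediate Romance or Germanic stage disqualifies a word or only the ultimate origin counts.

```python
# Hypothetical etymology table: lemma -> chain of source languages,
# most recent first, ultimate origin last. Simplified for illustration.
ETYMOLOGY = {
    "chocolate": ["Spanish", "Nahuatl"],
    "house": ["Old English", "Proto-Germanic"],
    "nation": ["Old French", "Latin"],
    "tea": ["Dutch", "Min Chinese"],
}

# Hypothetical language -> family mapping. Latin is grouped with Romance
# here purely for the purposes of this filter.
FAMILY = {
    "Spanish": "Romance", "Old French": "Romance", "Latin": "Romance",
    "Old English": "Germanic", "Proto-Germanic": "Germanic", "Dutch": "Germanic",
    "Nahuatl": "Uto-Aztecan", "Min Chinese": "Sino-Tibetan",
}
EXCLUDED = {"Romance", "Germanic"}


def qualifies(lemma, strict):
    """True if the lemma counts as non-Romance, non-Germanic.

    strict=True: no stage of the chain may be Romance or Germanic.
    strict=False: only the ultimate origin (last element) is checked.
    """
    chain = ETYMOLOGY.get(lemma)
    if not chain:
        return False  # unknown etymology: cannot decide, so leave it out
    families = [FAMILY.get(lang) for lang in chain]
    if None in families:
        return False  # unclassified language: also undecidable
    if strict:
        return not any(f in EXCLUDED for f in families)
    return families[-1] not in EXCLUDED


def filter_frequency_list(freq_list, strict=False):
    """Keep (lemma, count) pairs that qualify, preserving frequency order."""
    return [(lemma, count) for lemma, count in freq_list if qualifies(lemma, strict)]


if __name__ == "__main__":
    # Made-up counts standing in for a real lemma frequency list.
    freq = [("house", 500), ("nation", 300), ("tea", 120),
            ("chocolate", 80), ("zyzzyva", 1)]
    print(filter_frequency_list(freq))               # ultimate origin only
    print(filter_frequency_list(freq, strict=True))  # no Romance/Germanic stage at all
```

Words missing from the table are dropped rather than guessed at, which is where spotty etymological coverage would show up in practice.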
>
>              Regards,
>              Tristan
>
>              --
>              Tristan Miller, Research Scientist
>              Ubiquitous Knowledge Processing Lab (UKP-TUDA)
>              Department of Computer Science, Technische Universität Darmstadt
>
>              Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>
>
>              _______________________________________________
>              UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>              Corpora mailing list
>              Corpora at uib.no
>              http://mailman.uib.no/listinfo/corpora
>
>
