[Corpora-List] Most common non-Romance, non-Germanic words in English

Jim Fidelholtz fidelholtz at gmail.com
Tue Apr 8 21:09:33 UTC 2014


Hi, Tristan,

A couple of suggestions and comments. First of all, the figures I have
seen, though not totally consistent, suggest that slightly over 60% (the
mode seems to be about 63 percent, which I usually round up to 'about
two-thirds') of English words (not counting proper nouns) are of Romance
origin. One might assume that the bulk of the remainder (about 37%) would
then be Germanic in origin. I can't remember ever seeing a discussion of
this part of English vocabulary, but we know that very many words come from
other languages as well, so I would guess, at first blush, that maybe less
than 10% of English words fit your criteria. Nevertheless, you specify that
they do not have 'AN origin in any Romance or Germanic language' (my
emphasis), which is vague, to say the least. An example: 'chocolate', which
in my view surely belongs in your list, comes from Nahuatl, according to
the online etymological dictionary, via Spanish and other European
languages. In English, by default we assume any borrowed word (especially
a food-related one) came from French, which I believe is true in this
case. So 'chocolate' clearly has 'an' origin in French (Romance), although
its ultimate origin is Nahuatl and it should therefore be in your list.
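
The two readings of 'an origin' can be made concrete with a minimal sketch.
Everything in it is an assumption made for illustration: the chains are
hand-written rather than taken from any dictionary, the language sets are
deliberately tiny, the function names are hypothetical, and Latin is lumped
in with Romance only because that is the convention used in this message.

```python
# Convention borrowed from the email: Latin is lumped in with Romance.
ROMANCE = {"Latin", "French", "Spanish", "Italian", "Portuguese"}
GERMANIC = {"Old English", "Dutch", "German", "Old Norse"}
EXCLUDED = ROMANCE | GERMANIC

# Hand-written, illustrative chains (NOT dictionary data):
# most recent source first, ultimate source last.
CHAINS = {
    "chocolate": ["French", "Spanish", "Nahuatl"],  # path as described above
    "nation": ["French", "Latin"],
    "tea": ["Dutch", "Min Nan Chinese"],
}

def no_excluded_origin(word):
    """Strict reading: no language anywhere in the chain is Romance or Germanic."""
    return not any(lang in EXCLUDED for lang in CHAINS[word])

def ultimate_origin_qualifies(word):
    """Loose reading: only the oldest known source must lie outside both families."""
    return CHAINS[word][-1] not in EXCLUDED

for w in CHAINS:
    print(w, no_excluded_origin(w), ultimate_origin_qualifies(w))
```

Under the strict reading 'chocolate' is excluded because of its French and
Spanish intermediaries; under the loose reading it qualifies. Whichever
reading is intended has to be fixed before any list can be produced.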

Likewise, you specify that you wish to find 'the most frequent [such] words
in English', which is worse than vague, since theoretically (George
Bedell, MIT PhD thesis ca. 1969: nationalizationalizationalize... -- and
apparently practically as well: see Baayen 2001) there are an infinite
number of words in English; thus, unless you fix the number of most
frequent non-Germanic, non-Romance English words you want, there will be an
infinite number of them as well (if you think I'm wrong in my count, just
wait a few millennia!).
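
A toy sketch of why the word list never closes: the generator below cycles
the suffixes -ize, -ation and -al on the stem 'national'. The rule is a
simplification assumed here for illustration, not Bedell's analysis, and
the name derive is hypothetical.

```python
from itertools import islice

def derive(stem="national"):
    """Yield ever-longer words by cycling -ize, -ation, -al (toy rule)."""
    word = stem
    while True:
        word = word + "ize"            # national -> nationalize
        yield word
        word = word[:-1] + "ation"     # nationalize -> nationalization
        yield word
        word = word + "al"             # nationalization -> nationalizational
        yield word

# The generator never terminates by itself, so take only a finite prefix.
words = list(islice(derive(), 7))
for w in words:
    print(w)
```

The seventh form produced is the 'nationalizationalizationalize' cited
above, and nothing stops the loop from going further, which is why a
cutoff N has to be chosen.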

The main point is, you need to specify your parameters more clearly (even
then, you will surely have a number of unclear or indeterminate cases).

If you are going to put an upper limit (say N) on the number of such words,
as a practical matter your quest should not be so difficult. Find any huge
list of English words (alternatively, take the largest English corpus you
can, e.g., combine all of the English corpora [COCA, etc.] on Mark Davies's
BYU site, and make up a list in frequency order, most frequent first). Then
eliminate all the obviously Romance words (any word ending in -tion, -nce,
etc.), and eliminate the obviously Germanic ones in a similar way. Using
prefixes from medical dictionaries, eliminate all the words using them
(almost all are from Greek or Latin; Latin counts as Romance here, and the
Greek ones almost all entered via Latin, whether medieval university Latin
or 'modern' Latin). This will at least shorten your list a great deal.

You should get an original corpus from Davies of somewhere between 5 and 10
billion [American sense] words. Scaling up linearly from a 5 million-word
corpus (the American Heritage corpus, 1971), this might give you as much as
175,000,000 distinct word forms before your winnowing. Of course, the real
figure will be much less, since I haven't taken into consideration that the
number of different word forms grows much more slowly than the corpus
itself. Even so, you can see that you will have some work left to do. I
don't want to minimize how much work that is, but I think you have already
thought up a number of ways to cut down on it and others will surely occur
to you, even if you don't find just the kind of corpora you are looking
for. In any case, good luck.
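
The winnowing described above can be sketched as follows. This is only a
rough illustration under stated assumptions: the suffix tuple holds just
the two endings named in this message, the prefix tuple is a hypothetical
stand-in for a real list drawn from a medical dictionary, and the Germanic
step is left out because no concrete cues for it are given here. The names
candidates, ROMANCE_SUFFIXES and CLASSICAL_PREFIXES are invented for the
example.

```python
import re
from collections import Counter

ROMANCE_SUFFIXES = ("tion", "nce")        # the two endings named in the text
CLASSICAL_PREFIXES = ("cardio", "hemo")   # hypothetical medical-prefix stand-ins

def candidates(text, n):
    """Return up to n (word, count) pairs, most frequent first,
    after dropping words that match the crude Romance/classical cues."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    kept = [
        (word, count)
        for word, count in counts.most_common()   # frequency order, descending
        if not word.endswith(ROMANCE_SUFFIXES)
        and not word.startswith(CLASSICAL_PREFIXES)
    ]
    return kept[:n]

sample = ("the nation ate chocolate and the cardiologist "
          "ate chocolate with patience")
print(candidates(sample, 3))
```

Cues like these only remove the obvious cases; common Germanic words such
as 'the' survive this pass, and whatever remains after all the filters
still has to be checked against an etymological source.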

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO


On Tue, Apr 8, 2014 at 7:56 AM, Tristan Miller <
miller at ukp.informatik.tu-darmstadt.de> wrote:

> Dear all,
>
> I'm interested in finding the most frequent words in English which do
> not have an origin in any Romance or Germanic language.  Does anyone
> know if such a list is available anywhere?
>
> If not, I suppose I could produce one myself easily enough by taking a
> raw frequency list (such as Adam Kilgarriff's BNC lemma counts),
> querying each entry in a machine-readable dictionary which provides
> etymological information, and filtering appropriately.  But that
> presupposes that such a dictionary exists.  Does anyone know of a
> suitable freely available dictionary for this task?  Since I'd need to
> automatically query many thousands of words, I'd want something that I
> can download for offline use and access through an API.  I could try
> accessing an offline dump of Wiktionary using the JWKTL API, though I
> suspect Wiktionary's etymological coverage is too spotty.
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Research Scientist
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>