[Corpora-List] Most common non-Romance, non-Germanic words in English

Thu Apr 10 01:22:31 UTC 2014

Hi, Tristan,

OK on the source language, although it might help if we knew what the
purpose of your list would be. I didn't have any illusions that you wanted
*all* such words, although you can see from other comments that there is a
large number of 'other' languages to pick your loanwords from, and I still
haven't seen any (even approximate) *number* of such borrowings. In any
case, you would have to take (starting with the most frequent words) a
rather large list of words to even begin processing, using the various
methods you and others have suggested for winnowing down the list.

Unless, for example, you are interested in doing a similar study for
borrowings into Spanish, I don't think you can really find *any* English
borrowings from Nauatl that did not come via Spanish. Your criterion would
therefore eliminate essentially all words from this language, a result
which I would find unfortunate for any research I can imagine on borrowings
into English from (various) languages, even after eliminating the ones you
want to eliminate. Btw, another problem you will find in this regard is
determining reasonably, for less frequent words (but still among the 'more
frequent', depending on how you define this), exactly from what language
each was taken. Indeed, there are often clues in the phonological
development of the word in English as to what language it must have come
from (i.e., via), but at least sometimes, for Nauatl, some speakers or
varieties of English may have been in contact with speakers of Nauatl, and
the word may have been borrowed independently from more than one language
and/or at different times. Multilingualism is very complex, on the one
hand, and etymologists are known to commit (and perpetrate) errors, on the
other hand. This, of course, includes folk etymology, which is rampant in,
e.g., place names, among other things.
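
To make the two readings of 'origin' concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration: the table
TOY_ETYMOLOGY, the function names, and the simplified borrowing chains
(earliest source first) are hypothetical and are not taken from any real
etymological dictionary. Only the chain for 'chocolate' follows what is
said in this thread.

```python
from collections import Counter

# Languages treated as Romance or Germanic in this sketch. Latin is counted
# as Romance, as the thread does. Illustrative, not exhaustive.
ROMANCE_OR_GERMANIC = {"Latin", "French", "Spanish", "Old English"}

# Hypothetical, simplified borrowing chains: earliest source first, the
# language English borrowed from last. Not a checked etymological resource.
TOY_ETYMOLOGY = {
    "house": ["Old English"],
    "nation": ["Latin", "French"],
    "chocolate": ["Nauatl", "Spanish"],
    "taboo": ["Tongan"],
}


def no_romance_germanic_stage(chain):
    """Strict reading: no stage of the chain is Romance or Germanic."""
    return not any(lang in ROMANCE_OR_GERMANIC for lang in chain)


def non_romance_germanic_ultimate_origin(chain):
    """Looser reading: only the earliest source must be non-Romance, non-Germanic."""
    return chain[0] not in ROMANCE_OR_GERMANIC


def ranked_loanwords(tokens, keep):
    """Count the tokens found in the table, then list the kept words by frequency."""
    counts = Counter(t for t in tokens if t in TOY_ETYMOLOGY)
    return [w for w, _ in counts.most_common() if keep(TOY_ETYMOLOGY[w])]


# A tiny made-up "corpus" with distinct counts, so the ordering is unambiguous.
tokens = ("house " * 4 + "nation " * 3 + "chocolate " * 2 + "taboo").split()
strict = ranked_loanwords(tokens, no_romance_germanic_stage)
ultimate = ranked_loanwords(tokens, non_romance_germanic_ultimate_origin)
print(strict)    # ['taboo']
print(ultimate)  # ['chocolate', 'taboo']
```

Under the strict reading 'chocolate' drops out because of its Spanish
stage; under the ultimate-origin reading it stays. A word borrowed more
than once, or by more than one route, would need several chains per entry,
which this sketch does not attempt.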

Also btw, the point of starting with an extremely large corpus (well over a
billion words) would be to try to minimize the effect which tends to
scramble words on the frequency list below, say, the first thousand, by a
fairly large number of positions. [Note: even below about rank 100 you will
find pretty large variation in the positions of words in the list each and
every time you redo a count with new (even comparable) data.] This is why
Carroll et al. gave so much weight to their measure of genre distribution:
among the very *last* words listed by frequency (after all but a few tens
of real hapax words, among the several tens of thousands of hapax) are a
few words which occur 2 or more times in the whole corpus, but only in the
genre [religion] with the fewest selections taken in forming the corpus.
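
The scrambling effect can be illustrated with a small simulation. This is
only a sketch under stated assumptions: the data are synthetic, drawn from
a Zipf-like distribution (weight 1/rank) that I am assuming purely for
illustration, and the helper names (zipf_sample, ranks, mean_rank_shift)
are hypothetical. It is not the measure used by Carroll et al.

```python
import random
from collections import Counter


def zipf_sample(rng, n_types, n_tokens):
    """Draw n_tokens tokens from n_types word types with weight 1/rank."""
    types = [f"w{r}" for r in range(1, n_types + 1)]
    weights = [1.0 / r for r in range(1, n_types + 1)]
    return rng.choices(types, weights=weights, k=n_tokens)


def ranks(tokens):
    """Map each observed word to its frequency rank (1 = most frequent)."""
    ordered = [w for w, _ in Counter(tokens).most_common()]
    return {w: i for i, w in enumerate(ordered, start=1)}


def mean_rank_shift(ranks_a, ranks_b, lo, hi):
    """Mean absolute change in rank, A to B, for words ranked lo..hi in A."""
    band = [w for w, r in ranks_a.items() if lo <= r <= hi]
    missing = len(ranks_b) + 1  # rank assigned to a word absent from B
    return sum(abs(ranks_a[w] - ranks_b.get(w, missing)) for w in band) / len(band)


rng = random.Random(0)  # fixed seed so the run is repeatable
a = ranks(zipf_sample(rng, 2000, 50_000))  # first "comparable" sample
b = ranks(zipf_sample(rng, 2000, 50_000))  # second sample, same distribution
top_shift = mean_rank_shift(a, b, 1, 10)
mid_shift = mean_rank_shift(a, b, 101, 200)
print(f"mean rank shift, ranks 1-10:    {top_shift:.2f}")
print(f"mean rank shift, ranks 101-200: {mid_shift:.2f}")
```

With samples this small the words ranked 101-200 move around far more than
the top ten do, because neighbouring counts differ by less than the
sampling noise. Increasing n_tokens pushes the unstable zone further down
the list, which is the reason for starting from a very large corpus.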

I'd be interested to hear more about your project, in any case.

Jim

James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

On Wed, Apr 9, 2014 at 11:30 AM, Tristan Miller <
miller at ukp.informatik.tu-darmstadt.de> wrote:

> Dear Jim,
>
> Thanks for your insightful remarks.  To address a few matters:
>
> On 08/04/14 11:09 PM, Jim Fidelholtz wrote:
> > Nevertheless, you
> > specify that they do not have 'AN origin in any Romance or Germanic
> > language' (my emphasis), which is vague, to say the least. An example:
> > 'chocolate', which according to me surely would find its way into your
> > list, or should, comes from Nauatl, according to the online etymological
> > dictionary via Spanish and other European languages. In English, by
> > default we assume any borrowed word (especially food-related ones) would
> > come from French, which I believe is true in this case. So 'chocolate'
> > clearly has 'an' origin in French (Romance), although its ultimate
> > origin is Nauatl and it should therefore be in your list.
>
> I don't think this part of my phrasing was vague, as you seem to have
> interpreted it correctly.  Yes, I mean to exclude words like "chocolate"
> which arrived in English via French.  If I had wanted them in my list, I
> might have written something like "words whose earliest post-PIE origin
> cannot be traced to a Germanic or Romance language".
>
> > Likewise, you specify that you wish to find 'the most frequent [such]
> > words in English' which is also worse than vague, since theoretically
> > (George Bedell, MIT PhD thesis ca. 1969:
> > nationalizationalizationalize... -- and apparently practically as well:
> > see Baayen 2001) there are an infinite number of words in English; thus,
> > unless you specify a specific number of the most frequent non-Germanic
> > non-Romance English words, there will be an infinite number of them as
> > well (if you think I'm wrong in my count, just wait a few millennia!).
>
> Well, I thought it would have gone without saying that I didn't want
> *all* such words -- after all, I made reference in my message to using
> existing corpora, which must be of finite size. :)
>
> > If you are going to put an upper limit (say N) on the number of such
> > words, as a practical matter your quest should not be so difficult. Find
> > any huge list of English words (alternative: take the largest English
> > corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU
> > site of Mark Davies), make up a list in frequency order, most frequent
> > first, and then eliminate all the obviously Romance words [any word
> > ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones
> > in a similar way. Using prefixes from medical dictionaries, eliminate
> > all the words using them (almost all are from Greek or Latin; Latin is
> > Romance; the Greek ones almost all entered via Latin (medieval
> > university Latin or 'modern' Latin)). This will at least shorten your
> > list a great deal. You should get an original corpus from Davies of
> > somewhere between 5 and 10 billion [American sense] words. This might
> > give you as much as 175,000,000 distinct word forms, based on figures
> > from a 5 million-word corpus (The American Heritage corpus, 1971),
> > before your winnowing.
>
> I think the more important limiting factor for the list is not the
> number of words in the corpus, but rather the number of words in the
> etymological MRD.  That is, assuming the frequency counts are already
> available, there's no need to heuristically exclude Romance and Germanic
> words (and indeed, I don't think I'd want to, as in my experience you
> get too many false positives).  In the first instance we can simply
> filter out all words which don't appear in the dictionary, and then look
> up the remainder.  In the worst case this will involve looking up every
> word in the dictionary once, which, if done automatically, can't take
> more than a few hours or days of computing time.  The problem is finding
> such a dictionary and an API therefor.  I've got an offline copy of the
> OED2, though I don't know if it's possible to query via API, or how easy
> it would be to parse the etymological information.
>
> Regards,
> Tristan
>
> --
> Tristan Miller, Research Scientist
> Ubiquitous Knowledge Processing Lab (UKP-TUDA)
> Department of Computer Science, Technische Universität Darmstadt
> Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>