[Corpora-List] Most common non-Romance, non-Germanic words in English

Wed Apr 9 16:30:07 UTC 2014

Dear Jim,

Thanks for your insightful remarks.  To address a few matters:

On 08/04/14 11:09 PM, Jim Fidelholtz wrote:
> Nevertheless, you
> specify that they do not have 'AN origin in any Romance or Germanic
> language' (my emphasis), which is vague, to say the least. An example:
> 'chocolate', which according to me surely would find its way into your
> list, or should, comes from Nahuatl, according to the online etymological
> dictionary via Spanish and other European languages. In English, by
> default we assume any borrowed word (especially food-related ones) would
> come from French, which I believe is true in this case. So 'chocolate'
> clearly has 'an' origin in French (Romance), although its ultimate
> origin is Nahuatl and it should therefore be in your list.

I don't think this part of my phrasing was vague, as you seem to have
interpreted it correctly.  Yes, I mean to exclude words like "chocolate"
which arrived in English via French.  If I had wanted them in my list, I
might have written something like "words whose earliest post-PIE origin
cannot be traced to a Germanic or Romance language".
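
To make the difference between the two readings concrete, here is a
minimal Python sketch.  The language sets and etymology chains are
hand-written assumptions for illustration (Latin is grouped with
Romance, as in the quoted message); they are not taken from any real
dictionary, and the function names are hypothetical.

```python
# Illustrative only: language sets and chains are hand-written assumptions,
# not output from any real etymological dictionary.
ROMANCE = {"Latin", "French", "Spanish", "Italian"}
GERMANIC = {"Old English", "Old Norse", "Dutch", "German"}
EXCLUDED = ROMANCE | GERMANIC

# Each chain runs from the earliest known source language to the
# language from which English borrowed the word directly.
CHAINS = {
    "chocolate": ["Nahuatl", "Spanish", "French"],
    "house": ["Old English"],
    "sauna": ["Finnish"],
}


def no_excluded_origin(chain):
    """Strict reading: no stage of the chain is Romance or Germanic."""
    return bool(chain) and not any(lang in EXCLUDED for lang in chain)


def earliest_origin_not_excluded(chain):
    """Looser reading: only the earliest stage of the chain is checked."""
    return bool(chain) and chain[0] not in EXCLUDED


for word, chain in CHAINS.items():
    print(word, no_excluded_origin(chain), earliest_origin_not_excluded(chain))
```

Under the strict reading "chocolate" is rejected because of its Spanish
and French stages; under the looser reading it is accepted because only
the Nahuatl stage is examined.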

> Likewise, you specify that you wish to find 'the most frequent [such]
> words in English' which is also worse than vague, since theoretically
> (George Bedell, MIT PhD thesis ca. 1969:
> nationalizationalizationalize... -- and apparently practically as well:
> see Baayen 2001) there are an infinite number of words in English; thus,
> unless you specify a specific number of the most frequent non-Germanic
> non-Romance English words, there will be an infinite number of them as
> well (if you think I'm wrong in my count, just wait a few millennia!).

Well, I thought it would have gone without saying that I didn't want
*all* such words -- after all, I made reference in my message to using
existing corpora, which must be of finite size. :)

> If you are going to put an upper limit (say N) on the number of such
> words, as a practical matter your quest should not be so difficult. Find
> any huge list of English words (alternative: take the largest English
> corpus, e.g., combine all of the English corpora [COCA, etc.] on the BYU
> site of Mark Davies, make up a list in frequency order, most frequent
> first), and then eliminate all the obviously Romance words [any word
> ending in -tion, -nce, etc.]; then eliminate the obviously Germanic ones
> in a similar way. Using prefixes from medical dictionaries, eliminate
> all the words using them (almost all are from Greek or Latin; Latin is
> Romance; the Greek ones almost all entered via Latin (medieval
> university Latin or 'modern' Latin)). This will at least shorten your
> list a great deal. You should get an original corpus from Davies of
> somewhere between 5 and 10 billion [American sense] words. This might
> give you as much as 175,000,000 distinct word forms, based on figures
> from a 5 million-word corpus (The American Heritage corpus, 1971),
> before your winnowing.

I think the more important limiting factor for the list is not the
number of words in the corpus, but rather the number of words in the
etymological machine-readable dictionary (MRD).  That is, assuming the
frequency counts are already
available, there's no need to heuristically exclude Romance and Germanic
words (and indeed, I don't think I'd want to, as in my experience you
get too many false positives).  In the first instance we can simply
filter out all words which don't appear in the dictionary, and then look
up the remainder.  In the worst case this will involve looking up every
word in the dictionary once, which, if done automatically, can't take
more than a few hours or days of computing time.  The problem is finding
such a dictionary and an API therefor.  I've got an offline copy of the
OED2, though I don't know if it's possible to query via API, or how easy
it would be to parse the etymological information.
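
The filter-then-look-up procedure described above could be sketched
roughly as follows.  This is only an illustration under stated
assumptions: the frequency counts and etymology chains are invented toy
data, the dictionary is modelled as a plain Python dict rather than any
real OED interface, and top_words is a hypothetical helper name.

```python
from collections import Counter

# Assumption: Latin is grouped with Romance, as in the quoted message.
EXCLUDED = {
    "Latin", "French", "Spanish", "Italian",        # Romance
    "Old English", "Old Norse", "Dutch", "German",  # Germanic
}


def top_words(freqs, etymologies, n):
    """Return up to n of the most frequent words with no excluded stage.

    freqs:       mapping from word to corpus frequency
    etymologies: mapping from word to its list of source languages
                 (a stand-in for a real etymological dictionary)
    """
    # Step 1: discard corpus words that the dictionary does not cover.
    known = [w for w in freqs if w in etymologies]
    # Step 2: look up the remainder; keep words whose chain is non-empty
    # and contains no Romance or Germanic language.
    kept = [w for w in known
            if etymologies[w] and not set(etymologies[w]) & EXCLUDED]
    # Step 3: rank by descending frequency, breaking ties alphabetically.
    kept.sort(key=lambda w: (-freqs[w], w))
    return kept[:n]


# Toy data, invented for illustration only.
FREQS = Counter({"the": 1000, "chocolate": 50, "taboo": 30,
                 "sauna": 20, "xyzzy": 5})
ETYMOLOGIES = {
    "the": ["Old English"],
    "chocolate": ["Nahuatl", "Spanish", "French"],
    "taboo": ["Tongan"],
    "sauna": ["Finnish"],
}

print(top_words(FREQS, ETYMOLOGIES, 10))
```

Because step 1 keeps only words the dictionary covers, the number of
look-ups is bounded by the size of the dictionary rather than by the
size of the corpus.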

Regards,
Tristan

-- 
Tristan Miller, Research Scientist
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
