[Corpora-List] Most common non-Romance, non-Germanic words in English

Tristan Miller miller at ukp.informatik.tu-darmstadt.de
Wed Jun 11 14:16:31 UTC 2014


Greetings.

On 08/04/14 02:56 PM, Tristan Miller wrote:
> I'm interested in finding the most frequent words in English which do
> not have an origin in any Romance or Germanic language.  Does anyone
> know if such a list is available anywhere?
> 
> If not, I suppose I could produce one myself easily enough by taking a
> raw frequency list (such as Adam Kilgarriff's BNC lemma counts),
> querying each entry in a machine-readable dictionary which provides
> etymological information, and filtering appropriately.  But that
> presupposes that such a dictionary exists.  Does anyone know of a
> suitable freely available dictionary for this task?  Since I'd need to
> automatically query many thousands of words, I'd want something that I
> can download for offline use and access through an API.  I could try
> accessing an offline dump of Wiktionary using the JWKTL API, though I
> suspect Wiktionary's etymological coverage is too spotty.

In case anyone else is interested in this thread I started a while
back, I'd like to report on another resource I discovered at LREC 2014.
Gerard de Melo has constructed a large, machine-readable resource of
etymological information extracted from Wiktionary.

De Melo's paper confirms my suspicion that Wiktionary's etymological
coverage is incomplete, though he found that a bigger problem was the
unstructured manner in which Wiktionary presents etymological
information.  He addressed this with various pattern-matching
techniques.  The resulting resource can be browsed online or
downloaded.  There's also a Java API to query the offline version.

The resource, Etymological Wordnet, and the paper describing it are
available at <http://www1.icsi.berkeley.edu/~demelo/etymwn/>.
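
The pipeline outlined in the quoted message (take a frequency list, look
up each entry's etymology, filter by origin) could be sketched roughly
as below.  This is only a sketch under assumptions: it supposes the
downloaded data consists of tab-separated triples of the form
"lang: word<TAB>rel:etymology<TAB>lang: word", which should be checked
against the actual download.  The function names, the set of language
codes treated as Romance or Germanic, and the sample triples are all
invented for illustration and are not taken from the resource.

```python
from collections import defaultdict

# Illustrative and incomplete set of language codes to exclude
# (an assumption; a real run would need a carefully curated list).
EXCLUDED = {"ang", "enm", "deu", "nld", "non", "lat", "fra", "fro", "ita", "spa"}


def parse_triples(lines):
    """Map each 'lang: word' node to the nodes it is said to derive from.

    Assumes one tab-separated triple per line; other relations are ignored.
    """
    parents = defaultdict(set)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed lines
        source, relation, target = parts
        if relation == "rel:etymology":
            parents[source].add(target)
    return parents


def ancestor_languages(node, parents):
    """Collect the language codes of all transitive ancestors of node."""
    seen, stack, langs = {node}, [node], set()
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
                langs.add(parent.split(":", 1)[0])
    return langs


def filter_by_origin(freq_list, parents, excluded=EXCLUDED):
    """Yield (word, count), most frequent first, for English words that
    have at least one known ancestor and none in an excluded language."""
    for word, count in sorted(freq_list, key=lambda wc: -wc[1]):
        langs = ancestor_languages(f"eng: {word}", parents)
        if langs and not (langs & excluded):
            yield word, count


# Toy data, invented for illustration only.
sample = [
    "eng: water\trel:etymology\tenm: water",
    "enm: water\trel:etymology\tang: waeter",
    "eng: very\trel:etymology\tfro: verai",
    "eng: sauna\trel:etymology\tfin: sauna",
]
freq = [("the", 1000), ("water", 100), ("very", 80), ("sauna", 5)]

parents = parse_triples(sample)
result = list(filter_by_origin(freq, parents))
print(result)
```

Words with no etymological data at all (here "the") are skipped rather
than counted as non-Romance and non-Germanic, since missing coverage
says nothing about a word's origin.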

Regards,
Tristan

-- 
Tristan Miller, Research Scientist
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/
