[Corpora-List] the ebb and flow of inclusion of words in OED?

Tue Apr 26 17:53:22 UTC 2011

Martin Mueller wrote:

>> A much better source, to pick up from John Sowa's
suggestion, would be the 30,000 EEBO texts that have been transcribed
and the 40,000 that will be transcribed over the next four years.  Do
lemmatization and morphosyntactic analysis for every word and think of the
combination of lemma and POS tag as an abstract entity whose orthographic
manifestations can be put on a time line.

This could be done quite easily with the 400-million-word Corpus of Historical American English (COHA, http://corpus.byu.edu/coha), which contains 100,000 texts from fiction, popular magazines, newspapers, and other non-fiction. On the back end (in the relational database), one could find, for example, all adjectives that occur at least three times in decade X (e.g. the 1920s) but do not occur in any of the preceding decades (e.g. 1810s-1910s), and repeat this for each of the 20 decades in the corpus, to see the number of "new words" per decade.
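As a rough sketch of what such a query might look like (this is not the actual COHA backend; the table layout, the column names, and the "adj" label are assumptions made up for illustration), here is a small Python example using an in-memory SQLite database:

```python
import sqlite3

# Hypothetical frequency rows: (lemma, pos, decade, count).
# The real COHA schema and CLAWS tag values will differ.
SAMPLE_ROWS = [
    ("old", "adj", 1810, 100),
    ("old", "adj", 1920, 80),
    ("faint", "adj", 1910, 1),   # seen once earlier, so not "new" in the 1920s
    ("faint", "adj", 1920, 4),
    ("groovy", "adj", 1920, 5),
    ("groovy", "adj", 1930, 9),
    ("snazzy", "adj", 1930, 3),
    ("rare", "adj", 1930, 2),    # below the three-occurrence threshold
    ("radio", "noun", 1920, 50), # wrong part of speech, ignored
]


def new_words_per_decade(rows, pos="adj", min_count=3):
    """Count lemmas of one POS that reach min_count in a decade
    and have no occurrence at all in any earlier decade."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE freq (lemma TEXT, pos TEXT, decade INTEGER, n INTEGER)"
    )
    con.executemany("INSERT INTO freq VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT cur.decade, COUNT(*) AS new_words
        FROM (
            -- lemmas that meet the frequency threshold in a given decade
            SELECT lemma, decade
            FROM freq
            WHERE pos = ?
            GROUP BY lemma, decade
            HAVING SUM(n) >= ?
        ) AS cur
        WHERE NOT EXISTS (
            -- any attestation of the same lemma in a preceding decade
            SELECT 1
            FROM freq AS prev
            WHERE prev.lemma = cur.lemma
              AND prev.pos = ?
              AND prev.decade < cur.decade
              AND prev.n > 0
        )
        GROUP BY cur.decade
        ORDER BY cur.decade
    """
    result = dict(con.execute(query, (pos, min_count, pos)).fetchall())
    con.close()
    return result


if __name__ == "__main__":
    for decade, count in new_words_per_decade(SAMPLE_ROWS).items():
        print(decade, count)
```

Note that in this sketch every qualifying word in the earliest decade counts as "new", since there is nothing before it to compare against, so the first decade's figure would need to be set aside.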

Compared to the OED, COHA has the advantages that 1) it is more than 10 times as large (400 million words vs. 37 million words in the OED "corpus" of 2.2 million quotations), and 2) it is tagged and lemmatized (using CLAWS). The downside of COHA is that it covers only 1810-2009.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================