[Corpora-List] USENET corpus

Mark Davies Mark_Davies at byu.edu
Tue Jun 17 09:48:56 UTC 2008


One other question...

As I understand it, this corpus contains USENET messages from 2005-2007. I'm looking for a USENET corpus that goes back to about 1990 (to complement the 360+ million words 1990-2007 from "traditional" genres in the Corpus of American English: http://www.americancorpus.org).

Using Google Groups, it would be possible to create a "balanced" corpus for USENET by defining the major "domains" (alt.*, rec.*, soc.*, comp.*, etc., including subdivisions of these) and then getting a certain number of words per year (or whatever) from each of these domains. As with the Alberta corpus, one would have to strip out or convert duplicates, quoted text, emails and URLs, etc., but it should be doable.
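
To make the cleanup step a bit more concrete, here is a minimal Python sketch of what "strip out or convert" might look like. It rests on assumptions of mine, not on how the Alberta corpus was actually processed: quoted lines start with ">", attribution lines end in "wrote:" or "writes:", a line consisting of "-- " starts the signature, and duplicates are exact matches after case and whitespace normalization. The function names and the <URL> / <EMAIL> placeholders are hypothetical.

```python
import hashlib
import re

# Assumed patterns; real USENET data will need more cases than these.
URL_RE = re.compile(r"(?:https?://|ftp://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
QUOTE_RE = re.compile(r"^\s*>")                              # quoted reply text
ATTRIB_RE = re.compile(r"^.*\b(?:wrote|writes)\s*:\s*$")     # "X wrote:" lines


def clean_message(body):
    """Drop quoted text and the signature; convert URLs and email addresses."""
    kept = []
    for line in body.splitlines():
        if line.rstrip() == "--":      # conventional signature delimiter "-- "
            break
        if QUOTE_RE.match(line) or ATTRIB_RE.match(line):
            continue
        line = URL_RE.sub("<URL>", line)
        line = EMAIL_RE.sub("<EMAIL>", line)
        kept.append(line)
    return "\n".join(kept).strip()


def deduplicate(messages):
    """Keep the first copy of each message, ignoring case and whitespace."""
    seen = set()
    unique = []
    for text in messages:
        normalized = " ".join(text.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


if __name__ == "__main__":
    raw = (
        "On Tue, Bob wrote:\n"
        "> old text\n"
        "New reply, see http://example.com/page or mail bob@example.com\n"
        "-- \n"
        "Bob"
    )
    print(clean_message(raw))
```

Exact-match deduplication will miss cross-posts that differ only in headers or trailing text, so near-duplicate detection would probably be needed for a real corpus.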

Is anyone aware of either of the following, then:

1. A USENET corpus that goes back 15-20 years (beyond what's available at Google Groups), or
2. A "balanced" USENET corpus, with a specified number of words per domain per time period?
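
To illustrate what I mean by "balanced" in (2), here is a rough Python sketch that groups messages into (domain, year) cells and draws whole messages at random until each cell reaches a word quota. Everything here is an assumption for illustration: the (newsgroup, year, text) input format, the hypothetical names top_level_domain and balanced_sample, whitespace-based word counts, and the use of only the top-level hierarchy as the "domain".

```python
import random
from collections import defaultdict


def top_level_domain(newsgroup):
    """Map e.g. 'rec.food.cooking' to 'rec.*' (top-level hierarchy only)."""
    return newsgroup.split(".", 1)[0] + ".*"


def balanced_sample(messages, words_per_cell, seed=0):
    """Sample whole messages per (domain, year) cell up to a word quota.

    messages: iterable of (newsgroup, year, text) tuples (assumed format).
    Returns a dict mapping (domain, year) to a list of sampled texts.
    """
    rng = random.Random(seed)          # fixed seed so the sample is repeatable
    cells = defaultdict(list)
    for newsgroup, year, text in messages:
        cells[(top_level_domain(newsgroup), year)].append(text)

    sample = {}
    for cell in sorted(cells):
        texts = cells[cell]
        rng.shuffle(texts)
        chosen, total = [], 0
        for text in texts:
            if total >= words_per_cell:
                break
            chosen.append(text)
            total += len(text.split())  # crude whitespace word count
        sample[cell] = chosen
    return sample


if __name__ == "__main__":
    demo = [
        ("rec.food.cooking", 1995, "a b c"),
        ("rec.music", 1995, "d e f"),
        ("comp.lang.c", 1995, "g h"),
        ("rec.food.cooking", 1996, "i"),
    ]
    for cell, texts in balanced_sample(demo, words_per_cell=3).items():
        print(cell, texts)
```

Because whole messages are kept, a cell can overshoot its quota by up to one message, and a cell with too little data simply comes out short; subdivisions of the domains would need a finer mapping than the first component of the group name.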

Thanks in advance.

Mark Davies

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


