Zero vs. "that" relatives (and TIME Corpus)

Tue Dec 30 18:50:18 UTC 2008

> Just curious, how many words is the TIME corpus?

100+ million words, 1920s-2000s.

Of course there are larger *text archives* (Google Books, NY Times, other newspapers, etc.). But all of these have very limited architectures and interfaces, which mostly just let you do things like:

-- find the first occurrence of a word
-- show all 18,489 occurrences of a word (one ... by ... one)
-- etc etc

None of those text archives can really do things like:

-- (easily) see the frequency over time (decade by decade, year by year)
-- search by part of speech or lemma (without this, they are pretty limited for syntactic change)
-- use wildcards to see all matching forms (without this, pretty limited for morphological change)
-- find collocates (without this, pretty limited for semantic change)
-- use the frequency in different historical periods as part of the query (e.g. collocates of Word X in Time Y vs Time Z)

The TIME Corpus can do all of these.
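
To make a few of those query types concrete, here is a minimal Python sketch. It is only an illustration of what such queries compute, not how the TIME Corpus itself is built: the toy DOCS data and the function names (tokenize, frequency_by_decade, collocates) are all hypothetical, the counts are raw rather than normalized per million words, and part-of-speech and lemma searches are left out because they would need a tagger.

```python
import re
from collections import Counter

# Hypothetical toy data: (year, text) pairs standing in for dated articles.
DOCS = [
    (1925, "The radio set was a marvel of the modern home"),
    (1928, "A new radio program drew a large audience"),
    (1965, "The television set replaced the radio in many homes"),
    (2004, "The web site drew a large online audience"),
]


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_by_decade(docs, pattern):
    """Count tokens that fully match a regex, grouped by decade.

    A pattern such as "home.*" plays the role of a wildcard search.
    """
    regex = re.compile(pattern)
    counts = Counter()
    for year, text in docs:
        decade = year // 10 * 10
        for token in tokenize(text):
            if regex.fullmatch(token):
                counts[decade] += 1
    return dict(counts)


def collocates(docs, node, first_year, last_year, window=2):
    """Count words within `window` tokens of `node`, limited to a year range."""
    counts = Counter()
    for year, text in docs:
        if not first_year <= year <= last_year:
            continue
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts
```

Calling frequency_by_decade(DOCS, "radio") gives the decade-by-decade counts, and comparing collocates(DOCS, "set", 1920, 1929) with collocates(DOCS, "set", 1960, 1969) is the "Word X in Time Y vs Time Z" kind of query, here on a four-sentence toy sample.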

Of course, it is just one source in just one genre -- hence the need for something like the Corpus of Historical American English.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

------------------------------------------------------------
The American Dialect Society - http://www.americandialect.org