18.1886, FYI: New 100+ Million Word Corpus of American English

LINGUIST Network linguist at LINGUISTLIST.ORG
Thu Jun 21 21:19:26 UTC 2007


LINGUIST List: Vol-18-1886. Thu Jun 21 2007. ISSN: 1068 - 4875.

Subject: 18.1886, FYI: New 100+ Million Word Corpus of American English

Moderators: Anthony Aristar, Eastern Michigan U <aristar at linguistlist.org>
            Helen Aristar-Dry, Eastern Michigan U <hdry at linguistlist.org>
 
Reviews: Laura Welcher, Rosetta Project  
       <reviews at linguistlist.org> 

Homepage: http://linguistlist.org/

The LINGUIST List is funded by Eastern Michigan University, 
and donations from subscribers and publishers.

Editor for this issue: Fatemeh Abdollahi <fatemeh at linguistlist.org>
================================================================  

To post to LINGUIST, use our convenient web form at
http://linguistlist.org/LL/posttolinguist.html.

===========================Directory==============================  

1)
Date: 20-Jun-2007
From: Mark Davies < mark_davies at byu.edu >
Subject: New 100+ Million Word Corpus of American English

 

	
-------------------------Message 1 ---------------------------------- 
Date: Thu, 21 Jun 2007 17:14:45
From: Mark Davies < mark_davies at byu.edu >
Subject: New 100+ Million Word Corpus of American English 
 


A new 100+ million word corpus of American English (1920s-2000s) is now
freely available at:

     http://corpus.byu.edu/time/

The corpus is based on more than 275,000 articles in TIME magazine from
1923 to 2006, and it contains articles on a wide range of topics - domestic
and international, sports, financial, cultural, entertainment, personal
interest, etc.

The architecture and interface are similar to those that we have created
for our version of the British National Corpus (see
http://corpus.byu.edu/bnc), and it allows users to:

-- Find the frequency of particular words, phrases, or substrings (prefixes,
suffixes, roots) in each decade from the 1920s to the 2000s. Users can also
limit the results by frequency in any set of years or decades, and see
charts that show the totals for all matching strings in each decade
(1920s-2000s), as well as in each year within a given decade.

-- Study changes in syntax since the 1920s. The corpus has been tagged for
part of speech with CLAWS (the same tagger used for the BNC), and users can
easily carry out searches such as the following (among many other
possibilities): changes in the overall frequency of "going + to + V" or
"end up V-ing", preposition stranding (e.g. "[VV*] with ."), or phrasal
verbs (1920s-1940s vs 1980s-2000s).

-- Look at changes in collocates to investigate semantic shifts during the
past 80 years. Users can find collocates up to 10 words to the left or
right of the node word, and sort and limit them by frequency in any set of
years or decades.

-- As mentioned, the interface is designed to permit easy comparisons
between different sets of decades or years. For example, with one simple
query users could find words ending in -dom that are much more frequent in
the 1920s-40s than in the 1980s-1990s, nouns occurring with "hard" in the
1940s-50s but not in the 1960s, adjectives that are more common in 2003-06
than in 2000-02, or phrasal verbs whose usage increases markedly after the
1950s.

-- Users can easily create customized lists (semantically related words,
specialized part-of-speech categories, morphologically related words, etc.),
and then use these lists directly as part of the query syntax.
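
For readers who want to try the same kinds of comparison on their own
dated texts, the ideas above (frequency per decade, comparison between
sets of years, and collocates within a window) can be sketched roughly in
Python as follows. This is only an illustration under stated assumptions,
not the corpus's actual implementation or query syntax: the toy data and
all names (DOCS, decade, per_million_by_decade, period_counts, collocates)
are hypothetical, and the texts are assumed to be already tokenized.

```python
from collections import Counter

# Hypothetical toy data: (year, list of tokens). A real corpus would hold
# many thousands of dated articles.
DOCS = [
    (1925, "the kingdom gave him freedom and boredom".split()),
    (1938, "freedom in the kingdom was hard work".split()),
    (1985, "freedom of the press is hard news".split()),
    (1994, "hard times and hard work".split()),
]


def decade(year):
    """Map a year to the first year of its decade, e.g. 1938 -> 1930."""
    return year - year % 10


def per_million_by_decade(docs, matches):
    """Return {decade: matching tokens per million tokens in that decade}."""
    hits, totals = Counter(), Counter()
    for year, tokens in docs:
        d = decade(year)
        totals[d] += len(tokens)
        hits[d] += sum(1 for t in tokens if matches(t))
    return {d: hits[d] * 1_000_000 / totals[d] for d in totals}


def period_counts(docs, matches, start, end):
    """Count each matching word form in documents dated start..end inclusive."""
    counts = Counter()
    for year, tokens in docs:
        if start <= year <= end:
            counts.update(t for t in tokens if matches(t))
    return counts


def collocates(docs, node, span=10):
    """Count words occurring within `span` tokens left or right of `node`."""
    counts = Counter()
    for _, tokens in docs:
        for i, t in enumerate(tokens):
            if t == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(window)
    return counts


if __name__ == "__main__":
    ends_in_dom = lambda t: t.endswith("dom")
    print(per_million_by_decade(DOCS, ends_in_dom))
    early = period_counts(DOCS, ends_in_dom, 1920, 1949)
    late = period_counts(DOCS, ends_in_dom, 1980, 1999)
    # Counter subtraction keeps only words with a positive difference,
    # i.e. forms counted more often in the early period than the late one.
    print(early - late)
    print(collocates(DOCS, "hard", span=2).most_common(3))
```

Normalizing to occurrences per million tokens matters because the decades
differ in size; raw counts alone would overstate the larger decades. The
period comparison shown here uses raw counts for brevity and would need the
same normalization on real data.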

----------

For more information, please contact Mark Davies
(http://davies-linguistics.byu.edu), or visit:

     http://corpus.byu.edu/

for information and links to related corpora, including the upcoming BYU
American National Corpus [BANC] (350+ million words, 1990-2007+).

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================ 



Linguistic Field(s): Historical Linguistics
                     Lexicography
                     Text/Corpus Linguistics


-----------------------------------------------------------
LINGUIST List: Vol-18-1886