[Corpora-List] Spoken English Corpus of 19th and 20th Century

Mark Davies Mark_Davies at byu.edu
Mon Dec 6 14:46:19 UTC 2010


 Ji Won,

>> Anybody have access to movie and/or TV scripts or plays written in 1850-1950s or know of the existence of such corpora?

The Corpus of *Historical* American English (COHA; http://corpus.byu.edu/coha) contains 400 million words from the 1810s to the 2000s. This includes 11.0 million words in plays (every decade, 1810s-2000s) and 6.2 million words from movie scripts (1930s-2000s). For complete information on the 100,000+ texts in the corpus (with summaries by genre, source, and decade), see http://corpus.byu.edu/coha/files/cohaTexts.xls (17MB Excel file).

One issue, though, is that while COHA searches can be limited to a macro-genre (e.g. fiction), it is not currently possible to limit them to a sub-genre (e.g. just plays or just movies), as one can do with the Corpus of *Contemporary* American English (COCA; 410 million words, 1990-2010; http://www.americancorpus.org). I guess I could change this, though, to allow such searches. Please let me know if this would be useful for you.

If you would prefer to create your own offline corpus, the materials are readily available. There are many sites that have movie scripts going back to the 1930s (e.g. http://www.simplyscripts.com/ or http://www.script-o-rama.com/) and radio scripts from the 1930s-1940s (e.g. http://www.genericradio.com/library.php). There are also many online archives of plays, such as the Library of Congress collection (1870s-1920s; http://lcweb2.loc.gov/ammem/vshtml/vseng.html). As for other "spoken" material, you might also look for oral history collections online. These are often a bit problematic, however, since many of the transcripts appear to have been overly cleaned up during transcription.

It should be quite easy to quickly create a small 3-5 million word corpus based on these materials, if something that small would be useful for your research (sometimes 400 million words is overkill).
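As a rough illustration of how one might check the size and decade coverage of such a home-made corpus, here is a minimal Python sketch. It rests on assumptions that are not part of the sites above: the scripts have already been saved as plain-text files in one local folder, and each file name begins with a four-digit year (a hypothetical naming convention, e.g. "1934_some_title.txt"). The tokenizer is deliberately crude, so the counts are only approximate.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption for this sketch: every file name starts with a four-digit year,
# e.g. "1934_some_title.txt". This convention is hypothetical.
YEAR_PREFIX = re.compile(r"^(\d{4})")

# Crude tokenizer: a run of letters, optionally followed by an apostrophe part
# (so "don't" counts as one word). Real corpora usually need something better.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def count_words(text):
    """Return the number of word tokens found by the crude regex tokenizer."""
    return len(WORD.findall(text))


def words_per_decade(corpus_dir):
    """Return a Counter mapping each decade (e.g. 1930) to its total word count."""
    totals = Counter()
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        match = YEAR_PREFIX.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming convention
        decade = int(match.group(1)) // 10 * 10
        text = path.read_text(encoding="utf-8", errors="replace")
        totals[decade] += count_words(text)
    return totals
```

Calling words_per_decade on the folder that holds the downloaded scripts and printing the sorted items gives a quick per-decade word count, which shows whether the 3-5 million word target has been reached and which decades are still thin.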

I hope this helps.

Mark D.

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: http://davies-linguistics.byu.edu
 
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================


From: corpora-bounces at uib.no [corpora-bounces at uib.no] On Behalf Of Ji Won LEE [jiwonlee at buffalo.edu]
Sent: Sunday, December 05, 2010 8:12 AM
To: corpora at uib.no
Subject: [Corpora-List] Spoken English Corpus of 19th and 20th Century


Hello All,
I am comparing the diachronic change of spoken and written English up to this point and was wondering if such a corpus exists.
 
I know recording is a pretty recent technology, so it won't be comparable to the Switchboard corpus, but
even movie/TV scripts from the late 19th or early 20th century would be most helpful.
 
Anybody have access to movie and/or TV scripts or plays written in 1850-1950s or know of the existence of such corpora?
 
Your help is greatly appreciated.
 
Thanks,
 JiWon LEE
 
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


