30.650, FYI: New Corpora: TV subtitles (325m) and Movies (200m)

The LINGUIST List linguist at listserv.linguistlist.org
Fri Feb 8 22:42:47 UTC 2019


LINGUIST List: Vol-30-650. Fri Feb 08 2019. ISSN: 1069 - 4875.

Subject: 30.650, FYI: New Corpora: TV subtitles (325m) and Movies (200m)

Moderator: Malgorzata E. Cavar (linguist at linguistlist.org)
Student Moderator: Jeremy Coburn
Managing Editor: Becca Morris
Team: Helen Aristar-Dry, Everett Green, Sarah Robinson, Peace Han, Nils Hjortnaes, Yiwen Zhang, Julian Dietrich
Jobs: jobs at linguistlist.org | Conferences: callconf at linguistlist.org | Pubs: pubs at linguistlist.org

Homepage: http://linguistlist.org

Please support the LL editors and operation with a donation at:
           https://funddrive.linguistlist.org/donate/

Editor for this issue: Everett Green <everett at linguistlist.org>
================================================================


Date: Fri, 08 Feb 2019 17:42:20
From: Mark Davies [mark_davies at byu.edu]
Subject: New Corpora: TV subtitles (325m) and Movies (200m)

 
We are pleased to announce two new corpora from the BYU suite of corpora:

The TV Corpus: 325 million words in 75,000 very informal TV episodes (e.g.
comedies and dramas) from 1950-2018
https://corpus.byu.edu/tv/

The Movie Corpus: 200 million words in 25,000 movies from 1930-2018
https://corpus.byu.edu/movies/

As psycholinguistic and corpus-based research by Brysbaert and others has
shown, TV and movie subtitles often agree better with native speaker
intuitions about common, informal English than actual spoken corpora do. And
while there are other corpora of subtitles, we believe that the BYU corpora
allow a much wider range of searches of these subtitles than is available
elsewhere.

As with the other BYU corpora, users can search by word, phrase, lemma, PoS,
synonym, and customized wordlists. They can see the frequency of matching
strings, the frequency in different sections of the corpora, collocates, and
re-sortable concordance lines.

The TV and Movie corpora also allow users to examine frequency and usage over
time (1930-2018 for movies, 1950-2018 for TV shows), and to compare different
dialects of English (for example British vs American English).

Users can also quickly and easily create and search their own "Virtual
Corpora" and generate keyword lists from them. Examples include (for TV) all
episodes of Dr Who, Star Trek Next Generation, The Office, or The Good Place,
and (for movies) all James Bond movies, or all American sci-fi movies from
1990 to the present that have a certain movie rating or IMDB score and a given
keyword in the IMDB plot summary.

Finally, all 75,000 episodes from TV shows and all 25,000 movies are linked
directly to their IMDB entry and OpenSubtitles page. As a result, if you find
some interesting data in the corpus and want to see the original subtitles
file or find out more about the TV show or movie (actors, rating, extended
plot summary, etc.), it's just one click away.

In summary, we believe that the new TV Corpus and Movie Corpus are the
largest, most searchable corpora of very informal English, and we hope that
they are of value to you in your research and teaching.

--
Brief overview (PDF) of the TV and Movie corpora
https://corpus.byu.edu/files/tv_movie_corpora.pdf
--

Also, we're glad to announce that "one click" comparisons in the BYU corpora
are back, allowing you to move seamlessly between the different BYU corpora
and compare results across them (e.g. TV, Movies, Soap Operas, iWeb, COCA,
COHA, GloWbE, BYU-BNC, NOW, Wikipedia, and others).

Mark Davies
BYU Corpora
 



Linguistic Field(s): Computational Linguistics
                     Historical Linguistics
                     Lexicography
                     Text/Corpus Linguistics

Subject Language(s): English (eng)





 








