[Corpora-List] What is corpora and what is not?

Trevor Jenkins trevor.jenkins at suneidesis.com
Thu Oct 4 08:31:18 UTC 2012


On 4 Oct 2012, at 01:18, Mark Davies <Mark_Davies at byu.edu> wrote:

>>> Even the newly appeared American Soap Operas corpus on Mark Davies site is still constructed and ultimately "high-brow".
> 
> Maybe, but see:
> 
> http://corpus2.byu.edu/soap/overview.asp (comparison to the spoken portion of the BNC; in many respects, much more colloquial than the BNC spoken)
> http://corpus2.byu.edu/soap/ (Soap Opera corpus itself; 100 million words)

I think you make my point in that overview with the observations about the costs of compiling spoken corpora.

In the description of American Soap Operas there are these claims…

> the theory that the dialogue in most TV shows and movies represents the spoken language pretty well.
and 
> we would suggest that subtitles from informal TV shows and movies does represent the informal, everyday language quite well -- especially soap operas. 

It's that theory I challenge. I don't believe that scripted and rehearsed productions represent spoken language very well at all. Soap operas are not *informal* at all, being as they are scripted productions created in highly formalised environments (writers' room, studio, post-production suite, etc). Soap operas are no more informal than presidential candidate debates are unscripted. Soaps' *settings* may be informal --- homes, offices, schools --- but the language is every bit as formal as that in the printed material collected in other corpora.

It's a 2-space issue: formal/informal x language/setting. Setting is no guide to language. Sadly, many confuse the informal nature of a setting with the informal use of language; I encounter this with teachers of (British) sign language, where an informal /setting/ is often labeled as informal language. The problem is more likely to be an N-space one, as one has to account for L1/L2 interaction, varying ages of participants, differing status of participants (parent/child, employer/employee, teacher/pupil, lecturer/student) and many other factors, not least the Humpty Dumpty effect.

The credits of all 10 soap exemplars you use include the telling phrase "written by". While the language may reflect some current informal phrase usage, none of the content is a primary source. We also have intrusion: the British sketch shows Little Britain and the Catherine Tate Show pretty much created new phrases. Little Britain's character Vicky Pollard with her "yeah but no but yeah" marker and Catherine Tate with her "am I bovered" comment were taken up with great alacrity by school children and young adults.

There is a British sitcom called Outnumbered that has some improvised dialogue because of the ages of some of the cast (when the series started, the oldest child cast member was 11, the youngest 5 or 6, and the third 7 or 8). Their contributions were guided in rehearsal but not explicitly scripted, so their specific lexical choices represent the social class the children were born into. But that is still somewhat high-brow, as the parents are porn film producers, sports reporters and actors. Even with the freedom to improvise, the language reflects a specific class (A/B1) rather than being representative of "the 99%".

A corpus of transcripts of confrontational shows such as those of Jerry Springer, perhaps Ricki Lake, or the British Jeremy Kyle, in which more vernacular language is included (although often censored by the sound department), might meet those two claims (good representation of normal spoken English and informal usage). Or, possibly better, transcripts of the unredacted 24-hour live feeds of reality shows like Big Brother. But there's still a selection process involved, which skews the language used.

Now the irony is that until such time as a large-scale corpus of truly informal, unrehearsed, unscripted utterances exists, we won't be able to make any comparisons with the lexical choices and grammatical constructions of normal language.

Regards, Trevor.

<>< Re: deemed!
