[Corpora-List] What is corpora and what is not?

Trevor Jenkins trevor.jenkins at suneidesis.com
Thu Oct 4 10:17:20 UTC 2012


:-)

It's almost lunch time for me so I'll offer this metaphor.

I liken corpora content to cuisines. Corpora compilers want Michelin-starred restaurant cuisine for themselves but, recognising that this is not what everyone actually eats, they instead prepare meals from the recipe books of celebrity chefs (e.g. Allegra McEvedy, Jamie Oliver, Gordon Ramsay, Yotam Ottolenghi) or from up-market home cooks (such as the UK's doyenne of home cookery Mary Berry, whose current book just happens to be on my desk as I plan this evening's family meal). But this ignores the fact that the cuisine of choice of the group being studied is a Big Mac with cheese, up-sized fries, and a super-sized Coke.

On 4 Oct 2012, at 10:38, Piotr Pezik <pezik at uni.lodz.pl> wrote:
> Having been involved in the process of acquiring both conversational and "on-air" spoken language data for the National Corpus of Polish (NKJP), I'd have to strongly agree with Trevor's remarks.
> 
> I think the American Soap Operas Corpus, although a very valuable resource in its own right, represents written-to-be-spoken rather than spoken language. Soap opera scripts are essentially their authors' impressions of casual spoken language, not that much different from the linguistically realistic dialogues you might find in a novel or a play. They often are an accurate reflection of (a particular breed of) spoken language and sometimes they are even an exaggerated impression, which is why you might find them to be more spoken than the conversational part of the BNC (the plus-catholique-que-le-pape effect), but they're not the real thing simply because they are written and edited and not produced with the real-time constraints of casual spoken discourse.
> 
> Live TV shows are closer to casual spoken discourse, although still very different, if you consider their pragmatic discourse structure among other dimensions of comparison. For example, it is fairly obvious that while speaking to anyone in the studio, politicians and celebrities generally tend to "communicate" to their viewers/voters. On-air spoken language is different from what you get when the cameras and microphones are switched off.
> 
> Regards, 
> 
> Piotr 
> 
> 
> On Thu, Oct 4, 2012 at 10:31 AM, Trevor Jenkins <trevor.jenkins at suneidesis.com> wrote:
> On 4 Oct 2012, at 01:18, Mark Davies <Mark_Davies at byu.edu> wrote:
> 
>>>> Even the newly appeared American Soap Operas corpus on Mark Davies's site is still constructed and ultimately "high-brow".
>> 
>> Maybe, but see:
>> 
>> http://corpus2.byu.edu/soap/overview.asp (comparison to the spoken portion of the BNC; in many respects, much more colloquial than the BNC spoken)
>> http://corpus2.byu.edu/soap/ (Soap Opera corpus itself; 100 million words)
> 
> I think you make my point in that overview with the observations about the costs of compiling spoken corpora.
> 
> In the description of American Soap Operas there are these claims…
> 
>> the theory that the dialogue in most TV shows and movies represents the spoken language pretty well.
> and 
>> we would suggest that subtitles from informal TV shows and movies does represent the informal, everyday language quite well -- especially soap operas. 
> 
> It's that theory I challenge. I don't believe that scripted and rehearsed productions represent spoken language very well at all. Soap operas are not *informal* at all, being as they are scripted productions created in highly formalised environments (writers' room, studio, post-production suite, etc). Soap operas are no more informal than presidential candidate debates are unscripted. Soaps' *settings* may be informal --- homes, offices, schools --- but the language is every bit as formal as that in the printed material collected in other corpora.
> 
> It's a 2-space issue: formal/informal x language/setting. Setting is no guide to language. Sadly, many confuse the informal nature of a setting with the informal use of language; I encounter this with teachers of (British) sign language, where an informal /setting/ is often labelled as informal language. The problem is more likely to be an N-space one, as one has to account for L1/L2 interaction, varying age of participants, differing status of participants (parent/child, employer/employee, teacher/pupil, lecturer/students) and many other factors, not least the Humpty Dumpty effect.
> 
> The credits of all 10 soap exemplars you use include the telling phrase "written by". While the language use may reflect some current informal phrase usages, none of the content is primary source. We also have intrusion; the British sketch shows Little Britain and the Catherine Tate Show pretty much created new phrases. Little Britain's character Vicky Pollard with her "yeah but no but yeah" marker and Catherine Tate with her "am I bovered" comment were taken up with great alacrity by school children and young adults.
> 
> There is a British sitcom called Outnumbered that has some improvised dialogue because of the ages of some of the cast (when the series started the oldest child cast member was 11, the youngest 5 or 6, and the third 7 or 8). Their contributions were guided in rehearsal but not explicitly scripted, so their specific lexical choices represent the social class the children were born into, but that is still somewhat high-brow as the parents are porn film producers, sports reporters and actors. Even with the freedom to improvise, the language reflects a specific class (A/B1) rather than being representative of "the 99%".
> 
> Having a corpus of transcripts of confrontational shows such as those of Jerry Springer, perhaps Ricki Lake, or the British Jeremy Kyle, in which more vernacular language is included (although often censored by the sound department), might meet those two claims (good representation of normal spoken English and informal usage). Or, possibly better, transcripts of the unredacted 24-hour live feeds of reality shows like Big Brother. But there's still a selection process involved which skews the language used.
> 
> Now the irony is that, until such time as a large-scale corpus of truly informal, unrehearsed, unscripted utterances exists, we won't be able to do any comparisons between the lexical choices and grammar constructions of normal language.
> 
> Regards, Trevor.
> 
> <>< Re: deemed!
> 
> 
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
> 
> 

Regards, Trevor.

<>< Re: deemed!
