[Corpora-List] What is corpora and what is not?

Piotr Pezik pezik at uni.lodz.pl
Thu Oct 4 09:38:43 UTC 2012


Having been involved in the process of acquiring both conversational and
"on-air" spoken language data for the National Corpus of Polish (NKJP), I'd
have to strongly agree with Trevor's remarks.

I think the American Soap Operas Corpus, although a very valuable resource
in its own right, represents written-to-be-spoken rather than spoken
language. Soap opera scripts are essentially their authors' impressions of
casual spoken language, not that much different from the linguistically
realistic dialogues you might find in a novel or a play.
They often are an accurate reflection of (a particular breed of) spoken
language and sometimes they are even an exaggerated impression, which is
why you might find them to be more spoken than the conversational part of
the BNC (the *plus-catholique-que-le-pape* effect), but they're not the
real thing simply because they are written and edited and not produced with
the real time constraints of casual spoken discourse.

Live TV shows are closer to casual spoken discourse, although still very
different, if you consider their pragmatic discourse structure among other
dimensions of comparison. For example, it is fairly obvious that while
speaking to anyone in the studio, politicians and celebrities generally
tend to "communicate" to their viewers/voters. On-air spoken language is
different from what you get when the cameras and microphones are switched
off.

Regards,

Piotr

On Thu, Oct 4, 2012 at 10:31 AM, Trevor Jenkins <
trevor.jenkins at suneidesis.com> wrote:

> On 4 Oct 2012, at 01:18, Mark Davies <Mark_Davies at byu.edu> wrote:
>
> Even the newly appeared American Soap Operas corpus on Mark Davies's site is
> still constructed and ultimately "high-brow".
>
>
> Maybe, but see:
>
> http://corpus2.byu.edu/soap/overview.asp (comparison to the spoken
> portion of the BNC; in many respects, much more colloquial than the BNC
> spoken)
> http://corpus2.byu.edu/soap/ (Soap Opera corpus itself; 100 million words)
>
>
> I think you make my point in that overview with the observations about the
> costs of compiling spoken corpora.
>
> In the description of American Soap Operas there are these claims…
>
> the theory that the dialogue in most TV shows and movies represents the
> spoken language pretty well.
>
> and
>
> we would suggest that subtitles from *informal* TV shows and movies does
> represent the informal, everyday language quite well -- especially *soap
> operas*.
>
>
> It's that theory I challenge. I don't believe that scripted and rehearsed
> productions represent spoken language very well at all, because soap operas
> are not **informal** at all: they are scripted productions created in
> highly formalised environments (writers' room, studio, post-production
> suite, etc.). Soap operas are no more informal than presidential candidate
> debates are unscripted. Soaps' **settings** may be informal --- homes,
> offices, schools --- but the language is every bit as formal as that in the
> printed material collected in other corpora.
>
> It's a 2-space issue: formal/informal x language/setting. Setting is no
> guide to language. Sadly, many confuse the informal nature of setting with
> the informal use of language; I encounter this with teachers of (British)
> sign language, where an informal /setting/ is often labeled as informal
> language. The problem is more likely to be an N-space one, as one has to
> account for L1/L2 interaction, varying age of participants, differing
> status of participants (parent/child, employer/employee, teacher/pupil,
> lecturer/students) and many other factors, not least the Humpty Dumpty
> effect.
>
> Of the 10 soap exemplars you use, their credits all include the telling
> phrase "written by". While the language use may reflect some current
> informal phrase usages, none of the content is primary source. We also have
> intrusion; the British sketch shows Little Britain and the Catherine Tate
> Show pretty much created new phrases. Little Britain's character Vicky
> Pollard with her "yeah but no but yeah" marker and Catherine Tate with her
> "am I bovered" comment were taken up with great alacrity by school children
> and young adults.
>
> There is a British sitcom called Outnumbered that has some improvised
> dialogue because of the ages of some of the cast (when the series started,
> the oldest child cast member was 11, the youngest 5 or 6, and the third 7
> or 8). Their contributions were guided in rehearsal but not explicitly
> scripted, so their specific lexical choices represent the social class the
> children were born into, but that is still somewhat high-brow as the parents
> are porn film producers, sports reporters and actors. Even with the freedom
> to improvise, the language reflects a specific class (A/B1) rather than
> being representative of "the 99%".
>
> Having a corpus of transcripts of confrontational shows such as those of
> Jerry Springer, perhaps Ricki Lake, or the British Jeremy Kyle, in which
> more vernacular language is included (although often censored by the sound
> department), might meet those two claims (good representation of normal
> spoken English and informal usage). Or possibly better: transcripts of the
> unredacted 24-hour live feeds of reality shows like Big Brother. But there's
> still a selection process involved which skews the language used.
>
> Now the irony is that until such time as a large-scale corpus of truly
> informal, unrehearsed, unscripted utterances exists, we won't be able to do
> any comparisons between the lexical choices and grammar constructions of
> normal language.
>
> Regards, Trevor.
>
> <>< Re: deemed!
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>