[Corpora-List] labels of COLT files in BNC spoken

Eric Atwell eric at comp.leeds.ac.uk
Thu Nov 13 13:00:26 UTC 2003


Lou,
thanks for this expert clarification.
Demo chatbots trained with a variety of BNC files are now on my web-page
http://www.comp.leeds.ac.uk/eric/  and we can add more ....

- I have a follow-up question:  can you suggest any specific BNC spoken
  files which illustrate particularly "interesting" / idiosyncratic
  language use?  For example, the BNC file with the most swearing? :)
  We want to identify a selection of "unusual" files, to train
  a collection of noticeably different chatbots.

thanks

Eric

On 13 Nov 2003, Lou Burnard wrote:

> Apologies for not contributing to this enquiry sooner. A number of
> different issues seem to be confused here:
>
> 1. Which bits of COLT also appear in the BNC?
> 2. How do I find out which bits of the BNC contain London teenage
> speech?
> 3. Is "ain't" characteristic of spoken London teenage language?
>
>
> Here's what *I* think on each of these (see also
> http://www.hf.uib.no/i/Engelsk/colt/COLTinfo.html):
>
> 1. None! COLT is the brainchild of Anna Brita Stenstrom and colleagues
> at Bergen. With funding from Longman and others, they collected the
> audio material which is the "fons et origo" of this material. Longman
> made a transcription of (most of) this audio material and contributed it
> to the BNC. Bergen made a *different* transcription of (most of) the
> same audio, using different conventions, and different markup, and also
> substantially revised the part of speech tagging. The result was
> eventually published as COLT. They did not include any way of linking
> their transcription to the older transcription in the BNC, in particular
> they did not specify which files correspond with which. The BNC files of
> course combine all conversations collected by a single respondent into
> one file, whereas COLT has them in separate files.
>
> 2. Easy. Look at the <catRef> element in the header of each text and
> select those which have appropriate values: (sdeage1 sdeage2 sporeg1 to
> be exact). This gives 43 texts thus classified. You could further refine
> this by looking for words like London in the header, of course, but it
> probably isn't worth the effort.
>
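A minimal sketch of the header check described in point 2, assuming each text header is already available as a plain string. The message does not give the attribute name on <catRef> or the letter case of the codes, so the pattern below accepts either "target" or "targets" and lowercases everything; that is an assumption. Treating the two age codes as alternatives and the region code as required is also only one reading of the message. The names catref_codes, is_candidate and select_texts are hypothetical.

```python
import re

# Assumption: a header contains something like
#   <catRef target="... sdeage1 ... sporeg1 ...">
# The attribute name is not given in the message, so both
# "target" and "targets" are accepted here.
CATREF_RE = re.compile(
    r"<catRef\b[^>]*?\btargets?\s*=\s*(?:\"([^\"]*)\"|'([^']*)')",
    re.IGNORECASE,
)


def catref_codes(header_text):
    """Return the lowercased classification codes found in catRef elements."""
    codes = set()
    for match in CATREF_RE.finditer(header_text):
        value = match.group(1) or match.group(2) or ""
        codes.update(code.lower() for code in value.split())
    return codes


def is_candidate(codes, any_of=("sdeage1", "sdeage2"), all_of=("sporeg1",)):
    """True if at least one any_of code and every all_of code is present."""
    has_any = not any_of or any(code in codes for code in any_of)
    has_all = all(code in codes for code in all_of)
    return has_any and has_all


def select_texts(headers):
    """headers maps a text id to its header string; return matching ids, sorted."""
    return sorted(
        text_id
        for text_id, header in headers.items()
        if is_candidate(catref_codes(header))
    )
```

Passing different any_of / all_of tuples to is_candidate gives other selections without touching the parsing step.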
> 3. Hmm. The problem is in the transcription. As Ylva Berglund found in
> her study of "innit", any pronouncements about relative rates of these
> quasi-lexicalized words in speech and writing have to be hedged around
> with all sorts of caution. The BNC speech transcriptions went through at
> least two normalization stages -- one using the transcriber's judgment
> as to what was intended, and the other using an automatic spelling
> correction tool. Paradoxically, I would expect "aint" or "ent" or
> "innit" to get tidied up into "isn't" disproportionately more often in
> the spoken transcripts than in the written texts, precisely for that
> reason. You can't argue with "ain't" when it's there in black and white
> on the page. The COLT speech transcription, however, was made by people
> with a different agenda, and so I would expect them to be both more
> sensitive to and more likely to wish to record such variation than the
> BNC speech transcribers.
>
> Lou Burnard
>
> On Thu, 2003-11-13 at 07:38, Ute Römer wrote:
> > Dear Eric, Bayan, and others,
> >
> >
> > > but as far as I know there isn't anything in BNC documentation equivalent
> > > to a list of filenames of files from COLT
> >
> > That's too bad. I was sure there had to exist such a list somewhere but
> > apparently it doesn't (or nobody knows about it).
> >
> > I'm not 100% sure yet (more concordance checks required), but I think I've
> > found the 377 COLT files. Last night I scrolled through the list of BNC
> > texts (in SARA; unfortunately, it's not possible to copy and paste this list
> > to search it automatically) and checked the bibliographic reference for
> > quite a number of those labelled "n conversations recorded by X" in the
> > list. It looks as if files KNR to KR2 and KSN to KSW (51 files, consisting
> > of 1 to 39 conversations each) are COLT files, or most of them at least. You
> > get information like
> >
> > "<hi>7 conversations recorded by `Robin' (PS58K) [dates unknown] with 6
> > interlocutors, totalling 1126 s-units, 5165 words (duration not
> > recorded).</hi>
> >
> > PS58K `Robin', 14, student, AB, male
> >
> > PS58L `Jones', teacher, male
> >
> > PS58M `Zoe', 13, student, female
> >
> > PS58N `Ben', 14, student, male
> >
> > PS58P `Oliver', 13, student, male
> >
> > PS5AV `Jenny', 13, student, female"
> >
> > -- sounds very COLTish to me.
> >
> > Also, I had a look at some headers of these files (checked the BNC texts in
> > version 1.0 though) and spotted lots of COLT key items like "Hackney" or
> > "Greater London". I then saved these 51 BNC files as a subcorpus and did a
> > concordance check of "ai" in this collection (using SARA2) and of "ain"
> > ("ai" didn't work here) in the real COLT (using WST). I found 307
> > occurrences in my supposed COLT and 293 in the real one - not 100%
> > convincing but not too bad either.
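The concordance check above can be approximated outside SARA or WST with a plain-text count. A rough sketch, assuming each collection is available as a mapping from file name to text; the pattern covers "ain't", the split form "ai n't", and unapostrophised "aint". The names AINT_RE, count_aint and count_in_corpus are hypothetical.

```python
import re

# Matches "ain't", "ai n't" and "aint" as whole words, case-insensitively.
# The leading \b keeps words such as "paint" or "saint" from matching.
AINT_RE = re.compile(r"\bai\s?n'?t\b", re.IGNORECASE)


def count_aint(text):
    """Count occurrences of the 'ain't' variants in one text."""
    return len(AINT_RE.findall(text))


def count_in_corpus(texts):
    """texts maps a file name to its content; return the total count."""
    return sum(count_aint(content) for content in texts.values())
```

Counts produced this way will not match SARA or WST exactly, since each tool tokenises differently; the point is only to compare two collections on the same footing.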
> >
> > However, if these files (my saved "COLT?" BNC subcorpus) really make up
> > COLT, then most of my occurrences of "ain't" are not from teenage language.
> > So, unfortunately, all that searching, browsing, and alerting you hasn't
> > really solved my problem. Anyway, I guess I know a bit more about the BNC
> > and COLT contents now (and about the importance of knowing exactly what's in
> > your corpus - and, ideally, where it is).
> >
> > Thanks to Eric and to Linda Bawcom (who contacted me off the list).
> >
> > Best from Hanover... Ute
> >
> >
> > ************************************************************
> >
> > Ute Römer
> > English Department
> > University of Hanover
> > Königsworther Platz 1
> > 30167 Hannover
> > Germany
> >
> > Phone: +49 (0)511 762 2997
> > Fax: +49 (0)511 762 2996
> > E-mail: ute.roemer at anglistik.uni-hannover.de
> > http://www.fbls.uni-hannover.de/angli/
> >
> >
> > > Bayan ended up searching all
> > > spoken transcript files including teenager speakers (speaker age is in
> > > the header info).
> > >
> > > If you (or someone else) discover a solution, do please let us know...
> > >
> > > and in the meantime, feel free to try out the chatbots we have trained
> > > on various BNC files at http://www.comp.leeds.ac.uk/eric/
> > >
> > > - we have to demo these at the BCS Machine Intelligence contest at
> > >   Cambridge Univ, December 16th, as an example of Machine Learning used
> > >   to visualise sublanguage ... so feedback to help us carry off the
> > >   trophy and GBP1000 cash prize is welcome!!!
> > >
> > > cheers
> > >
> > > eric atwell
> > >
> > >
> > > On Tue, 11 Nov 2003, Ute Römer wrote:
> > >
> > > > Dear all,
> > > >
> > > > I was wondering if any of you could tell me which text files in the
> > > > BNC are COLT files. I checked David Lee's Excel spreadsheet and the BNC
> > > > World list of texts (on the SARA2 start page) but didn't find the
> > > > information I was hoping to get (maybe I didn't search long enough though).
> > > > The thing is that I'm trying to nail down repeated occurrences of
> > > > "ai n't" plus progressive form (and missing form of TO BE plus progressive form)
> > > > in BNC (spoken) data which I don't get in my Bank of English (brspok) data.
> > > > I thought that the amount of teenage and adolescent language in the BNC
> > > > might be a possible explanation for fragmentary constructions. It's not a
> > > > big thing, really, and I suppose I could check the headers of all the BNC
> > > > files my concordance examples come from (to see how old the participants
> > > > are), but maybe there is an easier/faster option.
> > > >
> > > > Thanks in advance and best wishes. Ute
> > > >
> > > >
> > > > ************************************************************
> > > >
> > > > Ute Römer
> > > > English Department
> > > > University of Hanover
> > > > Königsworther Platz 1
> > > > 30167 Hannover
> > > > Germany
> > > >
> > > > Phone: +49 (0)511 762 2997
> > > > Fax: +49 (0)511 762 2996
> > > > E-mail: ute.roemer at anglistik.uni-hannover.de
> > > > http://www.fbls.uni-hannover.de/angli/
> > > >
> > > >
> > >
> > > --
> > > Eric Atwell, Senior Lecturer, Computer Vision and Language research group
> > > Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
> > > School of Computing, University of Leeds, LEEDS LS2 9JT
> > > TEL: 0113-3435761  MOBILE: 0775-1039104 FAX: 0113-3435468
> > > WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric at comp.leeds.ac.uk
> > > Visit http://www.computingLEEDS.ac.uk - our newsletter for industry



