[Corpora-List] labels of COLT files in BNC spoken

Ute Römer ute.roemer at anglistik.uni-hannover.de
Thu Nov 13 07:38:59 UTC 2003


Dear Eric, Bayan, and others,


> but as far as I know there isn't anything in BNC documentation equivalent
> to a list of filenames of files from COLT

That's too bad. I was sure there had to exist such a list somewhere but
apparently it doesn't (or nobody knows about it).

I'm not 100% sure yet (more concordance checks required), but I think I've
found the 377 COLT files. Last night I scrolled through the list of BNC
texts (in SARA; unfortunately, it's not possible to copy and paste this list
to search it automatically) and checked the bibliographic reference for
quite a number of those labelled "n conversations recorded by X" in the
list. It looks as if files KNR to KR2 and KSN to KSW (51 files, consisting
of 1 to 39 conversations each) are COLT files, or most of them at least. You
get information like

"<hi>7 conversations recorded by `Robin' (PS58K) [dates unknown] with 6
interlocutors, totalling 1126 s-units, 5165 words (duration not
recorded).</hi>

PS58K `Robin', 14, student, AB, male

PS58L `Jones', teacher, male

PS58M `Zoe', 13, student, female

PS58N `Ben', 14, student, male

PS58P `Oliver', 13, student, male

PS5AV `Jenny', 13, student, female"

-- sounds very COLTish to me.
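
As a rough sketch of how participant lines like the ones quoted above could be checked automatically rather than by eye, the snippet below pulls out the speaker ID, name, and age where one is given. The function names, the regular expression, and the 13-19 age band are illustrative assumptions, not part of SARA or the BNC documentation; real BNC headers are marked up and would need their own parsing.

```python
import re

# Matches participant lines shaped like the ones quoted in this message, e.g.
#   PS58K `Robin', 14, student, AB, male
# The comma after the name is optional because one quoted line lacks it.
SPEAKER_RE = re.compile(r"^(PS\w+)\s+`([^']+)'\s*,?\s*(.*)$")


def parse_speaker(line):
    """Return a dict with id, name, age (or None) and the remaining fields."""
    match = SPEAKER_RE.match(line.strip())
    if match is None:
        return None
    speaker_id, name, rest = match.groups()
    fields = [f.strip() for f in rest.split(",") if f.strip()]
    # The first purely numeric field is taken to be the age (an assumption).
    age = next((int(f) for f in fields if f.isdigit()), None)
    return {"id": speaker_id, "name": name, "age": age, "fields": fields}


def is_teenager(speaker, low=13, high=19):
    """True if an age is recorded and falls in the assumed teenage band."""
    return speaker["age"] is not None and low <= speaker["age"] <= high


lines = [
    "PS58K `Robin', 14, student, AB, male",
    "PS58L `Jones'teacher, male",
    "PS58M `Zoe', 13, student, female",
]
speakers = [parse_speaker(line) for line in lines]
teen_ids = [s["id"] for s in speakers if is_teenager(s)]
```

A speaker with no recorded age (like the teacher here) is simply not counted as a teenager, so the result is a lower bound on teenage participants.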

Also, I had a look at some headers of these files (checked the BNC texts in
version 1.0 though) and spotted lots of COLT key items like "Hackney" or
"Greater London". I then saved these 51 BNC files as a subcorpus and did a
concordance check of "ai" in this collection (using SARA2) and of "ain"
("ai" didn't work here) in the real COLT (using WST). I found 307
occurrences in my supposed COLT and 293 in the real one - not 100%
convincing but not too bad either.
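
The same comparison could be sketched outside SARA and WST, assuming the transcripts were available as plain text keyed by BNC file ID. Everything here is hypothetical scaffolding: the `texts` dictionary, the function names, and in particular the idea that the ranges KNR to KR2 and KSN to KSW can be treated as inclusive plain string ranges, which is an assumption about how the BNC text list is ordered.

```python
import re

# File-ID ranges reported in this message, treated as inclusive string ranges.
ID_RANGES = [("KNR", "KR2"), ("KSN", "KSW")]

# Matches "ain't" as well as the split form "ai n't".
AINT_RE = re.compile(r"\bai\s?n't\b", re.IGNORECASE)


def in_candidate_range(file_id, ranges=ID_RANGES):
    """True if the file ID falls inside one of the candidate COLT ranges."""
    fid = file_id.upper()
    return any(low <= fid <= high for low, high in ranges)


def count_aint(texts, ranges=ID_RANGES):
    """Count matches over all files whose ID lies in the candidate ranges."""
    return sum(
        len(AINT_RE.findall(text))
        for file_id, text in texts.items()
        if in_candidate_range(file_id, ranges)
    )


def relative_difference(count_a, count_b):
    """Difference between two counts as a fraction of the second count."""
    return abs(count_a - count_b) / count_b


# Toy data only; real input would be the transcribed speech of each file.
texts = {
    "KP5": "I ain't going. He ai n't coming.",
    "KB7": "This file is outside the ranges, so ain't is ignored here.",
}
toy_count = count_aint(texts)
gap = relative_difference(307, 293)
```

With the two figures from the concordance check, `relative_difference(307, 293)` comes out at just under 5%, which puts a number on "not 100% convincing but not too bad either".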

However, if these files (my saved "COLT?" BNC subcorpus) really make up
COLT, then most of my occurrences of "ain't" are not from teenage language.
So, unfortunately, all that searching, browsing, and alerting you hasn't
really solved my problem. Anyway, I guess I know a bit more about the BNC
and COLT contents now (and about the importance of knowing exactly what's in
your corpus - and, ideally, where it is).

Thanks to Eric and to Linda Bawcom (who contacted me off the list).

Best from Hanover... Ute


************************************************************

Ute Römer
English Department
University of Hanover
Königsworther Platz 1
30167 Hannover
Germany

Phone: +49 (0)511 762 2997
Fax: +49 (0)511 762 2996
E-mail: ute.roemer at anglistik.uni-hannover.de
http://www.fbls.uni-hannover.de/angli/


> Bayan ended up searching all
> spoken transcript files including teenager speakers (speaker age is in
> the header info).
>
> If you (or someone else) discover a solution, do please let us know...
>
> and in the meantime, feel free to try out the chatbots we have trained
> on various BNC files at http://www.comp.leeds.ac.uk/eric/
>
> - we have to demo these at the BCS Machine Intelligence contest at
>   Cambridge Univ, December 16th, as an example of Machine Learning used
>   to visualise sublanguage ... so feedback to help us carry off the
>   trophy and GBP1000 cash prize is welcome!!!
>
> cheers
>
> eric atwell
>
>
> On Tue, 11 Nov 2003, Ute Römer wrote:
>
> > Dear all,
> >
> > I was wondering if anyone of you could tell me which text files in the
> > BNC are COLT files. I checked David Lee's Excel spreadsheet and the BNC
> > World list of texts (on the SARA2 start page) but didn't find the
> > information I was hoping to get (maybe I didn't search long enough though).
> > The thing is that I'm trying to nail down repeated occurrences of "ai
> > n't" plus progressive form (and missing form of TO BE plus progressive form)
> > in BNC (spoken) data which I don't get in my Bank of English (brspok) data.
> > I thought that the amount of teenage and adolescent language in the BNC
> > might be a possible explanation for fragmentary constructions. It's not a
> > big thing, really, and I suppose I could check the headers of all the BNC
> > files my concordance examples come from (to see how old the participants
> > are), but maybe there is an easier/faster option.
> >
> > Thanks in advance and best wishes. Ute
> >
> >
> >
>
> --
> Eric Atwell, Senior Lecturer, Computer Vision and Language research group
> Distributed Multimedia Systems MSc Tutor & SOCRATES/JYA Tutor
> School of Computing, University of Leeds, LEEDS LS2 9JT
> TEL: 0113-3435761  MOBILE: 0775-1039104 FAX: 0113-3435468
> WWW: http://www.comp.leeds.ac.uk/eric  EMAIL: eric at comp.leeds.ac.uk
> Visit http://www.computingLEEDS.ac.uk - our newsletter for industry
>
>
>


