Corpora: Announcement: BNC Index

David Lee david_lee00 at hotmail.com
Mon Apr 23 13:56:06 UTC 2001


Dear All,

Every now and then, there are requests for corpora/subcorpora of
specific genres of English. Recently, for example, there were requests
for “academic EFL/ESL texts” and for “business English”. In the past,
people have also asked for things like “medical language”, “e-mail
discussions” or “children’s writing”.

If it’s British English you’re after, there is perhaps no better place
to start than with the British National Corpus
(http://info.ox.ac.uk/bnc/), which contains all the above (sub)genres
and more. However, up till now,
it’s been very difficult for most end-users to quickly browse/search the
BNC by genre or by a combination of criteria such as audience age,
author age, domain of discourse, medium, audience level, etc. in order
to find specific texts which fit specific research needs precisely. I
suspect this difficulty is why many people never think of looking in the
BNC for what they want.

At TALC 2000 in Graz, I first announced the work that I had been doing
on categorising all the BNC texts in terms of genre (e.g.
Written_Academic_Prose_Social Sciences; Written_Imaginative_Poetry;
Spoken_Consultations; Spoken_Courtroom_Discourse). I would now like to
announce that this resource, called the “BNC Index”, is available for
use (in spreadsheet format and other incarnations; see below). This
genre classification of texts has also
been incorporated into the headers of the 4,055 files of the new BNC
World Edition. (The BNC Index itself, however, covers all the 4,124
files of BNC Version 1.)

==========

The BNC Index itself, in Microsoft Excel spreadsheet format, is
available from:
http://members.nbci.com/davidlee00/corpus_resources.htm


If you don’t like spreadsheets or would like an easier interface, try
the BNC Web Indexer (a front end to the BNC Index) at:
http://www.comp.lancs.ac.uk/computing/research/ucrel/bncindex/

(Access is not restricted, but please register your details on the
welcome page and read the documentation & caveats before using.)


Alternatively, you can download the stand-alone program written by
Antonio Ortiz (who announced this recently on the list):
http://webdeptos.uma.es/filifa/personal/amoreno/indexer


The differences between the last two facilities:

(1) The BNC *Web* Indexer and spreadsheet will be updated regularly,
whenever errors are spotted and reported to me, whereas Ortiz’s
stand-alone BNC Indexer will be updated as and when time permits. (At
the time of writing, Ortiz’s program does not yet include my latest
changes, and is thus not up to date.)

(2) At present the *Web* Indexer doesn’t allow selection of more than
one option within each field/category (e.g. you cannot select more than
one genre, more than one author age range, and so on); this limitation
will (hopefully) be fixed soon. The *stand-alone* Indexer does allow
multiple selections. (They are also possible, of course, if you use the
spreadsheet.)
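
For anyone who prefers a script to the spreadsheet, multiple selections
within a field can be sketched in a few lines of Python, assuming the
BNC Index has been saved from Excel as a CSV file. This is only an
illustration: the column names (file_id, genre, author_age), the file
IDs and the age ranges below are made-up placeholders, not the actual
layout of the BNC Index, so adjust them to match the real spreadsheet.

```python
import csv
import io

# Made-up sample rows standing in for a CSV export of the spreadsheet.
# Column names, file IDs and age ranges are placeholders (assumptions),
# not the real BNC Index layout.
SAMPLE_CSV = """file_id,genre,author_age
T01,Written_Imaginative_Poetry,25-34
T02,Spoken_Consultations,
T03,Written_Imaginative_Poetry,45-59
T04,Spoken_Courtroom_Discourse,
T05,Written_Academic_Prose_Social Sciences,35-44
"""


def select_rows(rows, criteria):
    """Keep rows that satisfy every field in `criteria` (AND across fields),
    where each field accepts any one of several values (OR within a field)."""
    selected = []
    for row in rows:
        if all(row.get(field, "") in allowed
               for field, allowed in criteria.items()):
            selected.append(row)
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Two genres selected at once.
criteria = {"genre": {"Written_Imaginative_Poetry", "Spoken_Consultations"}}
matching_ids = [row["file_id"] for row in select_rows(rows, criteria)]
print(matching_ids)
```

Adding a second key to the criteria (for example an author_age entry
with several ranges) narrows the selection further, since a row must
match one allowed value in every listed field.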

So... choose according to your needs.

==========

These resources will allow users to scan the BNC by genre (24 spoken and
46 written genres) and a number of other criteria (time period, audience
level, spontaneity, library keywords, bibliographical details, etc.)

But note the following caveats:

(1) Genre classifications were done under time constraints, so I would
advise manual checks on search results where possible.

(2) Read the documentation on the categorisation scheme before
proceeding.


The point of the BNC Index (or Indexers) is to enable researchers
(esp. those not particularly computer-literate) to obtain lists of
BNC file IDs for constructing their own particular sub-corpora for use
with stand-alone PC concordancers such as WordSmith or MonoConc (which
allow users to specify a list of files as a subcorpus to restrict
queries to).

The server-based SARA and BNCWeb programs can already do this, but they
don’t allow pure part-of-speech-tag searches. People using stand-alone
PC concordancers for this reason can now specify subcorpora at the file
level by first using the BNC Index to obtain relevant file IDs.
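
Once a set of file IDs has been obtained, it usually helps to save them
as a plain text list, one ID per line. The Python sketch below shows one
way of doing that. It assumes nothing about how WordSmith or MonoConc
read file lists (check their documentation for that); the function
name, the output file name and the sample IDs are all hypothetical.

```python
from pathlib import Path
import tempfile


def write_id_list(file_ids, out_path):
    """Write the unique, non-empty file IDs in sorted order, one per line.
    Returns the number of IDs written."""
    unique_ids = sorted({fid.strip() for fid in file_ids if fid.strip()})
    text = "".join(f"{fid}\n" for fid in unique_ids)
    Path(out_path).write_text(text, encoding="utf-8")
    return len(unique_ids)


# Placeholder IDs (not real BNC file IDs), with a duplicate and stray spaces.
ids = ["T03", "T01", "T03", " T02 "]
out_file = Path(tempfile.gettempdir()) / "subcorpus_ids.txt"
count = write_id_list(ids, out_file)
print(f"Wrote {count} file IDs to {out_file}")
```

Removing duplicates and sorting keeps the list stable when results from
several searches are combined.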

I hope some people will find this useful.


David Lee

-----------------------------------------------------------------
David YW Lee
Visiting Researcher
Dept of Linguistics
Lancaster University
Lancaster LA1 4YT
England, UK.

Email: david_lee00 at hotmail.com
-----------------------------------------------------------------


