Corpora: Looking for "ing" - re Ute Römer's request for help

Tue Feb 22 15:43:57 UTC 2000

Dear All,

I've been moved by Ute Römer's request (see below) to put out a note to the
list that I'd been meaning to send for some time.

>I'm wondering whether one of you could possibly help me with a
>research project on future expressions in English. I'm looking
>for several structures in the spoken part of the British
>National Corpus and I have some problems to find types like
>"VERBing", "will be VERBing" and so on.
>
>Is there a possibility to find all present progressive forms
>without doing a separate query on every single verb, i. e.
>is it possible to insert some kind of "place marker"
>indicating "base form of lexical verb"?

What Ute is asking to do (noting also the various caveats about the
reliability of coding in BNC) has been dead easy for a long time for anyone
who *hasn't* been using SARA.  My own work, and the work of a lot of other
people, has been made possible by the fact that Mike Scott has been able to
put together a program that can search selectively for POS tags in large
corpora (WordSmith Tools - http://www.liv.ac.uk/~ms2928/homepage.html), and
that the BNC can be unpacked from the distribution CDROM and worked with in
a tack DOS environment as a set of plain text files.  Others have already
indicated that you can use WordSmith or MonoConc Pro to hunt for POS tags in
the BNC -- the problem that many of us/you might have is actually
transferring the corpus from the distribution disks to a DOS partition on a
hard disk (it only takes about 1.5 GB of disk space for the WHOLE corpus).

With this in mind, I've put together a brief account below of how you can
do this using ordinary DOS tools.  I know that the real corpus linguists
out there could have whacked together a Perl script that would have sorted
this out in 30 seconds on a 386 Linux box -- but I'm still at the bottom of
that learning curve and want to ask the corpus questions that SARA can't
help me with.  So I offer this to any who might be interested.  It
works.  If you set WordSmith TAGS to include "<", ">" as characters, and
use the search string "*V*G>*ING" you can get results like the brief
extract below.

393	           <w AJC>smarter <w PRP>by <w VVG>making <w AT0>a <w ORD
394	          e> <w CJC>And <w AV0>then <w VVG>moving <w AT0>the <w UN
395	        UN>, <w DT0>that<w VBZ>'s   <w VVG>going <w TO0>to <w VVI>lo
396	        N1>spatula <w PRP>for   <w AJ0-VVG>spreading <w NN1>glue<c P
397	        N1>spatula <w PRP>for   <w AJ0-VVG>spreading <w NN1>glue<c P
398	         lly <w VHZ>has <w VBN>been <w VDG>doing <w NN2>things <w A
399	           <w VHZ>has   <w VBN>been <w VDG>doing <w PNP>them<c PUN
400	           ou<w VHB>'ve <w VBN>been <w VVG>using <w AT0>a   <w NN1>
401	          ve <w PNP>you <w VBN>been <w VDG>doing <w AV0>wrong   <w N
402	           ou<w VHB>'ve <w VBN>been <w VVG>using <w AT0>a   <w NN1>
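
For anyone who would rather script this than use a concordancer, here is a
minimal Python sketch of the same kind of search.  It assumes the tagged text
looks like the extract above ("<w TAG>word"), with ambiguity tags such as
AJ0-VVG joined by a hyphen.  The function name find_ing_forms and the regular
expression are my own illustration, not part of WordSmith or the BNC
distribution, and the pattern has only been checked against lines like those
shown here.

```python
import re

# A <w ...> tag whose code, or one half of an ambiguity tag, has the shape
# V?G (VVG, VDG, VBG, VHG), followed by the word itself.
ING_TOKEN = re.compile(
    r"<w ((?:[A-Z0-9]{3}-)?V[A-Z]G(?:-[A-Z0-9]{3})?)>([^<\s]+)"
)


def find_ing_forms(text):
    """Return (tag, word) pairs for every token tagged as an -ing verb form."""
    return [(m.group(1), m.group(2)) for m in ING_TOKEN.finditer(text)]


# Sample line stitched together from the concordance extract above.
sample = (
    "<w AJC>smarter <w PRP>by <w VVG>making <w AT0>a "
    "<w PRP>for <w AJ0-VVG>spreading <w NN1>glue "
    "<w VHZ>has <w VBN>been <w VDG>doing <w NN2>things"
)

for tag, word in find_ing_forms(sample):
    print(tag, word)
```

To run it over a whole file, read the file into a string and pass that to
find_ing_forms in place of the sample.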

Useful?

---------------------
Using the BNC on a PC
Some guidelines for anyone with an interest in using the BNC on a Windows
computer

The BNC

The British National Corpus is a potentially important resource for
teachers and researchers, but it was designed with the needs of a narrower
community in mind than the one that I belong to, and at the moment remains
intimidating and impenetrable for most PC users (the PC being the main IT
resource for the rest of us these days).  Problems arise at a number of
levels - viz:

1.	unpacking the files from the installation CDROM
2.	identifying which files might be useful
3.	working with the corpus

This note deals with the first two of these points.  It is a bodger's guide
to setting up the BNC for people with a reasonable understanding of how to
use a Windows PC.  It is not a definitive guide to all the things you can
do with BNC once you've got it set up.

Unpacking

The BNC comes on 3 CDs (at a cost of around £250 for all three - it may be
possible to negotiate the purchase of CD 1 only - I should have done this,
but did not realise you only need disk 1 to use BNC on a PC!).  These CDs
contain the BNC data files and a whole host of other applications -
especially SARA, the search engine specifically designed for the corpus.
All of these appear to require Unix, or a Unix-like operating system such as
Linux, to be installed on the PC.  My assumption is that the rest of us do
not want the learning curve involved in setting up Unix on our machines,
and that more teachers will use BNC if they can use it on their work PCs.

·	requirements

The good news is that the corpus data can be unpacked and transferred to a
PC's hard disk without too much trouble.  The prerequisites are:

·	a PC with Windows 95 or better
·	around 6 gigabytes (GB) of free disk space
·	WinZip 32 (a shareware application that is widely available - if you
haven't got it you can download it from http://www.winzip.com/).

The steps involved in creating a copy of the corpus should be easy, but you
might meet a couple of problems because of an error in the creation of the
original CDROM - this may have been fixed in later versions, but users
should be aware of the potential glitch.

·	unpacking

The procedure is as follows:

·	Step 1 - identify the BNC Files
Put Disk 1 of the three disk set in your PC.  Windows Explorer will show
you that this contains the following folders:

A.TGZ       51,845,713
B.TGZ       22,991,840
C.TGZ       65,652,987
D.TGZ          321,820
DOC.TGZ      3,126,547
E.TGZ       33,186,120
F.TGZ       46,629,619
G.TGZ       40,933,113
H.TGZ       95,877,886
J.TGZ       27,978,331
K.TGZ       61,981,017
SARA.TGZ       394,824
SGML.TGZ       125,590

The folders that interest us are A.TGZ through to K.TGZ and DOC.TGZ which
contains the BNC users' guide.  The TGZ extension indicates that the
folders are compressed.  The good news is that WinZip can uncompress these
files and transfer them from the CDROM on to your PC's hard disk.  The bad
news is that folders A, B and C contain compressed folders which have
themselves been incorrectly named, and therefore present a problem for
unpacking.

·	Step 2 - unpack and rename the contents of Folders A,B & C
With WinZip installed on your PC, when you double click on Folder A you will
see that it contains a folder called "a".  This should be called "a.tar" -
tar being an archive format which WinZip can also unpack.  So to make
this folder useable it has to be un-zipped to the hard drive and then
renamed.  Do this in the following way:

-	double click on Folder A
-	in the WinZip window select "a" (this is the only folder)
-	select "Extract" from the WinZip menu
-	choose an appropriate folder on your hard disk drive to which you want to
send the folder (I have a folder called C:\ZIP_TEMP on my PC that I reserve
for this sort of activity).  Extract the folder to this drive.  These are
BIG files ("A" is over 50 MB), so if it takes a few minutes, don't panic!
-	using Windows Explorer open eg C:\ZIP_TEMP and right click on the file
you have transferred (you will see that it is now much bigger).  Select
"Rename" from the menu and add the extension .TAR to the file name.
-	You will now be able to uncompress this to an appropriate directory - eg
C:\BNC - by (1) double clicking on the folder, (2) choosing "select all"
from the "Actions" menu, and then (3) selecting "Extract" and sending the
files to eg C:\BNC.
-	Repeat these steps for folders B.TGZ and C.TGZ.  This took me some time
to work out, but once you have understood the problem, it's easy to fix.

·	Step 3 - unpacking Folders D - K
-	In Windows Explorer, double click on an appropriate folder on the BNC
CDROM (eg "D")
-	When WinZip asks you "Should WinZip decompress it to a temporary folder
and open it?", select "No".  A second WinZip window will open containing a
single folder .TAR.
-	Double click on this folder and get a list of the folders contained in
the .TAR folder.
-	In WinZip, select all these folders through Actions, Select All.
-	Extract the folders to an appropriate directory (eg C:\BNC)
-	Repeat this process for the remaining folders (ie E, F, G, H, J, K)

You will now have a full version of the text files in BNC on your hard disk
drive.

·	You can use the same procedure to unpack the BNC documentation.  Select
DOC.TGZ and decompress it to an appropriate folder on your PC (eg
C:\BNCDOC).  The information on the BNC documentation is invaluable as it
tells you what is contained in each file of BNC text.
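
If you have Python on your PC, the unpacking steps above can probably be
scripted.  The following is a minimal sketch, not a tested recipe for the real
CDROM: it assumes the .TGZ files are ordinary gzip-compressed tar archives (as
the extension suggests), in which case Python's tarfile module reads the tar
data directly and the missing ".tar" name inside A, B and C should not matter.
The function name unpack_archives and the drive letters in the usage note are
my own illustration.

```python
import os
import tarfile

# Text archives on disk 1 (there is no I.TGZ in the listing above).
BNC_ARCHIVES = ["A", "B", "C", "D", "E", "F", "G", "H", "J", "K"]


def unpack_archives(cd_root, dest, names=BNC_ARCHIVES):
    """Unpack NAME.TGZ for each name found under cd_root into dest.

    Returns the list of archive file names that were actually unpacked.
    """
    os.makedirs(dest, exist_ok=True)
    # Newer Python versions offer a safer extraction filter; use it if present.
    extra = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    unpacked = []
    for name in names:
        path = os.path.join(cd_root, name + ".TGZ")
        if not os.path.exists(path):
            continue  # skip archives that are not on this disk
        with tarfile.open(path, "r:gz") as archive:
            archive.extractall(dest, **extra)
        unpacked.append(name + ".TGZ")
    return unpacked


# Example calls (assumed paths; adjust the CDROM drive letter to suit):
#   unpack_archives("D:\\", "C:\\BNC")
#   unpack_archives("D:\\", "C:\\BNCDOC", names=["DOC"])
```

As with WinZip, the larger archives will take a few minutes each.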

Identifying files which might be useful

Each file in the BNC has what is called "header" information which
specifies exactly what is in the file, where it came from and a whole host
of genre and contextual details.  You can use this to divide the corpus
into subsets.  As an example, I will demonstrate how to separate the 10
million word spoken corpus from the 90 million word written set.  This is a
useful way of making an initial division of the corpus into more useable
lumps, and can be done with the "Find" tool in Windows Explorer.  Once you
have become more confident in breaking the corpus into smaller units, you
will be able to create subsets of the corpus as you require them (eg
fiction, business oriented texts, journalism etc)

·	Step 1 - identify all spoken texts in the BNC
-	Select your BNC folder (eg C:\BNC)
-	Open Windows Explorer, select Tools, Find and then search for all files
containing the header information "<stext".  This will identify all the
files in the BNC which contain spoken data.

·	Step 2 - create a spoken corpus from BNC
-	Now that you know which files contain spoken corpus data, you can use
Windows Explorer to MOVE these files to a new folder called eg C:\BNC_SP.
-	You now have two sets of data to work with - 90 million words of written
text and 10 million words of spoken.
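
The two steps above (find the files containing "<stext", then move them) can
also be done in one pass with a short script.  This is a minimal Python sketch
under the same assumptions as the Explorer procedure: that the unpacked corpus
files are plain text and that the string "<stext" appears only in spoken
files.  The function name move_matching_files and the folder names in the
usage note are illustrative only.

```python
import os
import shutil


def move_matching_files(src_dir, dest_dir, marker="<stext"):
    """Move every file under src_dir whose contents include marker to dest_dir.

    Returns the sorted base names of the files that were moved.
    """
    needle = marker.encode("ascii")
    matches = []
    # First pass: find the matching files without moving anything.
    for folder, _subdirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                if needle in handle.read():
                    matches.append(path)
    # Second pass: move them into the new folder.
    os.makedirs(dest_dir, exist_ok=True)
    for path in matches:
        shutil.move(path, os.path.join(dest_dir, os.path.basename(path)))
    return sorted(os.path.basename(p) for p in matches)


# Example call (assumed folder names, as in the steps above):
#   move_matching_files("C:\\BNC", "C:\\BNC_SP")
```

Passing a different marker (eg "wridom6") would pull out other subsets in the
same way.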

You can use the same procedure for creating smaller domain or text type
specific subcorpora (eg corpus files containing the string "wridom6"  are
classed as Written: Domain: Informative: Commerce and Finance).  These have
many practical applications for language teaching and learning, and are
easier and quicker to handle than the full corpus (though they can be
combined in future searches if you want to draw on the full corpus).

[For a (much better) alternative, contact Dave Lee
(david_lee00 at hotmail.com) who has put together a really neat Excel
spreadsheet which tells you which files contain what categories of text --
and also makes a better fist of genre than the original BNC categories.]

Working with the corpus

I am not going to expand on this here.  I would refer readers to my own
book (Tribble C & G Jones (1997)   Concordances in the Classroom: a
resource book for teachers   Athelstan  Houston TX) for an overview of
approaches to the use of corpus data in language education.  As far as
software tools are concerned, apart from WinZip, I would recommend you get
hold of either WordSmith Tools (Scott M 1996  WordSmith Tools   Oxford
University Press  Oxford) or MonoConc Pro (Barlow M 1998  MonoConc Pro
Athelstan  Houston TX).

Best

Chris Tribble
--
		Dr Christopher Tribble
Sri Lanka	21 Wijerama Mawatha, Colombo 7
		TEL  +94 75 332 309
UK   		122, Queen Alexandra Mansions, Judd Street
		London WC1 H 9DQ
		TEL +44 171 833 4271
UK Mailing	c/o FCO (Colombo)
		The British Council (Sri Lanka)
		King Charles Street, London SW1A 2AH
E-mail		ctribble at sri.lanka.net
Home Page	http://ourworld.compuserve.com/homepages/Christopher_Tribble