7.1661, Sum: Corpus design

The Linguist List linguist at unix.tamu.edu
Sun Nov 24 05:39:16 UTC 1996


---------------------------------------------------------------------------
LINGUIST List:  Vol-7-1661. Sat Nov 23 1996. ISSN: 1068-4875. Lines:  115
 
Subject: 7.1661, Sum: Corpus design
 
Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at unix.tamu.edu>
            Helen Dry: Eastern Michigan U. <hdry at emunix.emich.edu> (On Leave)
            T. Daniel Seely: Eastern Michigan U. <dseely at emunix.emich.edu>
 
Associate Editors: Ljuba Veselinova <lveselin at emunix.emich.edu>
                   Ann Dizdar <dizdar at unix.tamu.edu>
Assistant Editor:  Sue Robinson <robinson at emunix.emich.edu>
Technical Editor:  Ron Reck <rreck at emunix.emich.edu>
 
Software development: John H. Remmers <remmers at emunix.emich.edu>
 
Editor for this issue: robinson at emunix.emich.edu (Susan Robinson)
 
---------------------------------Directory-----------------------------------
1)
Date:  Thu, 21 Nov 1996 08:29:38 +0800
From:  aclynes at ubd.edu.bn (Adrian Clynes)
Subject:  Summary, corpus query
 
---------------------------------Messages------------------------------------
 
 
Here are edited responses to a query about corpus design for
beginners, posted to the List on 2 November.  Many thanks to the
following for their time and suggestions:
Imran Ho <imran.ho at stonebow.otago.ac.nz>,
Ellen Gurman Bard <ellen at ling.ed.ac.uk>,
Claire Warwick <claire.warwick at computing-services.oxford.ac.uk>, and
Michael Barlow <barlow at ruf.rice.edu>.
 
1) From: Imran Ho <imran.ho at stonebow.otago.ac.nz>
 
1. I am sure you are aware of the Corpora list (ICAME), which carries
specific discussion of corpus linguistics. They also have a list of
software which might be of interest to your colleagues at
UBD. Altenberg's bibliography is a good place for references and is
available from the same site.
2. I am currently compiling a corpus of written Malaysian English,
following the organisation of the LOB/Brown corpora and the Wellington
Corpus of NZ English. I use the Oxford Concordance Program to extract
the information I need from the Malaysian English corpus.  I have also
tried MonoConc (available on both Mac and PC) [AC: see Michael
Barlow's response below], and it seems very user-friendly.
3. For tagging, try the Birmingham Tagger (via e-mail). With a
learners' corpus, however, beware: the tagger has an accuracy of
around 80% (my guess, based on the tagging I have done), so a lot of
editing is needed.
4. Hardware is not really a problem. For my corpus of newspaper text,
most of the texts were already in electronic form and only needed to
be downloaded. The rest were scanned using Calera WordScan. The
storage space for 44 texts of 2,000 words is around 612K, so 500 texts
would need roughly 7 MB of disk space. I store my documents in plain
text format (ASCII).
5. One stage of corpus development needs particularly careful thought:
the referencing and mark-up of the texts.

Imran
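
The figures in points 3 and 4 above can be checked with a few lines of
arithmetic. What follows is only a rough sketch in Python: the function
names are made up for illustration, the 80% accuracy is the respondent's
own guess, and the size estimate assumes that storage grows linearly
with the number of texts.

```python
def estimate_corpus_size_kb(sample_texts, sample_size_kb, target_texts):
    """Scale a measured sample size linearly to a larger number of texts."""
    kb_per_text = sample_size_kb / sample_texts
    return kb_per_text * target_texts


def expected_tag_corrections(total_tokens, tagger_accuracy):
    """Estimate how many tags need hand-correction at a given accuracy."""
    return round(total_tokens * (1 - tagger_accuracy))


# 44 texts of 2,000 words occupy about 612K (figure from the response).
size_kb = estimate_corpus_size_kb(sample_texts=44, sample_size_kb=612,
                                  target_texts=500)
print(f"{size_kb:.0f} KB, about {size_kb / 1024:.1f} MB")
# prints: 6955 KB, about 6.8 MB

# 500 texts of 2,000 words each, tagged at a guessed accuracy of 80%.
corrections = expected_tag_corrections(total_tokens=500 * 2000,
                                       tagger_accuracy=0.80)
print(corrections)
# prints: 200000
```

On these assumptions, a 500-text corpus fits in under 7 MB, while the
hand-correction of tags, not storage, is likely to be the larger cost.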
 
 
2) From: Ellen Gurman Bard <ellen at ling.ed.ac.uk>
 
For some examples of design and collection techniques for spoken
corpora, you might want to have a look at:
 
Anderson, A. H., Bader, M., Bard, E. G., Boyle, E., Doherty, G.,
Garrod, S., Isard, S., Kowtko, J., McAllister, J. M., Miller, J.,
Sotillo, C., Thompson, H., Weinert, R. (1991). The HCRC Map Task Corpus.
LANGUAGE AND SPEECH, 34(4), 351-66.
 
Bard, E. G., Sotillo, C. F., Anderson, A. H., and Taylor, M. M. (in
press). The DCIEM Map Task Corpus: Spontaneous Dialogue under Sleep
Deprivation and Drug Treatment. SPEECH COMMUNICATION.
 
or (1996) PROCEEDINGS OF INTERNATIONAL CONFERENCE ON SPEECH
AND LANGUAGE PROCESSING
 
 
3) From: Claire Warwick <claire.warwick at computing-services.oxford.ac.uk>
 
You may like to look at the web page for the British National Corpus,
at http://info.ox.ac.uk/bnc. It should provide you with some of the
information that you need.
 
4) From: Michael Barlow <barlow at ruf.rice.edu>
 
You might look at my corpus linguistics page:
http://www.ruf.rice.edu/~barlow/corpus.html
 
I have developed a couple of concordance programs. MonoConc for
Windows is a commercial program published by Athelstan (my company).
You can download a demo from http://www.nol.net/athel.html.
A HyperTalk-based Mac concordancer (MonoConc) can be downloaded from:
http://www.ruf.rice.edu/~barlow/mono.html.
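
For beginners who have not yet seen what a concordancer does, the core
idea is the keyword-in-context (KWIC) display: every occurrence of a
search word is listed with a few words of context on either side. The
following is a minimal sketch in Python, not a description of how
MonoConc or any other program mentioned here works internally. The
function name, the simple word tokenization, and the context width are
all assumptions made for illustration.

```python
import re


def kwic(text, keyword, width=3):
    """List each occurrence of keyword with `width` words of context per side."""
    # Lowercase the text and split it into word tokens (a deliberately crude tokenizer).
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


sample = "The corpus is small. A small corpus still needs careful design."
for line in kwic(sample, "corpus"):
    print(line)
# prints:
# the [corpus] is small a
# small a small [corpus] still needs careful
```

A real concordancer adds sorting by left or right context, frequency
counts, and support for tagged text, but the display above is the basic
output to expect.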
 
 
Again, my thanks to those who responded for their time and suggestions.
 
 
 
Adrian Clynes
aclynes at ubd.edu.bn
Dept of English & Applied Linguistics
Universiti Brunei Darussalam, Brunei