[Corpora-List] Re: size of reference corpus

Mike Scott mike at lexically.net
Fri Jun 13 14:40:01 UTC 2003


Lam Yuen Wing, Peter wrote:

>I'm working on my MA degree project using WordSmith Tools to analyse a
>specialised corpus.  Because of the limited availability of a reference
>corpus, I can only use the Brown corpus as my reference corpus, which is
>just half the size of my specialised corpus.
>
>Could anyone advise me on the implications of having a reference corpus half
>the size of the specialised one in a corpus analysis using WordSmith Tools?

If the reason why you need the reference corpus is for processing key-words
in WordSmith, you will usually need one that's *bigger* than your
specialised corpus. WS checks to see which is bigger before doing the
key-words procedure. But it's a bit more complicated than that...

1. the KeyWords procedure was originally designed to study texts, not
genres or languages. All the same, it can be used on collections of texts,
where it will still try to locate lexical items whose frequency is unusual. But there
will certainly be statistical implications of a non-straightforward nature
(OK, there are in almost anything but especially with such odd items to
study as words, which do not distribute themselves at all "normally"). So
my advice is -- go ahead, but think of it as a method of finding out which
words may well repay further investigation.

2. You mightn't need the actual text for your reference corpus but only a
word-list based on that corpus. You can download a full word-list of BNC
written (based on 90 million words) and BNC spoken (10 million) from my
website. You can also download a wordlist based on nearly 100 million words
of the UK newspaper The Guardian, 1990-94 as I recall, which contains all
items occurring at least twice. All these are word-lists in WordSmith 3
format. (I plan to do these again in WS4 format but in any case WS4 comes
with a conversion tool.)

3. There are lots of modes of comparison. You can of course study
individual texts or smaller sets of texts and compare them with Brown one
by one. You can compare individual texts with the set of all texts in your
specialised corpus. I think this depends on what your research questions
are (what you want to find out about the specialised corpus).

4. The easiest way of thinking about this (to me anyway) is by analogy. If
you want to find out the characteristics of the mouse-mat in front of you,
you might compare it against a whole lot of other mouse-mats (discovering
it's much brighter, say) or against a whole lot of computer stuff in front
of you (it's not beige), or against all the objects in your room (it's much
flatter than most).
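
For anyone who wants to see the arithmetic behind a key-words comparison, here
is a minimal sketch in Python. It is not WordSmith's own code. It assumes the
two-term log-likelihood statistic commonly used in corpus linguistics, and the
names log_likelihood, keywords, min_freq and threshold are made up for
illustration. It also shows why a word-list is enough for the reference side:
only frequencies and the total token count are needed, never the running text.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-term log-likelihood (G2) for one word.

    a = frequency in the study corpus, c = study corpus size in tokens
    b = frequency in the reference corpus, d = reference corpus size in tokens
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(study, reference, study_size=None, reference_size=None,
             min_freq=2, threshold=15.13):
    """Return (word, G2) pairs for words unusually frequent in `study`.

    `study` and `reference` are word -> frequency mappings (word-lists).
    Pass the true corpus sizes when a word-list has been truncated;
    otherwise they are estimated by summing the listed frequencies.
    The default threshold of 15.13 is the usual cut-off for p < 0.0001
    with one degree of freedom; treat it as an assumption, not a rule.
    """
    c = study_size if study_size is not None else sum(study.values())
    d = reference_size if reference_size is not None else sum(reference.values())
    found = []
    for word, a in study.items():
        if a < min_freq:
            continue
        b = reference.get(word, 0)
        # Keep only positively key items: relatively more frequent in the study corpus.
        if a / c <= b / d:
            continue
        score = log_likelihood(a, b, c, d)
        if score >= threshold:
            found.append((word, score))
    return sorted(found, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy word-lists, invented purely for illustration.
    study = Counter({"corpus": 50, "the": 100})
    reference = Counter({"corpus": 5, "the": 200, "cat": 95})
    for word, score in keywords(study, reference):
        print(word, round(score, 1))
```

Because the statistic works on relative frequencies, the sketch runs whichever
corpus is bigger. A small reference corpus just gives noisier expected
frequencies, which is another reason to treat the output as a list of words
that may repay further investigation.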

Hope this helps.

Best wishes -- Mike


Mike Scott

Applied English Language Studies Unit
University of Liverpool
Liverpool L69 3BX, UK.

Mike.Scott at liv.ac.uk
http://www.lexically.net
http://www.liv.ac.uk/~ms2928


