[Corpora-List] size of reference corpus
Peter Lam
peterlam at onebb.net
Sat Jun 14 06:51:29 UTC 2003
Thanks very much, Mike.
Your advice does help. I'll go ahead as you advised, i.e., maintain the size
of my specialised corpus, which is double that of the Brown. The two resulting
key word lists look almost the same whether I use the Brown corpus or mine
as the reference corpus: the only difference is that the order of the key
words in one is the reverse of the other, apart from some possible statistical
implications unknown to me.
Thanks for directing me to the locations where I can download the word-lists
of the BNC and the Guardian. But as I'll also be doing collocation and
colligation analyses on the key words, I can only use these word-lists as a
reference, particularly when I need to compare the word-list of my corpus with
those derived from corpora of contemporary English.
In fact, I'll compare "my mouse-mat against all the objects in your room"
and probably also "against a whole lot of other mouse-mats."
By the way, could you direct me to any websites where free lemmatising
and/or tagging software is available?
Best wishes,
Peter
----- Original Message -----
From: "Mike Scott" <mike at lexically.net>
To: "Lam Yuen Wing, Peter" <ywlam at kcrc.com>; <corpora at hd.uib.no>
Cc: <peterlam at onebb.net>
Sent: Friday, June 13, 2003 10:40 PM
Subject: Re: [Corpora-List] Re: size of reference corpus
> Lam Yuen Wing, Peter wrote:
>
> >I'm working on my MA degree project using WordSmith Tools to analyse a
> >specialised corpus. Because of the limited availability of a reference
> >corpus, I can only use the Brown corpus as my reference corpus, which is
> >just half the size of my specialised corpus.
> >
> >Could anyone advise me on the implications of having a reference corpus half
> >the size of the specialised corpus in an analysis using WordSmith Tools?
> If the reason why you need the reference corpus is for processing key-words
> in WordSmith, you will usually need one that's *bigger* than your
> specialised corpus. WS checks to see which is bigger before doing the
> key-words procedure. But it's a bit more complicated than that...
>
> 1. the KeyWords procedure was originally designed to study texts, not
> genres or languages. All the same, it can be used for collections of texts
> and still try to locate lexical items whose frequency is unusual. But there
> will certainly be statistical implications of a non-straightforward nature
> (OK, there are in almost anything but especially with such odd items to
> study as words, which do not distribute themselves at all "normally"). So
> my advice is -- go ahead, but think of it as a method of finding out which
> words may well repay further investigation.
>
> 2. You mightn't need the actual text for your reference corpus but only a
> word-list based on that corpus. You can download a full word-list of BNC
> written (based on 90 million words) and BNC spoken (10 million) from my
> website. You can also download a wordlist based on nearly 100 million words
> of the UK newspaper The Guardian, 1990-94 as I recall, which contains all
> items occurring at least twice. All these are word-lists in WordSmith 3
> format. (I plan to do these again in WS4 format but in any case WS4 comes
> with a conversion tool.)
>
> 3. There are lots of modes of comparison. You can of course study
> individual texts or smaller sets of texts and compare them with Brown one
> by one. You can compare individual texts with the set of all texts in your
> specialised corpus. I think this depends on what your research questions
> are (what you want to find out about the specialised corpus).
>
> 4. The easiest way of thinking about this (to me anyway) is by analogy. If
> you want to find out the characteristics of the mouse-mat in front of you,
> you might compare it against a whole lot of other mouse-mats (discovering
> it's much brighter, say) or against a whole lot of computer stuff in front
> of you (it's not beige), or against all the objects in your room (it's much
> flatter than most).
>
> Hope this helps.
>
> Best wishes -- Mike
>
>
> Mike Scott
>
> Applied English Language Studies Unit
> University of Liverpool
> Liverpool L69 3BX, UK.
>
> Mike.Scott at liv.ac.uk
> http://www.lexically.net
> http://www.liv.ac.uk/~ms2928
>
>
>
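The key-words comparison discussed in this thread can be illustrated with a
small sketch. It assumes a log-likelihood keyness statistic computed from two
plain frequency lists; it is not WordSmith's actual implementation, and the
function names (log_likelihood, keywords) and the min_freq parameter are
hypothetical. It also suggests why swapping the study and reference corpora
gives the same words in reverse order: under this assumption the scores simply
change sign.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one word.

    a, b: frequency of the word in the study and reference corpus.
    c, d: total number of tokens in the study and reference corpus.
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study, reference, min_freq=2):
    """Return (word, signed score) pairs, most 'key' in the study corpus first.

    A positive score means the word is relatively more frequent in the study
    corpus; a negative score means it is relatively more frequent in the
    reference corpus.
    """
    c = sum(study.values())
    d = sum(reference.values())
    rows = []
    for word in set(study) | set(reference):
        a, b = study[word], reference[word]
        if a + b < min_freq:
            continue
        score = log_likelihood(a, b, c, d)
        if a / c < b / d:
            score = -score
        rows.append((word, score))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Toy frequency lists standing in for real word-lists (invented data).
study = Counter("the train the track the signal train".split())
reference = Counter("the cat sat on the mat the end train".split())

forward = keywords(study, reference, min_freq=1)
swapped = keywords(reference, study, min_freq=1)
```

Only the word frequencies and the two corpus totals are needed here, which is
why a word-list can stand in for the full reference text in this kind of
comparison. The relative sizes of the two corpora enter through the expected
frequencies e1 and e2.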