[Corpora-List] Keyness across Texts
Ute Römer
ute.roemer at engsem.uni-hannover.de
Mon Jul 9 15:46:26 UTC 2007
Thanks for this, Przemek! I was getting ready to apologise to Duncan for
including the ebrary link in my posting and for all the off-thread messages
it triggered...
Best... Ute
> -----Original Message-----
> From: owner-corpora at lists.uib.no
> [mailto:owner-corpora at lists.uib.no] On Behalf Of Przemek Kaszubski
> Sent: Monday, July 09, 2007 5:28 PM
> To: Hunter, Duncan
> Subject: Re: [Corpora-List] Keyness across Texts
>
> Hello,
>
> Mike Scott's perspective of keywords is mostly textual, as he
> explains, rather than corpus-oriented. The same goes for
> 'aboutness', as I understand it. I think determining
> 'aboutness' in a corpus will largely depend on how many texts
> you have in your corpus, and how diversified you reckon them
> to be. Based on some empirical fidgeting you may first need
> to figure out the optimum frequency and text range thresholds
> for your keywords. The former is easy to get - WordSmith
> Keywords will show you the corpus frequency for each key
> word; the latter - less so: you might need to generate
> "consistency counts" for all the words in the main (and
> perhaps also reference) corpora and then use a database
> program to match the keyword and consistency lists into a
> single spreadsheet/matrix. You can then filter a list of
> keywords matching your threshold criteria and then,
> empirically again, work out a custom formula for "enhanced"
> keyness, e.g. by combining keyness values with text range by
> dividing one by the other, etc.
>
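The matching-and-filtering step described above can be sketched in a few lines of Python instead of a database program. This is only an illustration under stated assumptions: the word lists, the threshold values and the function name `enhanced_keyness` are all hypothetical, and the weighting used here (keyness multiplied by the share of texts containing the word) is just one arbitrary way of combining the two figures.

```python
# Hypothetical example data; in practice these would be exported from
# a keyword list and a consistency (text range) list.
keyness = {"corpus": 120.5, "keyword": 85.0, "banana": 95.0, "the": 2.1}
corpus_freq = {"corpus": 60, "keyword": 40, "banana": 45, "the": 900}
text_range = {"corpus": 18, "keyword": 15, "banana": 1, "the": 20}
N_TEXTS = 20  # assumed number of texts in the main corpus


def enhanced_keyness(keyness, corpus_freq, text_range, n_texts,
                     min_freq=10, min_range=0.25):
    """Join the lists on the word, drop words below the frequency and
    range thresholds, and weight keyness by the share of texts."""
    rows = []
    for word, score in keyness.items():
        if word not in corpus_freq or word not in text_range:
            continue  # word missing from one of the lists
        range_share = text_range[word] / n_texts
        if corpus_freq[word] < min_freq or range_share < min_range:
            continue  # fails one of the thresholds
        rows.append((word, score, range_share, score * range_share))
    # Highest weighted score first.
    return sorted(rows, key=lambda row: row[3], reverse=True)


ranked = enhanced_keyness(keyness, corpus_freq, text_range, N_TEXTS)
for word, score, share, weighted in ranked:
    print(f"{word:10s} keyness={score:7.2f} range={share:.2f} weighted={weighted:7.2f}")
```

In this toy data "banana" has a high keyness value but occurs in only one text out of twenty, so the range threshold removes it.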
> Text range alone may be an insufficient adjustment, though.
> Words can be spread over a decent % of texts and still show
> extreme burstiness in a minority of texts - you will probably
> want to know this. Therefore, it makes sense to work keyness,
> range *and* a measure of dispersion like standard deviation
> (or variance) of per-text frequency into your formula. Adam
> Kilgarriff once used variance/mean for ranking some of the
> BNC wordlists, which I often resort to as a useful measure of
> a word's 'coreness' in a corpus.
>
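As a rough illustration of the variance/mean ratio mentioned above, the sketch below computes it from a word's per-text frequencies. The function name `variance_to_mean` is hypothetical, and the code assumes the per-text frequencies have already been normalised (e.g. per 1,000 words) if the texts differ in length.

```python
from statistics import mean, pvariance


def variance_to_mean(per_text_freqs):
    """Variance of a word's per-text frequency divided by its mean.
    Low values suggest stable recurrence, high values burstiness."""
    average = mean(per_text_freqs)
    if average == 0:
        return 0.0  # word never occurs; the ratio is undefined, report 0
    return pvariance(per_text_freqs) / average


steady = [5.0, 5.0, 5.0, 5.0]   # same frequency in every text
bursty = [0.0, 0.0, 0.0, 20.0]  # same total, all in one text

print(variance_to_mean(steady))
print(variance_to_mean(bursty))
```

Both example words have the same mean frequency, so a frequency-based keyness value alone would not separate them; the ratio does.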
> Amateurish as they are, I find that some of my keyness
> formulae work nicely -- separating words with a high
> comparative frequency AND stable recurrence in many files
> from those which are also key but behave more erratically. I
> tend to refer to the former as 'technical' as opposed to
> 'topical' words, since I work with pretty homogeneous
> collections. As I said, balancing the keyness index against
> range/variance may be corpus-specific: sometimes you will want
> to tweak your formula to favour keyness (frequency), sometimes
> your slightly higher priority will be range/consistency.
>
> Just like you, I will be happy to read more about (relatively
> uncomplicated) statistical solutions that may have been in use.
>
> Regards,
>
> Przemek
>
>
> Hunter, Duncan wrote (2007-07-09 14:29):
> >
> > Hello Colleagues!
> >
> > A question about key-ness, and key words, in a group of texts
> >
> > I've been mulling over some key-ness statistics for a selection of
> > texts I've been studying and a rather odd question has occurred to me.
> >
> > I've been attempting to discover something of the thematic content or
> > about-ness of a group of texts by using a keywords analysis,
> > comparing the word frequency profile of the selection of texts with a
> > comparative group to derive key-ness (via log-likelihood) stats for
> > each word.
> >
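For readers unfamiliar with the log-likelihood statistic mentioned above, here is a minimal sketch of the two-term form commonly used for keyword comparison. Whether the poster's software uses exactly this variant is an assumption, and the function and parameter names are hypothetical.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Two-term log-likelihood for one word, comparing its frequency
    in a study corpus with its frequency in a reference corpus."""
    total_size = size_study + size_ref
    combined_freq = freq_study + freq_ref
    # Expected frequencies if the word were used at the same rate in both.
    expected_study = size_study * combined_freq / total_size
    expected_ref = size_ref * combined_freq / total_size
    score = 0.0
    if freq_study > 0:  # a zero count contributes nothing to the sum
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


# Hypothetical counts: 50 hits in 1,000 words vs 10 hits in 1,000 words.
print(log_likelihood(50, 10, 1000, 1000))
```

Note that the value depends only on the pooled counts for each corpus, which is why it says nothing about how the word is dispersed across individual texts.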
> > The key-ness value returned by such a procedure can be misleading
> > because of the problem of dispersal; is the word key because it
> > occurs in a lot of text samples in the corpus or because of a very
> > high usage in only a single text or small group of texts?
> >
> > It occurs to me: would it be possible to formulate some kind of
> > measure of a word's overall key-ness in the set of texts we are
> > studying? By multiplying the word's key score by the number
> > of texts in which it is key, for example. Of course the resulting
> > figure in this case would be totally arbitrary in a sense - even in the
> > non-parametric realm of corpus comparison measurement it would not
> > really mean anything beyond its own description...
> >
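The measure proposed above could be sketched as follows. This assumes a keyness score has been computed separately for each text against the reference corpus, and that a word counts as "key" in a text when that score reaches a cut-off; the cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom) and the function name are illustrative choices, not part of the original proposal.

```python
def dispersion_weighted_keyness(overall_keyness, per_text_keyness,
                                threshold=3.84):
    """Multiply a word's overall keyness by the number of texts
    in which its per-text keyness reaches the threshold."""
    texts_where_key = sum(1 for score in per_text_keyness
                          if score >= threshold)
    return overall_keyness * texts_where_key


# Hypothetical words with the same overall keyness of 30.0:
print(dispersion_weighted_keyness(30.0, [10.0, 0.5, 6.2, 1.0]))  # key in two texts
print(dispersion_weighted_keyness(30.0, [40.0, 0.0, 0.0, 0.0]))  # key in one text
```

As the message itself notes, the resulting figure is arbitrary in scale; it is only useful for ranking words within the same set of texts.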
> > However it seems to me useful to have some kind of quantitative means
> > of describing a word's significance across a range of texts in some
> > way. Any ideas? I am a relative 'newbie' in this field, surely this
> > issue has been tackled by somebody else somewhere?!
> >
> > All the best,
> >
> > Duncan Hunter
> >
>
> --
> Dr Przemyslaw Kaszubski
> +48 61 8293515
>
> PICLE EAP LEARNER CORPUS ONLINE:
> http://www.staff.amu.edu.pl/~przemka/picle.html
>
> CORPUS LINGUISTICS BIBLIOGRAPHY:
> http://www.staff.amu.edu.pl/~przemka
>
> MY CORPUS LINGUISTICS SEMINARS:
> http://www.staff.amu.edu.pl/~przemka/seminars.htm
>
> EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
> http://www.staff.amu.edu.pl/~przemka/IFA_writing
>
> =======================================
> School of English (IFA)
> Adam Mickiewicz University
> http://ifa.amu.edu.pl
> =======================================
>
>
>