[Corpora-List] Keyness across Texts (fwd)

Sat Aug 4 18:04:44 UTC 2007

---------- Forwarded message ----------
Date: Thu, 12 Jul 2007 20:13:32 +0200
From: Przemek Kaszubski <przemka at amu.edu.pl>
To: John F. Sowa <sowa at bestweb.net>
Cc: corpora at uib.no
Subject: Re: [Corpora-List] Keyness across Texts

Hello,

John F. Sowa wrote (2007-07-10 19:03):
>
> The fundamental issue about any version of "keyness" is the
> definition and the algorithms that implement the definition.
Agreed, of course. I also think the definition selected will (should?)
reflect the linguistic purpose of one's investigation and the
homo-/hetero-geneity of the texts/corpora used. My own interest is not
so much in the key words, or key clusters, alone as in using their
evidence as a window onto a more focused (lexical, semantic, textual,
functional etc.) characterisation of a specialised text collection /
corpus. All this in order to optimise a user's concordancing effort when
learning a specialist text-type/genre.

> The simple definition in terms of frequency counts is the most
> widely used because it can be implemented in simple algorithms.
> But even then, questions arise about lemmata:  are the algorithms
> counting words or lemmata?  And how do the algorithms deal with
> lemmata that are lexicalized in different parts of speech?
Currently I am trying to follow the radically lexicalist assumption
(following Sinclair, Hoey, and other followers of the British
'corpus-driven' school) of withholding lemmatisation until having
inspected individual wordform behaviour. I find the textual/corpus
evidence I have seen and read compelling enough, despite some persistent
psycholinguistic claims that preserve lemmas/lexemes as chief
organisational units of the (mental) lexicon. My reason for
(temporarily?) favouring relatively shallow and simple algorithms is
that I want learners/students to do some of the essential noticing,
pattern discernment etc. - mine is thus mostly a language
acquisition-driven motivation.

>
> Clusters can provide a more precise way of defining keyness,
> but the number of variations of clustering algorithms is
> immense, and each one defines a different version of keyness.
This is a most interesting area, indeed. I have not tried this, nor do I
have the expertise to approach it statistically or formally at present. What
I have been doing is adding *key* cluster analyses to the regular
single-word key word analyses. Interestingly, there does not appear to
be much work done in this area, contrary to the popularity of 'simple'
high-frequency cluster/bundle studies. I have noticed that key clusters
that come top of the keyness list (with LL as measure) are different
from those which are simply the most frequent ones. I am aware that such
profiles will obviously depend on the relations between the 'main',
experimental corpus and the reference, 'control' corpus. But that is not
a problem for me at this point.
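
The key-cluster comparison described above can be sketched roughly as
follows. This is only an illustration, not the procedure actually used:
it assumes pre-tokenised, unlemmatised wordforms, fixed-length clusters
(n-grams), and the two-cell log-likelihood (LL) formula commonly used for
corpus comparison. The names ngrams, log_likelihood and key_clusters, and
the parameters n and min_freq, are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-word clusters as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def log_likelihood(a, b, c, d):
    """Two-cell log-likelihood for one item.

    a, b: frequency of the item in the study / reference corpus
    c, d: total number of items counted in the study / reference corpus
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_clusters(study_tokens, ref_tokens, n=3, min_freq=2):
    """Rank clusters that are relatively more frequent in the study corpus."""
    study = Counter(ngrams(study_tokens, n))
    ref = Counter(ngrams(ref_tokens, n))
    c, d = sum(study.values()), sum(ref.values())
    if c == 0 or d == 0:
        raise ValueError("both corpora must contain at least n tokens")
    rows = []
    for cluster, a in study.items():
        b = ref[cluster]  # Counter returns 0 for unseen clusters
        # Keep positive keyness only (assumption: negative keys not wanted).
        if a < min_freq or a / c <= b / d:
            continue
        rows.append((cluster, a, log_likelihood(a, b, c, d)))
    # Highest LL first; this order can differ from a raw frequency ranking.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Sorting the same rows by raw frequency instead of by LL makes it easy to
see where the two rankings diverge for a given reference corpus.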

> The next step is to apply syntactic and/or semantic techniques
> to determine how the words/lemmata are related.  Then the
> syntactic and/or semantic structures could be used as input
> to the counting and/or clustering methods.
This is something I would not like to do at this stage, certainly not
automatically, for the same applied reasons plus, alas again, due to my
ignorance. As demonstrated by Hoey in his lexical priming theory,
semantic association can be a powerful generator of syntagmatic strings,
but only some of them will be natural choices in a given genre/domain,
which is my chief scope of description. The same, of course, goes for
syntactic structures, some of which are semantically motivated, while
others are not. Pattern grammar might be helpful, but I have not had the
chance to match it against my key-cluster data. A set of 'local' pattern
grammars would be closer to home, and what I can do at this point is to
try to infer them from the literature, so I can then use this as
reference for discussing the keyness data I get.

>
> And of course, you could also apply an ontology to relate the
> words and/or lemmata.  In fact, you might even use the technology
> to extract a document-specific ontology from the texts. And then
> one could use that ontology for analyzing other documents.
Yes, on condition that my goal is merely 'aboutness', as far as I can
understand your point. Aboutness is not my only, nor perhaps my chief,
goal, though.

>
> In short, there is no clear distinction between keyness and
> any other issues of semantics.  There is nothing wrong with
> using simple, special-purpose techniques for addressing a
> particular problem, but it is important to recognize their
> limitations and their relationships to broader semantic issues.
Very much so. I am aware of the limitations of simple computations. On
the other hand, I can't help being slightly mistrustful of more
advanced, 'black-box' methods which produce results with good face
validity but unknown (to me) precision/recall. Plus there is the applied
issue I have mentioned. Also, precisely because my orientation is
applied, I must cast my net wider. I am pursuing not only semantic
relations and sets, but also patterns affecting 'grammatical', bleached,
delexicalised etc. words which may have become so in the specific
context/domain/genre. And I do not hope to be able to pursue all of
these patterns, just the most characteristic (key?) cases and groups,
from which other users can learn and continue with their own investigations.

Thank you very much for your insightful comments. Always time to broaden
horizons.

Przemek

--
Dr Przemyslaw Kaszubski
+48 61 8293515

PICLE EAP LEARNER CORPUS ONLINE:
http://www.staff.amu.edu.pl/~przemka/picle.html

CORPUS LINGUISTICS BIBLIOGRAPHY:
http://www.staff.amu.edu.pl/~przemka

MY CORPUS LINGUISTICS SEMINARS:
http://www.staff.amu.edu.pl/~przemka/seminars.htm

EAP WRITING PAGE (IFA FULL-TIME PROGRAMME):
http://www.staff.amu.edu.pl/~przemka/IFA_writing

=======================================
School of English (IFA)
Adam Mickiewicz University
http://ifa.amu.edu.pl
=======================================
