[Corpora-List] corpus syntax (and how we can use it to code meaning)

Khurshid Ahmad kahmad at cs.tcd.ie
Tue Sep 18 11:58:01 UTC 2007


Yorick
Modesty and nervousness initially stopped me from replying to your
thoughts on 'machine learning', and unsupervised learning, from corpora.
But these are risk-averse times.

We have carried out a number of experiments on the automatic
categorisation of texts in a corpus, using Kohonen Feature Maps, also
known as Self-Organising Maps (SOMs).  The results were encouraging in
that our systems learnt to classify newspaper stories (Reuters) and
articles in specialist journals.  The point at which
introversion/intuition came in was when we had to choose a 'feature
vector' to train our large SOM.

Pensiri Manomaisupat, Bogdan Vrusias & Khurshid Ahmad, Categorization of
Large Text Collections: Feature selection for unsupervised and supervised
neural networks, 7th Int. Data Engineering and Automated Learning Conf.
(Lecture Notes in Computer Science - LNCS 4224), Burgos, Spain, 20th-23rd
September, edited by E. Corchado, H. Yin, V. Botti & C. Fyfe,
Springer-Verlag, 2006, pp. 1003-1013.
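
To make the feature-vector choice concrete, here is a minimal sketch, and
not the configuration used in the paper cited above.  Each text is reduced
to a vector of relative term frequencies over a small hand-picked
vocabulary (an assumption made purely for illustration), and a tiny SOM is
trained on those vectors.  The function names, the 2x2 grid, the toy
documents and the learning-rate schedule are all hypothetical choices.

```python
import math
import random


def term_frequency_vector(text, vocabulary):
    """Relative frequency of each vocabulary term in the text."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    return [tokens.count(term) / total for term in vocabulary]


def best_matching_unit(weights, vec):
    """Grid position of the node whose weight vector is closest to vec."""
    return min(weights, key=lambda node: math.dist(weights[node], vec))


def train_som(vectors, rows=2, cols=2, epochs=200, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs              # decays from 1 towards 0
        rate = 0.5 * frac + 0.01                 # learning rate
        radius = max(rows, cols) / 2.0 * frac + 0.5  # neighbourhood width
        for vec in vectors:
            bmu = best_matching_unit(weights, vec)
            for node, w in weights.items():
                grid_dist = math.dist(node, bmu)
                influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
                # Pull this node towards the input, more strongly near the BMU.
                for i in range(dim):
                    w[i] += rate * influence * (vec[i] - w[i])
    return weights


# Toy documents and vocabulary, invented for illustration only.
docs = [
    "shares fell as the bank raised interest rates",
    "the bank cut rates and shares rose",
    "the striker scored a late goal in the match",
    "the goal decided the match for the home team",
]
vocabulary = ["bank", "rates", "shares", "goal", "match"]
vectors = [term_frequency_vector(d, vocabulary) for d in docs]
som = train_som(vectors)
for doc, vec in zip(docs, vectors):
    print(best_matching_unit(som, vec), "<-", doc)
```

The vocabulary list is exactly where judgement enters in this sketch: a
different list of terms gives different vectors and hence a different map.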

> Rob
> [One feels nervous about continuing this now--but you are raising new
> and important issues]
>
> I think you may misunderstand what machine learning can do, though
> of course it all depends on what you mean by learning/generalizing
> from the "same context". Modesty normally forbids me citing, say,
> http://citeseer.ist.psu.edu/stevenson01interaction.html, where
> Stevenson and I combined learners for word-sense disambiguation over
> quite a large corpus (The Red Badge of Courage), and one way of
> interpreting what the learner was doing (and it is something some
> would find distasteful) is that it was learning, for each sense of
> each content word, what were the contexts and criteria that would
> disambiguate it. There are other bits of contemporary and later work
> that could also be described that way (and this was not at all
> simple unsupervised learning, either).
> Best
> Yorick
>
> PS On the on-going meta-issues, I fear the last paragraph of Eric
> Atwell's message is very insightful as to what is really going on
> here, under cloaks of "private fights", "abstract discussions",
> "separate lists" etc.:
>
> "But my impression is that most Corpus Linguists are not really that
> interested in unsupervised Machine Learning, i.e. letting the computer
> work out the grammar/semantics "from scratch"; they prefer to examine
> and analyse the corpus data "by hand" to select examples to back up
> their own theories..."
>
> I have a hunch most Corpus Linguists are not interested much in
> computation in general, except as a secretarial/editing/retrieval
> tool, but they have to pay lip service to it.
> Paradoxically, I think, it is CL/NLP researchers who actually "trust
> the text", in that they are experimenters who, by definition, don't
> know what the results of computation/experiment will be. Many Corpus
> Linguists, I suspect, and there are honourable exceptions, know
> exactly where they are going and are as dependent on intuition and
> judgement as Chomskyans, whom they still affect to criticize, and for
> reasons not altogether clear to me.  I have an on-going struggle with
> a distinguished lexicographer friend and colleague, who uses
> sophisticated KWIC indices to display contexts of a word, which he
> then classifies by intuition. Suggestions as to how this last stage
> could be automated, and I have made many over the years, are never
> well received and I have stopped.
>
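
A keyword-in-context (KWIC) display of the kind mentioned above can be
sketched as follows.  The grouping step is only a crude stand-in for
automating the classification stage: it groups concordance lines by the
word immediately to the left of the keyword.  That heuristic, the function
names kwic and group_by_left_neighbour, and the sample text are all
assumptions made for illustration, not a method proposed in this thread.

```python
from collections import defaultdict


def kwic(text, keyword, width=3):
    """Return (left context, keyword token, right context) for each hit."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if tok.lower().strip(".,;:!?\"'") == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def group_by_left_neighbour(lines):
    """Group concordance lines by the word just before the keyword."""
    groups = defaultdict(list)
    for left, key, right in lines:
        neighbour = left.split()[-1].lower() if left else ""
        groups[neighbour].append((left, key, right))
    return dict(groups)


# Sample text invented for illustration.
sample = "The bank raised rates. She sat on the river bank today."
concordance = kwic(sample, "bank", width=2)
for left, key, right in concordance:
    print(f"{left:>12} [{key}] {right}")
print(sorted(group_by_left_neighbour(concordance)))
```

Any real replacement for the lexicographer's judgement would need a much
richer notion of context similarity than a single neighbouring word.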
> On 18 Sep 2007, at 11:31, Rob Freeman wrote:
>
>> On 9/18/07, Eric Atwell <eric at comp.leeds.ac.uk> wrote:
>> On Tue, 18 Sep 2007, Rob Freeman wrote:
>>
>> > ..., we might use the context about a word or phrase to
>> > select, ad-hoc, a class of words or phrases which are similar to
>> > that word or phrase (in that context.) ...  we can use these
>> > true/not true distinctions to select both syntax, and meaning,
>> > specific to context, in ways we have not been able to up to now.
>>
>> This suggests that corpus linguists should be interested in clustering
>> or unsupervised machine learning of words into classes according to
>> shared contexts; but they have been investigating this for some time,
>> see e.g. papers in Proceedings of ICAME'86, EACL'87.
>> The main difference between then and now is compute power: we can now
>> use more sophisticated clustering algorithms, and cluster according to
>> more complex context patterns, e.g. Roberts et al in Corpora, vol. 1,
>> pp. 39-57. 2006.
>>
>> Yes, people have been clustering words into classes according to
>> shared contexts for some time.
>>
>> The point here is the idea that they need to cluster them into a
>> different class for each context in which they occur.
>>
>> It is the goals of machine learning which I am suggesting need to
>> change (viz. a complete grammar), not the methods.
>>
>> I think computational linguistics will get good results as soon as
>> it stops looking for global generalizations and clusters ad-hoc,
>> according to context.
>>
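
One way to read the suggestion above, as a minimal sketch only: rather
than assigning each word to one global class, collect for every
(left neighbour, right neighbour) context the set of words attested in it,
so that the same word falls into different classes in different contexts.
The function names context_index and ad_hoc_class, the one-word window and
the toy sentences are assumptions made for illustration.

```python
from collections import defaultdict


def context_index(sentences):
    """Map each (left word, right word) context to the words seen in it."""
    index = defaultdict(set)
    for sentence in sentences:
        # Pad with boundary markers so edge words also have two neighbours.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            index[(tokens[i - 1], tokens[i + 1])].add(tokens[i])
    return index


def ad_hoc_class(sentences, left, right):
    """Words attested between left and right: a class for this context only."""
    return context_index(sentences).get((left, right), set())


# Toy corpus invented for illustration.
sentences = [
    "the strong tea was served",
    "the strong coffee was served",
    "the powerful engine was loud",
    "the strong engine was loud",
    "the powerful computer was loud",
]
print(sorted(ad_hoc_class(sentences, "the", "engine")))
print(sorted(ad_hoc_class(sentences, "the", "tea")))
```

In this toy corpus "strong" and "powerful" fall into the same class before
"engine" but not before "tea", which is the kind of context-specific
slot-filler behaviour discussed for MWEs further down.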
>> But my impression is that most Corpus Linguists are not really that
>> interested in unsupervised Machine Learning, i.e. letting the computer
>> work out the grammar/semantics "from scratch"; they prefer to
>> examine and analyse the corpus data "by hand" to select examples to
>> back up their own theories...
>>
>> Whether they are working "by hand" or not, people are not used to
>> thinking of syntax as ad-hoc generalization according to shared
>> contexts. I'm suggesting this idea needs to be taken out of machine
>> learning (where it has only been seen as a means to find "grammar"
>> anyway, and not a principle of syntax in its own right) and given a
>> broader airing as a principle of syntax on its own merits.
>>
>> It might explain why MWEs tend to have the same "slot fillers", for
>> instance. Detailed analyses of what slot fillers can occur in a
>> given MWE could be done on the basis of what other contexts two
>> words share and do not share.
>>
>> Corpus analysis currently tends to be done in terms of lexicon:
>> what units are repeated, and how often. Corpus-style syntactic
>> analyses could be done on the basis of what words share what
>> contexts, and how well this predicts the range of combinations they
>> participate in, how MWEs change over time, etc.
>>
>> -Rob
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora


Khurshid Ahmad

Professor of Computer Science
Department of Computer Science
Trinity College,
DUBLIN-2
IRELAND
Phone 00 353 1 896 8429

Web Page: http://people.tcd.ie/kahmad

