[Corpora-List] corpus syntax (and how we can use it to code meaning)

Yorick Wilks Yorick at dcs.shef.ac.uk
Tue Sep 18 11:05:28 UTC 2007


Rob
[One feels nervous about continuing this now--but you are raising new  
and important issues]

I think you may misunderstand what machine learning can do, though of
course it all depends on what you mean by learning/generalizing from the
"same context". Modesty normally forbids me citing, say,
http://citeseer.ist.psu.edu/stevenson01interaction.html, where Stevenson
and I combined learners for word-sense disambiguation over quite a large
corpus (The Red Badge of Courage). One way of interpreting what the
learner was doing (and it is something some would find distasteful) is
that it was learning, for each sense of each content word, the contexts
and criteria that would disambiguate it. There are other bits of
contemporary and later work that could also be described that way (and
this was not at all simple unsupervised learning, either).
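
For concreteness, here is a minimal sketch (in Python; the names, the data
layout and the naive-Bayes-style scoring are purely illustrative, not the
system Stevenson and I actually built) of what "one classifier per
ambiguous word, trained on the contexts of its senses" can look like:

    from collections import defaultdict
    import math

    def train_per_word(tagged_examples):
        """tagged_examples: iterable of (target_word, sense_label, context_words)
        triples drawn from a sense-tagged corpus (hypothetical format)."""
        models = defaultdict(lambda: {"sense_counts": defaultdict(int),
                                      "feature_counts": defaultdict(lambda: defaultdict(int))})
        for word, sense, context in tagged_examples:
            m = models[word]
            m["sense_counts"][sense] += 1               # prior evidence for this sense
            for feat in context:
                m["feature_counts"][sense][feat] += 1   # context-word evidence
        return models

    def disambiguate(models, word, context):
        """Pick the sense of `word` whose training contexts best match `context`."""
        m = models.get(word)
        if m is None:
            return None
        best_sense, best_score = None, float("-inf")
        total = sum(m["sense_counts"].values())
        for sense, count in m["sense_counts"].items():
            score = math.log(float(count) / total)
            feats = m["feature_counts"][sense]
            denom = sum(feats.values()) + len(feats) + 1
            for feat in context:
                # add-one smoothing so unseen context words do not zero the score
                score += math.log((feats.get(feat, 0) + 1.0) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

The point is simply that each content word ends up with its own little
model of which contexts pick out which of its senses, which is one way of
reading what the combined learners were doing.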
Best
Yorick

PS On the ongoing meta-issues, I fear the last paragraph of Eric
Atwell's message is very insightful as to what is really going on here,
under cloaks of "private fights", "abstract discussions", "separate
lists", etc.:

"But my impression is that most Corpus Linguists are not really that
interested in unsupervised Machine Learning, i.e. letting the computer
work out the grammar/semantics "from scratch"; they prefer to examine  
and
analyse the corpus data "by hand" to select examples to back up their
own theories..."

I have a hunch that most Corpus Linguists are not much interested in
computation in general, except as a secretarial/editing/retrieval tool,
but they have to pay lip service to it.
Paradoxically, I think, it is CL/NLP researchers who actually "trust
the text", in that they are experimenters who, by definition, don't know
what the results of the computation/experiment will be. Many Corpus
Linguists, I suspect (and there are honourable exceptions), know exactly
where they are going and are as dependent on intuition and judgement as
the Chomskyans, whom they still affect to criticize, for reasons not
altogether clear to me. I have an on-going struggle with a distinguished
lexicographer friend and colleague, who uses sophisticated KWIC indices
to display the contexts of a word, which he then classifies by
intuition. Suggestions as to how this last stage could be automated (and
I have made many over the years) are never well received, and I have
stopped making them.
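
For the record, one of those suggestions is easy to sketch (illustrative
Python again, and emphatically not my colleague's actual workflow): take
the KWIC concordance lines for a node word, represent each line by its
context words, and group lines whose context vocabularies overlap, so
that the intuitive sorting at least starts from machine-proposed classes.

    def jaccard(a, b):
        """Overlap between two sets of context words."""
        a, b = set(a), set(b)
        return len(a & b) / float(len(a | b)) if (a or b) else 0.0

    def cluster_kwic(lines, threshold=0.2):
        """lines: one list of context words per concordance line.
        Greedily assign each line to the first cluster whose seed line it
        resembles; otherwise start a new cluster. The threshold is a guess
        to be tuned by eye."""
        clusters = []   # each cluster: {"seed": context_words, "members": [line indices]}
        for i, ctx in enumerate(lines):
            for c in clusters:
                if jaccard(ctx, c["seed"]) >= threshold:
                    c["members"].append(i)
                    break
            else:
                clusters.append({"seed": list(ctx), "members": [i]})
        return clusters

Nothing sophisticated, of course; the interesting question is whether the
resulting groups bear any relation to the classes the lexicographer draws
by intuition.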





On 18 Sep 2007, at 11:31, Rob Freeman wrote:

> On 9/18/07, Eric Atwell <eric at comp.leeds.ac.uk> wrote:
> On Tue, 18 Sep 2007, Rob Freeman wrote:
>
> > ..., we might use the context about a word or phrase to select, ad-hoc,
> > a class of words or phrases which are similar to that word or phrase
> > (in that context). ... we can use these true/not-true distinctions to
> > select both syntax and meaning, specific to context, in ways we have
> > not been able to up to now.
>
> This suggests that corpus linguists should be interested in clustering
> or unsupervised machine learning of words into classes according to
> shared contexts; but they have been investigating this for some time,
> see e.g. papers in Proceedings of ICAME'86, EACL'87.
> The main difference between then and now is compute power: we can now
> use more sophisticated clustering algorithms, and cluster according to
> more complex context patterns, e.g. Roberts et al in Corpora, vol. 1,
> pp. 39-57. 2006.
>
> Yes, people have been clustering words into classes according to  
> shared contexts for some time.
>
> The point here is the idea that they need to cluster them into a  
> different class for each context in which they occur.
>
> It is the goals of machine learning which I am suggesting need to  
> change (viz. a complete grammar), not the methods.
>
> I think computational linguistics will get good results as soon as  
> it stops looking for global generalizations and clusters ad-hoc,  
> according to context.
>
> But my impression is that most Corpus Linguists are not really that
> interested in unsupervised Machine Learning, i.e. letting the computer
> work out the grammar/semantics "from scratch"; they prefer to  
> examine and
> analyse the corpus data "by hand" to select examples to back up their
> own theories...
>
> Whether they are working "by hand" or not, people are not used to
> thinking of syntax as ad-hoc generalization according to shared
> contexts. I'm suggesting this idea needs to be taken out of machine
> learning (where it has only been seen as a means to find "grammar"
> anyway, and not as a principle of syntax in its own right) and given a
> broader airing as a principle of syntax on its own merits.
>
> It might explain why MWEs tend to have the same "slot fillers", for
> instance. Detailed analyses of which slot fillers can occur in a given
> MWE could be done on the basis of what other contexts two words share
> and do not share.
>
> Corpus analysis currently tends to be done in terms of the lexicon:
> what units are repeated, and how often. Corpus-style syntactic analyses
> could be done on the basis of what words share what contexts, how well
> this predicts the range of combinations they participate in, how MWEs
> change over time, etc.
>
> -Rob
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
