<HTML><BODY style="word-wrap: break-word; -khtml-nbsp-mode: space; -khtml-line-break: after-white-space; ">Rob,<DIV>[One feels nervous about continuing this now--but you are raising new and important issues]</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>I think you may misunderstand what machine learning can do, though of course it all depends on what you mean by learning/generalizing from the </DIV><DIV>"same context". Modesty normally forbids me citing, say, <A href="http://citeseer.ist.psu.edu/stevenson01interaction.html">http://citeseer.ist.psu.edu/stevenson01interaction.html</A>, where Stevenson and I combined learners</DIV><DIV>for word-sense disambiguation over quite a large corpus (The Red Badge of Courage), and one way of interpreting what the learner was doing (and it is something some would find distasteful) is that it was learning, for each sense of each content word, the contexts and criteria that would disambiguate it. There are other bits of contemporary</DIV><DIV>and later work that could also be described that way (and this was not at all simple unsupervised learning, either).</DIV><DIV>Best</DIV><DIV>Yorick</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV>PS On the ongoing meta-issues, I fear the last paragraph of Eric Atwell's message is very insightful as to what is really going on here, under cloaks of </DIV><DIV>"private fights", "abstract discussions", "separate lists", etc.:</DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">"But my impression is that most Corpus Linguists are not really that</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">interested in unsupervised Machine Learning, i.e. 
letting the computer</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">work out the grammar/semantics "from scratch"; they prefer to examine and</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">analyse the corpus data "by hand" to select examples to back up their</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">own theories..."</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">I have a hunch most Corpus Linguists are not much interested in computation in general, except as a secretarial/editing/retrieval tool, but they have to pay lip service to it.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Paradoxically, I think, it is CL/NLP researchers who actually "trust the text", in that they are experimenters who, by definition, don't know what the results of computation/</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">experiment will be. Many Corpus Linguists, I suspect (though there are honourable exceptions), know exactly where they are going and are as dependent on intuition and judgement</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">as the Chomskyans, whom they still affect to criticize, for reasons not altogether clear to me. I have an ongoing struggle with a distinguished lexicographer friend and colleague, who uses sophisticated KWIC indices to display contexts of a word, which he then classifies by intuition. 
Suggestions as to how this last stage might be automated (and I have made many over the years) are never well received, and I have stopped making them.</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><BR class="khtml-block-placeholder"></DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV><BR class="khtml-block-placeholder"></DIV><DIV><BR><DIV><DIV>On 18 Sep 2007, at 11:31, Rob Freeman wrote:</DIV><BR class="Apple-interchange-newline"><BLOCKQUOTE type="cite">On 9/18/07, <B class="gmail_sendername">Eric Atwell</B> <<A href="mailto:eric@comp.leeds.ac.uk">eric@comp.leeds.ac.uk</A>> wrote:<DIV><SPAN class="gmail_quote"></SPAN><BLOCKQUOTE class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"> On Tue, 18 Sep 2007, Rob Freeman wrote:<BR><BR>> ..., we might use the context about a word or phrase to<BR>> select, ad-hoc, a class of words or phrases which are similar to that word or<BR>> phrase (in that context.) ... we can use these true/not <BR>> true distinctions to select both syntax, and meaning, specific to context,<BR>> in ways we have not been able to up to now.<BR><BR>This suggests that corpus linguists should be interested in clustering<BR>or unsupervised machine learning of words into classes according to <BR>shared contexts; but they have been investigating this for some time,<BR>see e.g. papers in Proceedings of ICAME'86, EACL'87.<BR>The main difference between then and now is compute power: we can now<BR>use more sophisticated clustering algorithms, and cluster according to <BR>more complex context patterns, e.g. Roberts et al. in Corpora, vol. 1,<BR>pp. 39-57, 
2006.</BLOCKQUOTE><DIV><BR>Yes, people have been clustering words into classes according to shared contexts for some time.<BR><BR>The point here is the idea that they need to cluster them into a different class for each context in which they occur. <BR><BR>It is the goals of machine learning (viz. a complete grammar) which I am suggesting need to change, not the methods.<BR><BR>I think computational linguistics will get good results as soon as it stops looking for global generalizations and clusters ad-hoc, according to context. <BR></DIV><BR><BLOCKQUOTE class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">But my impression is that most Corpus Linguists are not really that<BR>interested in unsupervised Machine Learning, i.e. letting the computer<BR>work out the grammar/semantics "from scratch"; they prefer to examine and<BR>analyse the corpus data "by hand" to select examples to back up their<BR>own theories...</BLOCKQUOTE> <DIV><BR>Whether they are working "by hand" or not, people are not used to thinking of syntax as ad-hoc generalization according to shared contexts. I'm suggesting this idea needs to be taken out of machine learning (where it has only been seen as a means to find "grammar" anyway, and not a principle of syntax in its own right) and given a broader airing as a principle of syntax on its own merits. <BR><BR>It might explain why multi-word expressions (MWEs) tend to have the same "slot fillers", for instance. Detailed analyses of what slot fillers can occur in a given MWE could be done on the basis of what other contexts two words share and do not share. <BR><BR>Corpus analysis currently tends to be done in terms of the lexicon: what units are repeated, and how often. Corpus-style syntactic analyses could be done on the basis of what words share what contexts, and how well this predicts the range of combinations they participate in, how MWEs change over time, etc. 
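[As a purely illustrative sketch, not anything proposed in this thread: the "different class for each context" idea can be made concrete by letting each local context define its own ad-hoc word class, namely the set of words attested in that context. The toy corpus, function name, and single-word context window below are all my own assumptions.]

```python
# Illustrative sketch: one "ad-hoc" word class per context, rather than
# a single global clustering. The class for a context is simply the set
# of words attested in that context. Toy corpus and names are invented.
from collections import defaultdict

def context_classes(sentences):
    """Map each (left, right) neighbour pair to the set of words seen between them."""
    classes = defaultdict(set)
    for sent in sentences:
        toks = ["<s>"] + sent.lower().split() + ["</s>"]
        for i in range(1, len(toks) - 1):
            classes[(toks[i - 1], toks[i + 1])].add(toks[i])
    return classes

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat slept on the mat",
]
classes = context_classes(corpus)

# The ad-hoc class for the context "the _ sat": the words substitutable there.
print(sorted(classes[("the", "sat")]))  # → ['cat', 'dog']
```

Nothing here is a global generalization: "cat" and "dog" are classed together only relative to the context "the _ sat", and a richer context pattern would yield a different, equally local class.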
<BR><BR>-Rob<BR></DIV></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">_______________________________________________</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; ">Corpora mailing list</DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><A href="mailto:Corpora@uib.no">Corpora@uib.no</A></DIV><DIV style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><A href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</A></DIV> </BLOCKQUOTE></DIV><BR></DIV></BODY></HTML>