[Corpora-List] corpus syntax (and how we can use it to code meaning)

Rob Freeman lists at chaoticlanguage.com
Tue Sep 18 10:31:07 UTC 2007


On 9/18/07, Eric Atwell <eric at comp.leeds.ac.uk> wrote:
>
> On Tue, 18 Sep 2007, Rob Freeman wrote:
>
> > ..., we might use the context about a word or phrase to
> > select, ad-hoc, a class of words or phrases which are similar to that
> > word or phrase (in that context.) ...  we can use these true/not
> > true distinctions to select both syntax, and meaning, specific to
> > context, in ways we have not been able to up to now.
>
> This suggests that corpus linguists should be interested in clustering
> or unsupervised machine learning of words into classes according to
> shared contexts; but they have been investigating this for some time,
> see e.g. papers in Proceedings of ICAME'86, EACL'87.
> The main difference between then and now is compute power: we can now
> use more sophisticated clustering algorithms, and cluster according to
> more complex context patterns, e.g. Roberts et al in Corpora, vol. 1,
> pp. 39-57. 2006.


Yes, people have been clustering words into classes according to shared
contexts for some time.

The point here is the idea that they need to cluster them into a different
class for each context in which they occur.

It is the goal of machine learning (viz. a complete grammar) which I am
suggesting needs to change, not the methods.

I think computational linguistics will get good results as soon as it stops
looking for global generalizations and starts clustering ad-hoc, according
to context.
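
As a minimal sketch of what clustering "ad-hoc, according to context" might
look like, the Python below selects a class for a word from the slot it
occupies rather than from a global clustering. The toy corpus, the
one-word-either-side "frame", and the function names are all assumptions made
for illustration; they are not taken from any of the work cited above.

```python
from collections import defaultdict

# Toy corpus (an assumption, for illustration only).
CORPUS = [
    "the bank approved the loan",
    "the firm approved the loan",
    "the bank of the river",
    "the shore of the river",
]

def context_index(sentences):
    """Map each (left, right) neighbour frame to the set of words seen in it."""
    frames = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            frames[(tokens[i - 1], tokens[i + 1])].add(tokens[i])
    return frames

def adhoc_class(frames, left, right):
    """Return the class of words attested in this particular context."""
    return frames.get((left, right), set())

frames = context_index(CORPUS)

# The same word falls into a different class in each context.
print(sorted(adhoc_class(frames, "the", "approved")))  # ['bank', 'firm']
print(sorted(adhoc_class(frames, "the", "of")))        # ['bank', 'shore']
```

Here "bank" has no single class: it groups with "firm" in one slot and with
"shore" in another, and neither grouping is computed until the context is
given.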

> But my impression is that most Corpus Linguists are not really that
> interested in unsupervised Machine Learning, i.e. letting the computer
> work out the grammar/semantics "from scratch"; they prefer to examine and
> analyse the corpus data "by hand" to select examples to back up their
> own theories...


Whether they are working "by hand" or not, people are not used to thinking
of syntax as ad-hoc generalization according to shared contexts. I'm
suggesting this idea needs to be taken out of machine learning (where it has
only been seen as a means to find "grammar" anyway, and not a principle of
syntax in its own right) and given a broader airing as a principle of syntax
on its own merits.

It might explain why MWEs tend to have the same "slot fillers", for
instance. Detailed analyses of what slot fillers can occur in a given MWE
could be done on the basis of what other contexts two words share and do not
share.
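
A rough sketch of that kind of comparison, again in Python and again under
assumptions of my own (a toy corpus, contexts reduced to one word either
side, hypothetical function names), would list the contexts two candidate
slot fillers share and the ones they do not:

```python
from collections import defaultdict

# Toy corpus (an assumption, for illustration only).
CORPUS = [
    "a strong argument",
    "a powerful argument",
    "some strong tea",
    "a powerful engine",
]

def contexts_by_word(sentences):
    """Map each word to the set of (left, right) frames it occurs in."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return contexts

def compare_contexts(contexts, word_a, word_b):
    """Split the contexts of two words into shared and unshared sets."""
    a = contexts.get(word_a, set())
    b = contexts.get(word_b, set())
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}

contexts = contexts_by_word(CORPUS)
result = compare_contexts(contexts, "strong", "powerful")

print(sorted(result["shared"]))  # [('a', 'argument')]
print(sorted(result["only_a"]))  # [('some', 'tea')]
print(sorted(result["only_b"]))  # [('a', 'engine')]
```

On this toy data the two words are interchangeable before "argument" but not
before "tea" or "engine", which is the sort of distinction a slot-filler
analysis would need to capture.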

Corpus analysis currently tends to be done in terms of the lexicon: what
units are repeated, and how often. Corpus-style syntactic analyses could be
done on the basis of what words share what contexts, and how well this
predicts the range of combinations they participate in, how MWEs change over
time etc.

-Rob