[Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R'--re Louw's endorsement

Linas Vepstas linasvepstas at gmail.com
Sun Aug 17 22:28:43 UTC 2008


2008/8/16 John F. Sowa <sowa at bestweb.net>:
> Wolfgang,
>
> The fact that some approach has been inspired by cognitive theories
> does not disqualify it from being applied to corpora.  And there's
> no reason why you can't mix and match multiple methods of various
> kinds -- logical, analogical, statistical, heuristic, or whatever.
>
>  > A number of responses I have received via the list or in private
>  > suggest that the future will see the integration of corpus
>  > linguistics with cognitive approaches.  I disagree.
>
> I have no idea what you mean by "integration" or why you assume that
> a cognitive approach must be based on introspection:

Hmm. I know what "integration" means, but not "introspection".

Here's a simple example of "integration"; it's so simple that
it hasn't been published, but it's valid anyway:

I have a parser (the link-grammar parser) that I use in an
engineering context (as a stepping stone to something else).
It often produces multiple parses for a given sentence. The
question arises: are some of these "more correct" than others?
Which ones?  To obtain this answer, I compute the mutual
information of word-pairs from a large corpus. I use these
mutual-info scores to rank the quality of different parses,
by assuming (a priori, without experimental support) that the
ones with a higher mutual info are more likely to be correct.
It seems to be quite effective, based on a seat-of-the-pants
evaluation.
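
A minimal sketch of this kind of ranking, in Python, under loudly
stated assumptions: the corpus is a list of tokenized sentences, the
mutual information is pointwise MI over ordered word pairs that
co-occur within a small window, and a parse is represented simply as
the list of word pairs it links. The names pair_mi_table, parse_score
and rank_parses are hypothetical; none of this is the link-grammar
parser's own API, nor necessarily how the scores described above were
computed.

```python
import math
from collections import Counter


def pair_mi_table(sentences, window=5):
    """Pointwise MI (in bits) for ordered word pairs within `window` words.

    Assumption: co-occurrence inside a window stands in for whatever
    pair-counting scheme a real system would use.
    """
    pairs = Counter()   # count of (left word, right word)
    left = Counter()    # how often a word appears as the left member
    right = Counter()   # how often a word appears as the right member
    for words in sentences:
        for i, w in enumerate(words):
            for v in words[i + 1:i + 1 + window]:
                pairs[(w, v)] += 1
                left[w] += 1
                right[v] += 1
    total = sum(pairs.values())
    # MI(w, v) = log2( p(w, v) / (p(w, *) * p(*, v)) )
    return {
        (w, v): math.log2(n * total / (left[w] * right[v]))
        for (w, v), n in pairs.items()
    }


def parse_score(links, mi, unseen=0.0):
    """Sum the MI of every linked word pair; unseen pairs get `unseen`."""
    return sum(mi.get(link, unseen) for link in links)


def rank_parses(parses, mi):
    """Return the parses ordered from highest to lowest total MI."""
    return sorted(parses, key=lambda links: parse_score(links, mi), reverse=True)
```

Calling rank_parses(parses, mi) on the alternative parses of one
sentence puts the parse whose links have the highest total MI first.
Summing over links favors parses with more links; averaging instead
would be an equally defensible choice.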

Anyway, the above is an example of "integration" -- a merger
of statistical, corpus techniques with a parser built up from
old-fashioned "introspective" parse rules.  It's not
particularly pure or grounded in elegant theory,
but it's a good engineering hack that works.

The integration doesn't stop there. I'm trying to use similar
statistical techniques to wed the parser output to word-sense
disambiguation, based on both word-net similarity scores
*and* on mutual info scores.  I also want to attempt reference
resolution across sentences, then feed that back to rank parses,
and perhaps even automatically discover new parse rules.

Again, it's not "elegant" in the classical sense of being a
simple yet encompassing, powerful theory; it's instead
a crazy contraption of gears and levers acting every
which way; but its goal is effectiveness, not research
per se.

The question in my mind is this: at what point will linguistics
feel that it has mined out all that can be gotten from basic,
"simple" approaches (taggers, parsers, statistical
collocations, HMMs, etc.) and start doing research
into combining various different techniques in various ways?
That, for me, is the watershed: when no single technique
is primary, but when the combination of techniques dominates.

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


