[Corpora-List] Bootcamp: 'Quantitative Corpus Linguistics with R'--re Louw's endorsement

Linas Vepstas linasvepstas at gmail.com
Tue Aug 19 22:15:49 UTC 2008


One last, short reply to Geoffrey, and I'll shut up.

2008/8/19 Geoffrey Williams <geoffrey.williams at univ-ubs.fr>:
>
> On the other hand, you might not be interested in corpus linguistics and meaning
> at all if you are primarily concerned with extracting patterns. This would mean
> that you might not need a corpus, in our terms, at all, but a mass of data. This
> will be fine to extract patterns, but will tell you nothing about meaning for
> which the interplay of lexis and syntax in context are essential. This is not a
> problem as many a linguist would be delighted to test and refine the tool once
> it has become available on sourceforge.

To be concrete: I am currently using the link-grammar parser. It consists of
a large number of hand-crafted "rules", or "predicates" or "patterns" that
state which words are allowed to appear with other words in a grammatically
valid sentence. These rules are coarsely dependent on "meaning" - or rather
on part of speech - in that, once a particular pattern is matched, the
part of speech is also identified. On rare occasions, the grammar can also
be a more refined indicator of "meaning", or at least of "dictionary definition":
some dictionary senses of a word can be used only with certain grammatical
patterns.
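To make the flavor of such rules concrete, here is a minimal sketch of link-grammar-style matching: each word carries named "connectors", and two words may link when a rightward connector on the left word matches a leftward connector on the right word. The tiny three-word lexicon and connector names below are invented for illustration; the real dictionaries are vastly richer.

```python
# Invented toy lexicon: each word lists the connectors it needs on its
# left and the connectors it offers to words on its right.
LEXICON = {
    "the":  {"left": [],    "right": ["D"]},  # determiner: links right to a noun
    "cat":  {"left": ["D"], "right": ["S"]},  # noun: needs a determiner, supplies a subject
    "runs": {"left": ["S"], "right": []},     # verb: needs a subject on its left
}

def can_link(left_word, right_word):
    """True if a right-connector of left_word matches a left-connector of right_word."""
    rights = LEXICON[left_word]["right"]
    lefts = LEXICON[right_word]["left"]
    return any(connector in lefts for connector in rights)

print(can_link("the", "cat"))   # D matches: True
print(can_link("cat", "runs"))  # S matches: True
print(can_link("the", "runs"))  # no shared connector: False
```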

My goal (one of many) is to automatically "learn" new patterns, and to correct
and refine existing ones, so that link-grammar can correctly parse a broader
range of sentences. I've hardly even started on this (truthfully, not at all);
I have nothing to show here.

What about "meaning"?  I use link-grammar as input to "RelEx", a
"relation extractor". It uses a number of "patterns", "rules" or "predicates"
to identify subject and object, prepositional relations, etc.  These rules
are also hand-crafted, but perhaps these, too, can be "learned" by some
automatic mechanism.  The output of RelEx is conceptually similar to
that of dependency-grammar parsers, which can "learn" new languages
on their own.  Perhaps I will one day replace RelEx with a dependency
parser, or perhaps make it more dependency-parser-like, or use
dependency-parser techniques to learn new RelEx rules.
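A rough sketch of what such relation-extraction rules do, at their simplest: rewrite labelled links from a parse into named relations. The link labels, rule shapes, and example sentence here are all invented for illustration, not RelEx's actual rule format.

```python
# Invented labelled links from a parse of "the cat chased the mouse".
parse_links = [
    ("S", "cat", "chased"),    # subject link: noun -> verb
    ("O", "chased", "mouse"),  # object link: verb -> noun
]

# Hand-crafted rewrite rules: each maps a link label to a relation triple.
RULES = {
    "S": lambda a, b: ("_subj", b, a),  # S(noun, verb) -> _subj(verb, noun)
    "O": lambda a, b: ("_obj", a, b),   # O(verb, noun) -> _obj(verb, noun)
}

def extract_relations(links):
    """Apply every matching rule to every link, yielding relation triples."""
    return [RULES[label](a, b) for label, a, b in links if label in RULES]

print(extract_relations(parse_links))
# [('_subj', 'chased', 'cat'), ('_obj', 'chased', 'mouse')]
```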

But you hit the nail on the head when you say "[it will be] fine to extract
patterns, but will tell you nothing about meaning": for me, the "meaning" is
how the output of one stage becomes the input to the next. Simply extracting
a pattern is useless, for engineering purposes: I have to know what it "means"
to use it. And by "use it", I mean "feed it to the next stage of processing."
Thus the architecture is pragmatically mentalist: hand-craft a certain set of
core, basic rules/relations. These can be understood, provide an anchor in
"meaning", and define the "output" which later stages use as "input". Once
this core is articulated, the remaining rules can be learned automatically,
or so the hope goes; the newly learned rules generate new, novel output,
presented to the next level, which in turn might learn on seeing
this new input (provided that the learning hews close to the established,
hand-crafted ruleset).

To be concrete: the "next stages of processing", in this example, are the
word-sense disambiguation algorithms, the reference-resolution algorithms,
and the knowledge-base-annotation algorithms. So, if the system parses
a sentence, processes it, compares it to what is already present
in the knowledge base, and finds that "it already knew that", then it can also
conclude that "the parse must have been correct". Statistically, it can
reinforce the rules and patterns applied at each stage: they lead to
a consistent world-view, a Weltanschauung, so these rules and patterns must
have been correct.
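The reinforcement idea can be sketched in a few lines (every name and number below is invented for illustration; none of this is the actual OpenCog machinery): when a parsed assertion is already present in the knowledge base, nudge up the confidence of every rule that fired along the way, so the pipeline comes to prefer rule chains that produce a consistent world-view.

```python
# Toy knowledge base and per-rule confidences (all invented).
knowledge_base = {("sky", "is", "blue")}
rule_confidence = {"S-rule": 0.5, "O-rule": 0.5}

def reinforce(assertion, rules_used, step=0.1):
    """If the assertion was already known, boost the rules that derived it."""
    if assertion in knowledge_base:          # "it already knew that", so
        for rule in rules_used:              # "the parse must have been correct"
            rule_confidence[rule] = min(1.0, rule_confidence[rule] + step)

reinforce(("sky", "is", "blue"), ["S-rule", "O-rule"])
print(rule_confidence)  # both rules nudged from 0.5 up to 0.6
```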

What is "meaning" in this framework? It is the shared consensus reality
between the automated system and humans. If the system reads, and
learns, that the sky is blue (except on Mars, where it's not), who are we to
argue that the system has not automatically grappled with "meaning"
(Searle's Chinese Room be damned)?  I'd love to be able to argue a
more nuanced definition of "meaning", but cannot.  I have high hopes
that, once the system has been assembled and explored a bit, its deficiencies
will become clear, and the doors will open to a more refined
debate about the meaning of meaning.

Credit where credit is due: the ideas above are not original.  There
are battalions of grad students at various universities building similar
systems right now. There's a group near me, at UT Austin, doing this.
I assume there's another under John Sowa. Several headline-worthy startups
are working this vein, notably Powerset, recently acquired by Microsoft
for $100M!  Wow - talk about the squish of juicy money.

In my case, my patron is Ben Goertzel, of Novamente. He's the big
thinker behind all this. The software I describe, little of it created by
me, *is* publicly available:

Link-grammar: http://www.abisource.com/projects/link-grammar/ (BSD license)

RelEx: http://opencog.org/wiki/RelEx or https://launchpad.net/relex
(Apache license)

OpenCog: http://opencog.org/ (Affero GPL license)

Is anyone here likely to be interested in this software? Probably not:
I use these tools the way chemists use beakers and rubber hose: of no
particular intrinsic value, other than what you actually do with them.

Does the software currently "learn new patterns" by means of
"statistical corpus analysis"? Barely; no, not really, not automatically.
I have large databases of tens of millions of word pairs and
CPU-months of pre-parsed texts, which get monkeyed with daily.
I'll let you know if something impressive comes about.
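For readers curious what "databases of word pairs" might feed into, here is a minimal sketch of one common first step: counting adjacent word pairs in a corpus and scoring them by pointwise mutual information, so that pairs co-occurring more often than chance stand out. The three-sentence corpus is invented, and this is only a generic illustration, not the actual pipeline described above.

```python
import math
from collections import Counter

# Invented miniature corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran"]

word_counts, pair_counts, total_pairs = Counter(), Counter(), 0
for sentence in corpus:
    words = sentence.split()
    word_counts.update(words)
    for left, right in zip(words, words[1:]):  # adjacent word pairs
        pair_counts[(left, right)] += 1
        total_pairs += 1

total_words = sum(word_counts.values())

def pmi(left, right):
    """Pointwise mutual information of an adjacent word pair."""
    p_pair = pair_counts[(left, right)] / total_pairs
    p_left = word_counts[left] / total_words
    p_right = word_counts[right] / total_words
    return math.log2(p_pair / (p_left * p_right))

print(pmi("the", "cat"))  # positive: "the cat" co-occurs more than chance
```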

--linas

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


