[Corpora-List] ad-hoc generalization and meaning

Rob Freeman lists at chaoticlanguage.com
Sat Sep 15 09:02:24 UTC 2007


Hi Paula,

Sorry to come down hard on your earlier post. It was a cumulative reaction
to a number of messages which seemed to question not only my concrete
suggestions, but any desire to move from the status quo at all.

On 9/14/07, Paula Newman <paulan at earthlink.net> wrote:
>
>  Rob,
>
> Re:
> RF> does the study of language have to be divided up in the ways you
> describe?
>
> Of  course not.  I was providing a framework in which to ask a question,
> namely, what is the purpose of your  proposal?
>  Is it to further the study of language?  To develop methods of
> implementing NL processors?
> To form the basis for new formalisms useful in both contexts? To develop
> new types of corpus annotation? Or?
>
> And that was just to get at (i.e. pin down) what you are actually
> suggesting.
>
> Perhaps another way of getting there is via another question:
> given that you have an idea in mind that you seem to think is new, how
> would you pursue it?
>
> That words, meanings, and the contexts in which they occur are
> interdependent is well known.
> What new approach are you proposing to deal with that fact?  People have
> been struggling over it for years, on both theoretical and practical levels.
>

No-one has suggested a treatment of syntax which makes generalizations about
word associations, ad-hoc, in context.

This is of immediate relevance for machine learning. In machine learning
work, it is only the goal of finding a complete grammar which needs to
change, nothing else.

But the question "how would you pursue it" is as broad as the subject. As I
say the implications for machine learning are that we should stop looking
for complete grammatical descriptions of corpora (and focus instead on
software for generating very precise incomplete generalizations, at will.)
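To make "precise incomplete generalizations, at will" concrete, here is a minimal sketch of what I have in mind (the corpus, the helper names, and the context window are all my own invented illustration, not an existing system): rather than compiling a global grammar offline, we index the corpus and, only when asked, group together the words that share distributional contexts with a given word. The class exists only for the duration of the query.

```python
from collections import defaultdict

def context_profiles(corpus):
    """Map each word to the set of (left, right) contexts it appears in."""
    profiles = defaultdict(set)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            profiles[tokens[i]].add((tokens[i - 1], tokens[i + 1]))
    return profiles

def ad_hoc_class(word, profiles):
    """Generalize on demand: the words sharing at least one context with
    `word`. Nothing is precompiled into a global class; the grouping is
    built only when asked for, relative to this word's own contexts."""
    return {
        other for other, ctxs in profiles.items()
        if other != word and ctxs & profiles[word]
    }

corpus = [
    "the cat sat quietly",
    "the dog sat quietly",
    "the cat ran away",
]
profiles = context_profiles(corpus)
print(sorted(ad_hoc_class("cat", profiles)))  # → ['dog']
```

The point of the sketch is the negative space: there is no stored category "noun" anywhere, only a generalization produced at will, which would come out differently for a different word or a different corpus.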

That is just the beginning. Grammatical incompleteness doesn't just suggest
we should stop trying to label texts automatically, it suggests we should
stop trying to label texts at all. What is the purpose of labeling your text
if someone else can label it another way, and be right too?

If we must label then we need to focus on talking about justifications for
labels, not the labels themselves. Labels only give a point of view. (In
principle corpus linguists already reject labels. In practice many use them,
and their provisional status is not always clear. Formal incompleteness
gives those corpus linguists who reject grammatical summarizations of
corpora a first principles explanation _why_ corpora can't be summarized.)

It suggests changes in the way we should teach language. If the corpus is
the most complete description of a language, then we should teach examples,
not grammar. If grammar can only be understood in terms of ad-hoc
generalizations over examples, then grammatical explanations of language
will be meaningless in the absence of sufficient exposure to examples.

There are implications for search engines. I'm suggesting language works
much like an indexed search engine (ad-hoc search.)

As fields of technology, natural language processing and indexed search are
currently disconnected. They should be the same.

Arguably indexed search is already the most successful "natural language"
technology of all time (check my definition: it does stuff with text, and it
makes money...) But while search engine results can be seen as ad-hoc
categories of "meaning", these categories are currently found by search
engines solely with reference to lexicon. (The information currently given
to search engines is a bit like Mike Maxwell's "syntaxless" example:
"garden-the-to accompanied tomato-plant-his Tom".) If we now have a theory
for the way syntax selects meaningful categories, ad-hoc, from text, in
principle we could have a Web search that reflects the syntax of your query,
not only the words. (And do this properly, mark you! Attempts to apply
natural language to search have failed up to now because our model of
natural language, and what we find works for search, have been different.
Search indexes and clusters ad-hoc, NLP tries to find global classes. I
would base them both on the same ad-hoc search model. The current model of
search, which finds ad-hoc patterns among documents by indexing them on
words, would be integrated with a model of meaning which uses ad-hoc syntax
to make subtle distinctions between different uses of the same
word--distinguishing different uses of "the man with a stick" for instance.)
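As a toy illustration of the direction I mean (the documents, function names, and the use of adjacent word pairs as a crude stand-in for syntax are my own invented example, not a description of any real engine): index not only the words of a document but also their local combinations, and the same inverted-index machinery starts to reflect the structure of a query, not just its vocabulary.

```python
from collections import defaultdict

def index_docs(docs):
    """Two inverted indexes: one on bare words (roughly, current search),
    one on adjacent word pairs (a crude stand-in for local syntax)."""
    word_idx, pair_idx = defaultdict(set), defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for tok in tokens:
            word_idx[tok].add(doc_id)
        for a, b in zip(tokens, tokens[1:]):
            pair_idx[(a, b)].add(doc_id)
    return word_idx, pair_idx

def search_words(query, word_idx):
    sets = [word_idx[t] for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

def search_pairs(query, pair_idx):
    tokens = query.lower().split()
    sets = [pair_idx[p] for p in zip(tokens, tokens[1:])]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "the man with a stick walked by",
    2: "a man walked by with the stick insect",
}
word_idx, pair_idx = index_docs(docs)
print(search_words("man with a stick", word_idx))  # → {1, 2}
print(search_pairs("man with a stick", pair_idx))  # → {1}
```

The word index cannot tell the two documents apart; the pair index already can. Replace the fixed pairs with ad-hoc syntactic generalizations of the kind sketched above and you get the query-sensitive search I am arguing for.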

While we are indexing information more effectively, why stop at text? A
model of language based on ad-hoc classes suggests why speech recognition
does not work well. As I pointed out, this problem of incompleteness was
first observed in phonemic categories. But what did linguistics do? Our
reaction was to drop phonemics as a study and let the engineers get along as
best they could! Now we can help them. The answer is that the categories of
speech need to be treated on an ad-hoc basis, not learned globally in hidden
markov models.
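The contrast I mean between global and ad-hoc categorization can be sketched in a few lines (the 2-D points standing in for acoustic feature vectors, and the labels, are purely illustrative): an instance-based classifier decides each frame's category afresh from its nearest stored examples, with no globally trained model in between.

```python
from math import dist

def knn_label(frame, examples, k=3):
    """Ad-hoc categorization: decide this frame's category from its
    k nearest stored examples, computed per query, rather than from
    parameters fitted globally in advance (as an HMM's are)."""
    nearest = sorted(examples, key=lambda ex: dist(frame, ex[0]))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# toy 2-D points standing in for real acoustic feature vectors
examples = [
    ((0.10, 0.20), "a"), ((0.20, 0.10), "a"), ((0.15, 0.25), "a"),
    ((0.90, 0.80), "i"), ((0.80, 0.90), "i"), ((0.85, 0.95), "i"),
]
print(knn_label((0.2, 0.2), examples))  # → a
```

Whether anything this simple scales to real speech is an open question; the point is only the shape of the computation, category-on-demand rather than category-in-advance.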

Taking this to its logical extreme, it says things about the way we need to
model knowledge. There is currently a vast disconnect between the way
computers work and the way we think (exemplified by the way language works.)
This is evident in the gap people see between "formal models" and natural
language. If we can bridge that gap the possibilities are breathtaking.

How much more do you want me to write?

> RF> I think the idea of "informal grammar" is a muddle too. I don't think
> RF> grammar is "informal", I think it is "necessarily incomplete".
> OK, I thought it was your term.  But, and as you have been advised many
> times, everyone knows the latter.  The observation that  "Any grammar leaks"
> is a very old one.  I used to think it was by Jane Robinson, but I've
> recently seen an attribution to Sapir.
>

Absolutely. However while we all seem to know this "informally", formally it
has been ignored. This knowledge has not changed what we do one bit. "Gee,
all our grammars leak. Oh well, better just look harder."

If all grammars leak, why are we still looking for grammars? How about
turning that around. Maybe the "leaks" are the system we've been looking
for.

The history of science is full of such reversals. Maybe one is needed here.

-Rob