[Corpora-List] corpus syntax (and how we can use it to code meaning)

Rob Freeman lists at chaoticlanguage.com
Mon Sep 17 23:41:33 UTC 2007


My purpose in recent posts has been to give a new point of view on some old
arguments, and to propose some concrete solutions which follow from that new
point of view.

I want to summarize some of the more practical aspects of those solutions. I
have not simply been arguing to win points. I think there are practical
applications to be made of these ideas, that they can clarify the way we
think about language.


The Basic Idea

The basic idea relates to an observation first made over 50 years ago, but
which I claim was misinterpreted for 50 years.

The observation was that we cannot make global generalizations about natural
language structure.

What that means in practice is that there will always be generalizations
which can be said to be true _and_ not true about natural language
structure. This suggests we must make generalizations about language only
with reference to a context.

By analogy to formal grammar theory I summarize this by saying that natural
language grammar appears to be "necessarily incomplete".


True and Not True

I have illustrated my point with small examples of generalizations that can
be said to be both true and not true of words in corpora.

For instance:


To Specify Syntax

Peter Howarth gives examples of slightly disfluent constructions
produced by ESL students (Peter Howarth, Phraseology and Second Language
Acquisition, 1998):

e.g. "*_attempts_ and researches have been _done_ by psychologist..."

That we understand this, and yet that it seems odd, is explained by the
observation that "done" and "made" can be considered to have the same syntax
in some contexts (e.g. "do/make a study"), but in the context of "attempt"
they do not have the same syntax (for most people?). So it is true that
"done" is in a class with "made" in some contexts (e.g. "a study"), but it is
not true in other contexts (e.g. "attempts").

By this principle, the more contexts two words share, the more similar we
might expect their syntax to be. We must always remain aware, though, that
in detail they may behave differently in a given context (e.g. "attempts"
clearly selects "make" and not "do").
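
The principle above can be sketched in a few lines of Python. This is only a
minimal illustration, assuming a toy table of verb/object pairs. Only
"do/make a study" and "make an attempt" come from the examples above; the
other pairs, and all function names, are invented for the sketch and are not
drawn from any real corpus.

```python
from collections import defaultdict

# Toy verb/object observations. Hypothetical data for illustration only:
# apart from "do/make a study" and "make an attempt", these pairs are assumed.
OBSERVED = [
    ("make", "study"), ("do", "study"),
    ("make", "attempt"),
    ("make", "effort"), ("make", "mistake"),
    ("do", "research"), ("do", "work"),
]

def contexts_by_word(pairs):
    """Map each verb to the set of object contexts it was observed with."""
    table = defaultdict(set)
    for verb, noun in pairs:
        table[verb].add(noun)
    return dict(table)

def shared_contexts(table, a, b):
    """Contexts in which both words are attested."""
    return table.get(a, set()) & table.get(b, set())

def overlap(table, a, b):
    """Jaccard overlap of the two context sets: a crude global similarity."""
    ca, cb = table.get(a, set()), table.get(b, set())
    union = ca | cb
    return len(ca & cb) / len(union) if union else 0.0

def same_class_in(table, a, b, context):
    """True only if both words are attested in this particular context."""
    return context in table.get(a, set()) and context in table.get(b, set())

table = contexts_by_word(OBSERVED)
print(shared_contexts(table, "do", "make"))           # {'study'}
print(same_class_in(table, "do", "make", "study"))    # True
print(same_class_in(table, "do", "make", "attempt"))  # False
```

The point of keeping `same_class_in` separate from `overlap` is that the
global score only suggests how similar two words might be, while the class
membership question is always asked relative to one context, so it can be
true for "study" and not true for "attempt" at the same time.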

This might be useful to explain the seemingly random vagaries of syntax to
students in a language learning environment.

Or it might be used to improve predictions about what word sequences are
possible in speech recognition systems (the same could be said of phonemes.)



To Select Meaning

The generalizations given above are useful to predict syntax. But such
ad-hoc generalizations can be used in another way. They do not only restrict
syntax in context-specific ways. We can reverse our perspective and consider
syntax to select classes of ad-hoc generalizations appropriate to a token,
and associate these classes with meaning.

E.g. for the two sentences:

I supported the man with a stick.
I accompanied the man with a stick.

The words "supported" and "accompanied" can not only be thought of as being
selected by syntactic generalizations about the phrase "the man with a
stick", they can also be thought of as _selecting_ classes of syntactic
generalizations about "the man with a stick", which specify one or other
meaning for that phrase.

As I wrote earlier:

'For instance, if the word used selects a set of contexts which includes the
context "tomato plant" we will see one meaning ("supported" will do this),
but if it selects a class which does not include "tomato plant", we will see
another ("accompanied" will do this.)

Note: you need an ad-hoc treatment of syntax for this to work. Otherwise the
classes ("the man with a stick" = "tomato plant" or "the man with a stick"
!= "tomato plant") will be conflated, and "the man with a stick" will always
be the same.'

This could be useful, for instance, in search engines. Currently search
engines index only words and phrases. They do not distinguish the meaning of
a word or a phrase in the context in which it is used. According to the
method outlined here, we might use the context around a word or phrase to
select, ad-hoc, a class of words or phrases which are similar to that word or
phrase (in that context). These then might be considered to specify a
meaning for that word or phrase. We could search not only for the word or
phrase, but for that word or phrase used in the same sense.
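
A rough sketch of how such sense-specific matching might look, in Python.
Everything here is an assumption made for illustration: the toy observations
(including the contexts "propped up" and "escorted"), the function names, and
the use of a Jaccard threshold to decide that two uses share a sense. It is
not a description of how any existing search engine works.

```python
from collections import defaultdict

# Toy (context, phrase) observations. Hypothetical data for illustration only.
OBSERVED = [
    ("supported", "the man with a stick"),
    ("supported", "the tomato plant"),
    ("supported", "the sagging branch"),
    ("propped up", "the tomato plant"),
    ("propped up", "the sagging branch"),
    ("accompanied", "the man with a stick"),
    ("accompanied", "the visitor"),
    ("accompanied", "the children"),
    ("escorted", "the visitor"),
    ("escorted", "the children"),
]

def build_index(pairs):
    """Map each context to the set of phrases observed in it."""
    index = defaultdict(set)
    for context, phrase in pairs:
        index[context].add(phrase)
    return dict(index)

def adhoc_class(index, context, phrase):
    """Class selected ad hoc by a context: the other phrases attested in it."""
    return index.get(context, set()) - {phrase}

def same_sense(index, phrase, context_a, context_b, threshold=0.5):
    """Treat two uses as the same sense if their ad-hoc classes overlap enough."""
    a = adhoc_class(index, context_a, phrase)
    b = adhoc_class(index, context_b, phrase)
    union = a | b
    if not union:
        return False
    return len(a & b) / len(union) >= threshold

index = build_index(OBSERVED)
phrase = "the man with a stick"
print(same_sense(index, phrase, "supported", "propped up"))   # True
print(same_sense(index, phrase, "supported", "accompanied"))  # False
```

Here "supported" selects a class containing "the tomato plant" and
"accompanied" selects one that does not, so the same phrase is kept apart in
the two contexts rather than being conflated into a single entry.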


Conclusion

While the arguments in the preceding threads have often been very
theoretical and abstract, in practical terms what I am saying is not
difficult. It just requires a slightly different way of thinking about
problems. In particular it asks us to consider that there will be things
which can be said to be true _and_ not true of word associations in
corpora, depending on context, and suggests that we can use these true/not
true distinctions to select both syntax and meaning, specific to context,
in ways we have not been able to up to now.

That words and word groups can act both the same _and_ not the same, in
terms of the ways they associate with other words in a given corpus, means
we cannot generalize a complete grammar for them. We will never have a
complete grammar for any natural language (beyond the corpus). We've really
known this for a while. As Paula Newman noted, it goes back at least to
Sapir: "All grammars leak." What we now realize is that these "leaks" are
not a bug, but a feature, as the programmers say. Paradoxically, it is this
same seeming limitation which enables us to pack more information into
language than we would normally be able to, viz. the detail of
collocational restriction. Most importantly, recognizing that such contrasts
exist enables us to unpack that information.

-Rob