9.284, Disc: NLP and Syntax



LINGUIST List:  Vol-9-284. Wed Feb 25 1998. ISSN: 1068-4875.

Subject: 9.284, Disc: NLP and Syntax

Moderators: Anthony Rodrigues Aristar: Texas A&M U. <aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>

Review Editor:     Andrew Carnie <carnie at linguistlist.org>

Editors:  	    Brett Churchill <brett at linguistlist.org>
		    Martin Jacobsen <marty at linguistlist.org>
		    Elaine Halleck <elaine at linguistlist.org>
                    Anita Huang <anita at linguistlist.org>
                    Ljuba Veselinova <ljuba at linguistlist.org>
		    Julie Wilson <julie at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Zhiping Zheng <zzheng at online.emich.edu>

Home Page:  http://linguistlist.org/


Editor for this issue: Martin Jacobsen <marty at linguistlist.org>

=================================Directory=================================

1)
Date:   Tue, 24 Feb 1998 13:18:48 -1000
From:  "Philip A. Bralich, Ph.D." <bralich at hawaii.edu>
Subject:  Re: 9.276, Disc: NLP and Syntax

2)
Date:  Wed, 25 Feb 1998 08:43:15 -0500
From:  stephen p spackman <stephen at acm.org>
Subject:  Re: 9.255, 9.276 Disc: NLP and syntax

-------------------------------- Message 1 -------------------------------

Date:   Tue, 24 Feb 1998 13:18:48 -1000
From:  "Philip A. Bralich, Ph.D." <bralich at hawaii.edu>
Subject:  Re: 9.276, Disc: NLP and Syntax

At 12:36 PM 2/24/98 -1000, Dr. Joel M. Hoffman wrote:

>>You have the point rather backward here.  I am saying that if a theory
>>of syntax (whatever other things might be done in NLP) is to consider
>>itself a mature theory of syntax it should be able to produce programs
>>that meet those minimum standards I have outlined (appended at the end
>>of this message).
>
>Having seen some of the responses, I'd like to respond somewhat more
>forcefully to what seems like a very silly position.  There are two
>issues:
>
>	1.  If a theory isn't implemented as a parser is it
>necessarily bad?

The question is not as phrased above, but as follows:

        Nothing in principle prevents the theoretical mechanisms that
        have been proposed by these theories from being implemented
        in a programming language, other than the fact that the
        theories have not been completely worked out.

        Look at any rule or formalism from any theory and ask
        yourself what prevents it from being implemented in a
        programming language.  Nothing does.  Then ask yourself why
        those very minimal standards cannot be met even after 35
        years of work and millions of dollars in resources.

You are obviously not reading the standards.  Take a look at them
closely (appended below for convenience).  They are very simple, and
you will note that most people have assumed that at least that much
had already been accomplished by these theories.  Yet, even though
nothing prevents their formalisms from being programmed, they cannot
produce parsers that meet even those minimal standards.


>	2.  Is the parser at http://www.ergo-ling.com/ indicative of a
>theory better than all others in existence?

> The answer to 1 is clearly "no," even though the converse may be
>true: implementation as a parser may indicate quality in a theory.
>
>But the answer to 2 is equally clearly "no."  Trying a handful of
>sentences shows that --- like most current "working" parser
>implementations --- the program relies on tricks whose underlying
>linguistics is dubious.

All the sentences that work on this parser meet the standards that
have been set.  No other parser comes even close on those sentences
that do work.  The criterion for choosing among parsers naturally has
to be which one does the most.  Looking at a few sentences here and
there simply will not provide interesting information for anyone.  I
am currently putting together a follow-up post that will propose a
series of test sentences for each of the standards.  This will help
those who do not have a background in this area run their own tests.

We are not yet at 90 or 95% accuracy with unrestrained input.  That
is still a ways off.  However, we are not in the same position as
other parsers on this score either; we are doing thousands of times
better than they are.  For example, in current speech recognition
systems, you can choose from a fixed list of a few hundred short
commands to do a little command and control with your computer.  By
adding the Ergo parsing tools, however, you can increase that to
thousands, as indicated by the following template.

(could/would/will/can) (you) (please) open/grab/get/take/etc. the
file/document, followed by one of:
    called/named 555.doc
    which/that/[no relative pronoun] I called/named 555.doc
    that/which is called/named 555.doc

This is not just a matter of being able to match a few synonyms;
rather, it represents a parsing capacity that can be capitalized upon
with dozens of different structure types.  You are right that we
cannot do everything that comes across our table, but we can increase
by thousands the output that is currently available.  Reviewers and
stockholders would rave if you could just double the output from a
few hundred to 500.  Imagine how they will react if you increase that
total to thousands.
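
As a rough, hypothetical illustration (this is not Ergo's code; the
word lists and counts are made up for the example), a few lines of
Python show how a single parse-driven template of this kind
multiplies into hundreds or thousands of surface commands:

    # Hypothetical expansion of the command template above; every list
    # here is an illustrative assumption, not Ergo's actual grammar.
    from itertools import product

    modals   = ["", "could ", "would ", "will ", "can "]   # optional
    you      = ["", "you "]                                # optional
    please   = ["", "please "]                             # optional
    verbs    = ["open", "grab", "get", "take"]
    nouns    = ["file", "document"]
    relative = ["called 555.doc", "named 555.doc",
                "which I called 555.doc", "that I named 555.doc",
                "that is called 555.doc", "which is named 555.doc"]

    commands = [
        f"{m}{y}{p}{v} the {n} {r}"
        for m, y, p, v, n, r in product(modals, you, please,
                                        verbs, nouns, relative)
    ]
    print(len(commands))   # 5 * 2 * 2 * 4 * 2 * 6 = 960 variants

With more verbs, nouns, and file names the count quickly runs into
the thousands, which is the point being made above.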

In addition to the above, we can also offer question/answer,
statement/response repartee like the following.

the tall thin man in the office is reading a technical report

what is the (tall)(thin) man doing/reading?
Is the (tall)(thin) man reading a (technical) report?

Thomas Jefferson was the third President of the United States
Who was the third President of the United States?
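
To make the repartee idea concrete, here is a minimal sketch, under
simplifying assumptions and not based on the Ergo engine, that
handles only the "Who ..." pattern by aligning the body of the
question with a stored statement and returning the constituent that
fills the wh-slot:

    # Toy question answering: "Who <predicate>?" is answered by a
    # statement that ends in the same predicate; its subject is returned.
    def answer_who_question(question, statements):
        q = question.rstrip("?").strip()
        if not q.lower().startswith("who "):
            return None                      # only "Who ..." is handled here
        predicate = q[len("who "):].lower()  # "was the third president ..."
        for s in statements:
            body = s.rstrip(".").lower()
            if body.endswith(predicate):
                return s[: len(body) - len(predicate)].strip()  # the subject
        return None

    facts = ["Thomas Jefferson was the third President of the United States"]
    print(answer_who_question(
        "Who was the third President of the United States?", facts))
    # -> "Thomas Jefferson"

A real system would of course work from parses rather than string
suffixes, but the matching of question parts to statement parts is
the same in outline.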

Again, we cannot handle 90% of whatever anyone throws at us, but we
can do thousands of times more interesting searches than are
currently available on the web.  This is not a bad thing, and you can
bet that if the other theories could meet those standards in one
tenth of the cases that we can, they too would be citing the value of
those standards.

>For example, from:
>
>1.	In the beginning God created heaven and earth
>
>the program produces:
>
>2. Sorry, this is not a sentence, but it is a good adjective phrase.
>	Couldn't parse sentence...

Currently our parser does not treat the word "God" in its capitalized
sense.  For that reason you must type in "In the beginning, the god
created heaven and earth."  Type in "In the beginning, furniture
created heaven and earth" and you will find that it works.  Your
sentence will work when we add the capitalized "God" to the
dictionary with the appropriate lexical features.

This is one good reason why I am working up a file of test sentences
to accompany the standards.  It will give people a clearer sense of
the sorts of sentences that are possible.
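
As a purely hypothetical sketch of what "adding a word with the
appropriate lexical features" might look like (the format and feature
names are illustrative assumptions, not Ergo's actual dictionary), an
entry could be as simple as:

    # Hypothetical lexical entry; feature names are illustrative only.
    LEXICON = {
        "God": {
            "category": "noun",
            "proper": True,          # capitalized, determiner-less use
            "number": "singular",
            "features": {"animate": True},
            "synonyms": [],
        },
    }

Once such an entry exists, "In the beginning God created heaven and
earth" can be parsed like any other subject-verb-object sentence.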

>The claim is that the program that produces (2) from (1) is the best
>theory of syntax.  Is this to be taken seriously?  Have I missed
>something?

Yes, you have.  Think of it like this: look at the standards closely
and ask yourself the following.
        1)    Have I and many others assumed these theories could
              account for this much in a relatively trivial manner?
        2)    If they cannot implement these basics in a program,
              why not?
        3)    If only one theory can do this, does it mean that only
              one theory has made a proper analysis of this
              rudimentary level of syntax (and is therefore the best)?
        4)    What does this mean for syntax and linguistics overall?

Phil Bralich

HERE IS A BRIEF PRESENTATION OF THE STANDARDS IN SEVEN AREAS:

1.  At a minimum, from the point of view of the STRUCTURAL ANALYSIS
OF STRINGS, the parser should: 1) identify parts of speech, 2)
identify parts of sentence, 3) identify internal clauses (what they
are and what their role in the sentence is, as well as the parts of
speech, parts of sentence, and so on of these internal clauses), 4)
identify sentence type (without using punctuation), 5) identify tense
and voice in main and internal clauses, and 6) do 1-5 for internal
clauses.


2.  At a minimum, from the point of view of EVALUATION OF STRINGS,
the parser should: 1) recognize acceptable strings, 2) reject
unacceptable strings, 3) give the number of correct parses
identified, 4) identify what sort of items succeeded (e.g. sentences,
noun phrases, adjective phrases, etc.), 5) give the number of
unacceptable parses that were tried, and 6) give the exact time of
the parse in seconds.

3.  At a minimum, from the point of view of MANIPULATION OF STRINGS,
the parser should: 1) change yes/no and information questions to
statements and statements to yes/no and information questions, 2)
change actives to passives in statements and questions and change
passives to actives in statements and questions, and 3) change tense
in statements and questions.  (A minimal sketch of this kind of
manipulation appears after standard 7 below.)

4.  At a minimum, based on the above basic set of abilities, from the
point of view of QUESTION/ANSWER, STATEMENT/RESPONSE REPARTEE, the
parser should: 1) identify whether a string is a yes/no question,
wh-word question, command, or statement, 2) identify tense (and
recognize which tenses would provide appropriate responses), 3)
identify relevant parts of sentence in the question or statement and
match them with the needed relevant parts in text or databases, 4)
return the appropriate response as well as any sound, graphics, or
other files that are associated with it, and 5) recognize the
essential identity between structurally related sentences (e.g.,
recognize that either "John was arrested by the police" or "The
police arrested John" is an appropriate response to either "Was John
arrested (by the police)?" or "Did the police arrest John?").

5.  At a minimum, from the point of view of RECOGNITION OF THE
ESSENTIAL IDENTITY OF AMBIGUOUS STRUCTURES, the parser should
recognize and associate structures such as the following: 1)
existential "there" sentences with their non-there counterparts
(e.g., "There is a dog on the porch," "A dog is on the porch"), 2)
passives and actives, 3) questions and related statements (e.g.,
"What did John give Mary?" can be identified with "John gave Mary a
book."), 4) possessives in three forms ("John's house is big," "The
house of John is big," "The house that John has is big"), 5) heads of
phrases as the same in non-modified and modified versions ("the tall
thin man in the office," "the man in the office," "the tall man in
the office," and "the thin man in the office" should be recognized as
referring to the same man, assuming the text does not include a
discussion of another "short man" or "fat man," in which case the
parser should request further information when asked simply about
"the man"), and 6) others to be decided by the group.

6.  At a minimum from the point of view of COMMAND AND CONTROL, the
parser should: 1) recognize commands, 2) recognize the difference
between commands for the operating system and commands for characters
or objects, and 3) recognize the relevant parts of the commands in
order to respond appropriately.

7.  At a minimum from the point of view of LEXICOGRAPHY, the parser
should: 1) have a minimum of 50,000 words, 2) recognize single and
multi-word lexical items, 3) recognize a variety of grammatical
features such as singular/plural, person, and so on, 4) recognize a
variety of semantic features such as +/-human, +/-jewelry and so on,
5) have tools that facilitate the addition and deletion of lexical
entries, 6) have a core vocabulary that is suitable to a wide variety
of applications, 7) be extensible to 75,000 words for more complex
applications, and 8) be able to mark and link synonyms.
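
To make standards 3 and 5 more concrete, here is a minimal sketch
under simplifying assumptions (pre-tokenized input, a hand-listed set
of auxiliaries, regular verb forms only); it is an illustration, not
the Ergo implementation:

    # Toy manipulations over pre-tokenized clauses.
    AUXILIARIES = {"is", "are", "was", "were", "will", "can",
                   "has", "have", "do", "does", "did"}

    def statement_to_yes_no_question(tokens):
        """Front the first auxiliary, as in standard 3."""
        for i, tok in enumerate(tokens):
            if tok.lower() in AUXILIARIES:
                return " ".join([tok.capitalize(), *tokens[:i],
                                 *tokens[i + 1:]]) + "?"
        raise ValueError("no auxiliary; do-support is not handled here")

    def passive_to_active(tokens):
        """Undo a simple 'X was VERBed by Y' passive, as in standard 5."""
        was_i, by_i = tokens.index("was"), tokens.index("by")
        patient = tokens[:was_i]          # "John"
        verb    = tokens[was_i + 1]       # "arrested"
        agent   = tokens[by_i + 1:]       # "the police"
        return " ".join([*agent, verb, *patient])

    print(statement_to_yes_no_question(
        "the man is reading a technical report".split()))
    # -> "Is the man reading a technical report?"
    print(passive_to_active("John was arrested by the police".split()))
    # -> "the police arrested John"

Recognizing "John was arrested by the police" and "The police
arrested John" as equivalent (standard 4, item 5) then reduces to
comparing their normalized forms.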

Philip A. Bralich, Ph.D.
President and CEO
Ergo Linguistic Technologies
2800 Woodlawn Drive, Suite 175
Honolulu, HI 96822

Tel: (808)539-3920
Fax: (808)539-3924


-------------------------------- Message 2 -------------------------------

Date:  Wed, 25 Feb 1998 08:43:15 -0500
From:  stephen p spackman <stephen at acm.org>
Subject:  Re: 9.255, 9.276 Disc: NLP and syntax

I think it might be added that a further reason to doubt Philip
Bralich's (writing from Anne Sing's account?) criticisms of syntactic
theory is that the task of producing parse trees is not one that
humans can perform without a great deal of training (I suspect that
enough Americans were taught "sentence diagramming" in school that
they may not realise this; as a tolerably talented speaker of English
who first encountered the notion in college, let me assure you that it
is so).  Syntactic analysis *itself* is just a tool by which linguists
discuss their theories with one another; it is not a necessary part of
the result and may indeed turn out to be part of the "alchemy" phase
of the science. Certainly it is no more than the "particle" side of
the story, with the waves still to come.

Indeed, this should not be surprising, since printed text and even
well formed sentences no more than overlap with our primary data.

As regards sentence characterisation, there's real evidence that
speakers do not even _intend_ a distinction between questions,
statements and commands - let alone give them clear syntactic
realisations that one would wish to recover in either a linguistic
theory or a practical NLP system. Given a construction (common enough
in some styles of speech) like "It's cold in here?" (the question mark
here representing intonational cues) a speaker _listening to a tape of
themselves speaking_ only a few minutes earlier will frequently _ask
to hear their interlocutor's response_ before making a judgment as to
what their intention in the utterance was (this effect was noted by
Elizabeth Hinkelman while doing research on a related topic). If
that's "ambiguity", then the world is a stranger place than we knew
(it is, and it is).

To follow Bralich's analogy back to mathematics, he appears to be
taking academic mathematics to task because current mathematical
theories do not generate the "standard" notation for the integral
calculus: first, theory is not _about_ notation (though it employs
it); and second, the traditional notation of calculus, though a
"standard", is known to be technically incorrect (though useful enough
as neutral ground in paper discussion or when comparing one system
against another - in those cases where it happens to apply).

Then again, this note is not intended as a defense of the current
state of syntax; we have many miles to travel yet!

regards all
stephen p spackman
<stephen at acm.org>

---------------------------------------------------------------------------
LINGUIST List: Vol-9-284


