Project: Ontology

Tue Apr 25 17:56:28 UTC 2000

From: Patrick Cassidy <cassidy at micra.com>

                                                  April 22, 2000

    The following note contains a follow-up to some discussions
held at the meeting of the Association for Computational
Linguistics (ACL) last year, and is now being brought to the
attention of a wider group.  This is being sent to a number
of different listservers, as well as to the membership of the ACL,
and I apologize for what will inevitably be some duplication.
    Please send all comments directly to me.

    Best regards,
    Pat

=============================================
Patrick Cassidy
MICRA, Inc.                      || (908) 561-3416
735 Belvidere Ave.               || (908) 668-5252 (if no answer)
Plainfield, NJ 07062-2054        || (908) 668-5904 (fax)
internet:   cassidy at micra.com
=============================================

To:    Members of the Association for Computational Linguistics
           and others with an interest in knowledge representation,
           lexicons, and lexical semantics
From:     Patrick Cassidy (cassidy at micra.com)
Subject:  A Request to Participate in a Study of the Utility of a
           Standard Ontology and Lexicon for Natural Language
           Understanding (NLU) and database interoperability

==============
Background
==============
    In recent years there has been a great deal of effort in
building lexicons, ontologies, and terminologies, both for the
purposes of basic research and for practical applications.  The
advantages of common formats and common content, which allow reuse
of results between groups, have been widely recognized, but the
practical funding situation has required in most cases that
individual groups focus on relatively narrow aspects of the
general problem.   Efforts have also been underway for years
within and between a number of groups to develop common
resources to promote interchange of data and to compare results,
and to reference and organize the results of the many groups
who have prepared valuable resources.  These very valuable
projects have helped mitigate the difficulty of preparing and
finding useful ontologies and lexical resources. However,
there is still little prospect that these multiple
projects will lead in the near future to a unified common
ontology and lexicon that has sufficient detail and
functionality to be adopted by a large number of groups as
a reference standard, and which can be used directly without
substantial modification for a variety of purposes
in research and practical applications.  Of special value
would be the development of a common defining vocabulary of
concepts and associated words and relations that would be
sufficient to define all of the specialized concepts and words
used in applications.  The ability to use a common vocabulary
to define the concepts and words in diverse applications will
provide a level of interoperability unavailable by any
other means, except for one-by-one coordination between
projects.   The question arises whether it is now possible
to build on the large body of existing data and experience,
to construct such a reference standard within a
tightly coordinated single project.  The goal will be to create
a database that is as inclusive as possible of all of the
results and intuitions resulting from previous research and
development efforts, and to include as many as possible of the
current practitioners within the project to build this resource.
    The main problem is that development of a basic but
realistically large ontology and lexicon for Computational
Linguistics research will require a project to coordinate a
group -- probably a consortium of dispersed academic and
industrial participants -- of a size that will require substantial
funding.  Though large by the standards of most NLP research
projects, such a coordinated effort would still be modest by
comparison with funding for important research tools in other
areas of science, such as space probes, particle accelerators,
or telescopes.  Skepticism about the possibility of congressional
funding for such a project is understandable, but there is ample
precedent for obtaining special congressional funding of tools
for research.  What is needed is to show that the costs will be
repaid by the usefulness of this database both for research
and for construction of advanced applications.  At a minimum
there should be a survey to identify the potential users of
a standard ontology and lexicon.  In the eventuality that
special congressional funding could not be obtained, this
will still be useful to help move toward building common
resources by other means.

    At the annual meeting of the ACL in Maryland in June 1999
I helped organize a "birds-of-a-feather" meeting to discuss whether
there is at present a need and an opportunity to build a large
but basic ontology and lexicon for use in NLU research and
applications.  Among the 23 who participated in the discussion, most
had expended some effort building lexicons and ontologies for
natural language understanding, but some members were present
who had not themselves participated directly in such efforts.  We
spent over an hour discussing mostly the technical question of what
kind of ontology could be useful for natural language understanding,
and the political questions of whether it would be practical to
attempt to get agreement at this time among ontology developers
with different views of how to proceed.  The view was almost
unanimous that such a project should be attempted, though it was
recognized as technically and organizationally complex.  There was
also a large degree of skepticism as to whether we could convince
Congress to fund such a large project.  We had hoped to be able to
have a wider discussion among the general membership of the ACL,
but as it turned out the general business meeting ran well over its
allotted time, and when I raised the issue there was no time for
discussion, so a motion was made and passed that I should form a
committee to study the question and report back to a future meeting.
This note is the first request for participation in such a committee.
    The question of construction of a reference ontology for
Computational Linguistics and for database interoperability has
already been discussed over several years within the ANSI T2 ad
hoc committee on ontologies.  That ad hoc committee is no longer
actively meeting, and this note and its suggested formation of
a study committee is in part an attempt to fill the void left
by discontinuation of those discussions.   One of the conclusions
of those discussions was that substantially increased funding
would be needed for a coordinated effort, in order to move the
development of useful ontologies beyond the current stage in which
isolated groups each pursue their own ideas, which are generally
incompatible with or very difficult to merge with those of other
groups.  The present note is intended to bring the issues
addressed by the T2 committee to a wider group, and to form
a committee that can develop objective information that would
provide justification for the substantial funding needed for a
unified project.

   As mentioned, the complexity and size of such a project, which
would require a tightly coordinated effort with funding substantially
larger than a typical CL research project, makes it likely that
special funding would have to be obtained directly from Congress.
To obtain such funding it will be necessary to show that there is
a significant group of established researchers who have been active
in building lexicons and ontologies, and who believe that building
a standard reference is technically feasible at present, and that
such a reference would be used widely enough to justify the expense.
One can find expressions of such a belief in private conversations
and in published papers, as well as in the existence of research
efforts to build common lexical and ontological resources.  To begin
the process of developing a well-organized proposal that can be
considered seriously by Congress, what is needed is a more formal
study to present the findings of a broadly representative group
rather than of an individual or single research group.  This request
for participation in this study is only a first step in developing
such a proposal.
    The specific purposes for organizing this committee and the
subjects for discussion are:
(1) to determine the general characteristics of an ontology and
lexicon that would incorporate as much as possible of the results
and insights of those who have already spent many years doing research
on lexicons, ontologies, knowledge representation, terminologies,
and lexical semantics, and would be broadly useful for both research
and applications; and
(2) to estimate where and to what extent such a database, if built,
would in fact be used.  Quantitative data about potential
areas of use would be especially valuable, to demonstrate that
construction of such a database would be worth the cost.

    The structure of this committee is open to discussion.  I would
suggest that anyone with experience in any of the relevant fields
should be able to vote on any proposals for which a measurement of
opinion is needed, and those individuals wishing to participate as
voting members should inform me of that before the end of May.
Discussions will be conducted by e-mail (I will forward comments to
a list of interested persons), unless someone is willing to set up
a listserver for this purpose (perhaps an existing listserver should
be used?).  Individuals willing to prepare a report of the potential
uses of a defining ontology/lexicon in specific areas of research
or in applications would receive and summarize copies of any data or
suggestions relevant to their area, sent from any interested person.
The number of possible summaries is not limited, but will probably be
small.  Any individual is free to make any comments, and all comments
received will be forwarded to anyone wishing to receive them, unless
the sender specifically asks that they not be distributed.  I do not
anticipate that at this stage any degree of agreement could
be reached about any details of the structure of a common ontology
or lexicon, but some summary could be prepared of the various
alternatives that might be suggested.  I hope that at the NAACL-2000
meeting in Seattle in the first week of May, some preliminary
indication could be obtained about how many individuals would be
willing to participate as voting members and/or report writers.
I do not have a fixed timetable in mind, but probably three months
will be sufficient time for interested parties to determine
potential uses and send in comments.  The timing of subsequent
actions will depend on the wishes of the voting members of the
committee.  All persons interested in this project in any way
should contact me by e-mail (cassidy at micra.com) or telephone
(908-561-3416).  Suggestions about how to organize an
informal study of this type would also be welcome, but need to
be sent soon to be useful.

    It will be worthwhile to include in this study a summary of
all ontological and lexical resources currently available, and
I hope that some representative of every group that has built
any form of ontology, terminology, or other lexical resource,
which is now available to the public or might become part of a
common reference ontology/lexicon, would send me a brief summary
of their projects and a reference to the location of any existing
data available publicly.  There are already several web sites
on which pointers to the locations of such resources are listed,
and the owners of those sites and those who have prepared other
lists of available resources are encouraged to send a copy of
the lists they have already prepared.  The complete summary of
references to such resources submitted will be published as
part of the report of the committee.

    The data that are most needed to determine potential utility
of a reference database will be estimates of how much such a
common ontology or lexicon would be used.  For this purpose, anyone
who would be likely to even try using it should send a note indicating
the type of system in which it would be used and how it would be
used, and how much more efficiently the system might function.
I would expect that anyone currently using an ontology or semantic
network would want to try such an ontological lexicon, and if there
are those who would not try it, the reasons for this skepticism
will probably serve as useful input.
   One of the important questions to be answered is whether one can
estimate potential utility in quantitative terms, and if so, how.
The likelihood of the ontology being used in one's own system
may be expressed in any way, but at least three levels can be
distinguished: (1) those who would be willing to participate in
construction of such an ontological lexicon; (2) those who would
be likely to adopt a standard ontology or lexicon, if it existed;
and (3) those who would try using a standard ontology or lexicon,
to test its utility.

    Descriptions of potential commercial uses would be especially
valuable for convincing Congress that funding is justified.
For example, estimates have been made that electronic commerce
over the internet will amount to 425 billion dollars by 2001 (IEEE
Intelligent Systems, Jan/Feb 1999 "Let's Go Shopping" by Michael
McCandless, pp. 2-4).  Labor costs in sales transactions tend to run
about 10%, so the costs of executing those transactions would be
about 40 billion dollars.  If these costs could be reduced by 1% due
to efficiencies generated by the use of a standard knowledge
representation scheme, those cost savings would amount to 400
million dollars per year. The total cost of the development of such
an ontology would then be paid back in less than 6 months.  One can
make similar estimates for other activities which use advanced
computer programs, and find similar likely savings.  Thus even a
minuscule improvement in the efficiency of computer programming
or the use of computer programs would appear to make this project
cost-effective.  However, estimates of this type will be far more
convincing if they come from those involved in the development or
use of programs which have or should have semantic elements, and who
can provide more accurate and objectively based estimates for
specific examples.
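    The arithmetic behind the electronic-commerce estimate above can
be restated as a short Python sketch.  The 425 billion dollar market
size, the 10% labor share, and the 1% efficiency gain are the figures
quoted above; the text rounds the results to 40 billion and 400
million.  This note does not state a project cost, so the 200 million
dollar figure below is only a hypothetical placeholder, chosen to show
what a payback period of under 6 months would imply.  The function
and variable names are likewise illustrative.

```python
def annual_savings(market_usd, labor_share, efficiency_gain):
    """Return (labor cost, savings) in dollars per year."""
    labor_cost = market_usd * labor_share
    return labor_cost, labor_cost * efficiency_gain


def payback_months(project_cost_usd, savings_per_year_usd):
    """Months of savings needed to cover a one-time project cost."""
    return 12 * project_cost_usd / savings_per_year_usd


# Figures quoted in the text (which rounds 42.5 down to "about 40").
labor_cost, savings = annual_savings(425e9, 0.10, 0.01)

# HYPOTHETICAL project cost; the note itself does not give one.
assumed_cost = 200e6

print(f"labor cost: {labor_cost / 1e9:.1f} billion dollars per year")
print(f"savings: {savings / 1e6:.0f} million dollars per year")
print(f"payback: {payback_months(assumed_cost, savings):.1f} months")
```

Substituting figures from a specific industry or product in place of
the three inputs gives the same kind of crude estimate for that case.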
    In the best case, an industrial group that maintains a database
that already uses an ontology to enhance its functionality
might estimate, for example, that an ontology of the type
described would likely improve the efficiency of the program
by, say, 5%.  This number, multiplied by annual sales of the
program, could provide a crude estimate of economic benefit.
There are several obvious difficulties in making such estimates,
starting with the fact that we don't know what the final
database will look like.  But even very crude estimates from
people familiar with a potential use will be better than wild
guesses from those with little familiarity.  Groups which
have already built an ontology or a semantic lexicon can review
the costs of development of their own system and determine, if
a common ontology would be useful, the direct cost savings that
would occur in adopting a standard ontology rather than constructing
an enhanced version of their own system.

    Even without an economic justification of that type, building
this database should be justifiable if it is used primarily as a
research tool.  Accordingly, I hope that we can obtain comments
from all individuals who would be likely to use such a tool in their
research or in building applications, as well as those who wish to
comment on the desirable structure of such a database.
    I plan to organize a birds-of-a-feather meeting at the
upcoming NAACL-2000 conference in Seattle (April 29-May 3) where
those who are willing to consider serving on this committee can meet,
and discuss questions of form and substance of a study such as this,
as well as any comments that have been received at that point.
Accordingly, responses should be sent to me by e-mail if possible
before the 27th of April, or they can be presented and discussed
at the meeting in Seattle.  This study will continue for at least
three months, so comments will be welcome and are likely to be
valuable after the meeting as well.
    In the discussions I had concerning this topic with other
attendees at the 1999 ACL meeting, the first question was of course
what type of ontology is being proposed.  The general structure as
well as detailed technical questions can only be resolved in the
course of preliminary discussions among those who will participate
in the construction of the database, as well as in the construction
phase.  But for the sake of discussion, I have described below some
characteristics that will likely need to be included in such a
database.  The final form of the ontology, if it is to be useful for
Computational Linguistics, will have to include substantial lexical
knowledge, or will have to be tightly integrated with lexicons built
separately.  Rather than call it an "ontology" it might better be
referred to as an "ontological lexicon," although there should be a
core conceptual component in the ontology which will be language-
neutral.  One of the purposes of formation of this committee is to
obtain a wider range of comments concerning desiderata for the
structure of such a database.
    In addition to questions about how such an ontological
lexicon would be structured, many at the ACL meeting had other
questions.  I have reproduced below most of the questions that were
asked, and indicated some potential answers.  It may well be that
nothing suggested here will ultimately find itself accepted
unchanged in the final result of construction of this database, but the
important issue is that construction of some such database will be
essential to provide a common tool that will permit more effective
widespread collaboration in research toward human-level
understanding and generation of language.

========================================
What Kind of Ontology is Being Proposed?
========================================
    What is being discussed here is the need for a database having
two main components: (1) an upper ontology of fundamental concepts,
represented in logical format, which are sufficient to serve as the
building blocks for construction of all of the more complex concepts
that are used in any given field; and (2) a basic lexicon of defining
words, in which the word meanings are represented using the same
set of fundamental concepts, and which are sufficient to define
all of the words of the language.  Each word in the lexicon will
also have an associated definition using the defining vocabulary,
which will in some cases look like an ordinary dictionary definition.
Over time, both the ontology and lexicon can be expanded to
include more specialized or less common concepts, but the main
goal for the initial phase should be to specify the minimum set
of defining concepts, semantic relations, and axioms for the
ontology, and the minimum set of defining words for the
associated lexicon.
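    As a purely illustrative sketch of the two components described
above, the following Python fragment pairs each lexicon entry with a
concept identifier and a plain-language definition, and checks that
the definition uses only words from the defining vocabulary.  The
class name, field names, and toy data are all assumptions made for
this example; nothing here reflects an agreed design.

```python
from dataclasses import dataclass


@dataclass
class LexEntry:
    word: str        # the word being defined
    concept: str     # identifier of a concept in the upper ontology
    definition: str  # definition written with defining-vocabulary words


def undefined_words(entry, defining_vocab):
    """Return the words in a definition that lie outside the vocabulary."""
    words = entry.definition.lower().split()
    return sorted({w for w in words if w not in defining_vocab})


# Hypothetical toy data, for illustration only.
defining_vocab = {"a", "an", "animal", "young", "dog", "that", "barks"}
entries = [
    LexEntry("puppy", "YoungDog", "a young dog"),
    LexEntry("kitten", "YoungCat", "a young feline"),
]

for entry in entries:
    print(entry.word, undefined_words(entry, defining_vocab))
```

A check of this kind gives one concrete meaning to a "minimum set of
defining words": any word it reports must either be added to the
defining vocabulary or be replaced in the definition.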
    This description evades some controversial issues regarding
what constitutes "words" and "definitions".  It is understood that
many polysemous words have vague or plastic meanings, dependent
on context, and for such words an exhaustive list of meanings
cannot be specified; and many words cannot be defined by necessary
and sufficient conditions.  All that can be recorded in a
database of this kind are the necessary characteristics of
word meanings, and perhaps some markers indicating when variations
in meaning can be expected in linguistic usage.  This will be
an attempt to record as much as can be agreed on about basic words
and concepts at the present state of the field.  Applications that
need to handle ill-defined words will need additional structure
beyond what can be included in a standardized lexicon.
    The conceptual component of this database would be equivalent
to an "upper ontology" or "top ontology" (although this term is
used by different people to indicate ontologies of somewhat
different sizes).  Specifying the meanings of words using a basic
ontology of this type constitutes in effect a theory of the
meanings of the words.  A realistic lexicon will need to include
not only single words, but fixed collocations and probably also
word combinations that are not normally considered idioms but
have some non-compositional character.  The lexicon can include
not only the word meanings in logical format, but any other
data associated with word meaning or usage which is useful
for applications.  For example, in addition to part-of-speech
or etymological data, the lexicon could include verb case frames,
which would to some extent duplicate data in the verb
definitions, but in a different format, perhaps easier to use for
some purposes.  Statistical data on word associations would be
another useful component.  Though not essential, it could be
easily included when available.

    Specifics of what will be included and how the data will be
structured can only be decided by those participating in the
construction of the database; the remaining comments in this
section are personal suggestions, which may not be adopted by
the project participants.

   The conceptual elements in the ontology will be defined in a
logical format, but there are two principles which could make
the database more widely acceptable and easier to use:
(1) concepts which are not lexicalized in any language as
single words or fixed collocations can be included in the
ontology, but should be used only where there is some cogent
need; and all concepts in the ontology will have an associated
definition in some language (usually English).  (2) Ideally
there will be a "definition parser" that can take such a defining
string and produce the logical structure that it is intended
to define.

    The emphasis in this project is on the most general words and
concepts, so that a common defining vocabulary of concepts can
be developed which, if used for defining terms in specific
applications, will allow some significant level of conceptual
communication between applications developed by independent
groups.  Applications that process complex information but
are not required to understand linguistic phrases, such as
database applications or electronic commerce, can use the
ontology, and in theory could ignore the lexicon.  Linguistic
applications would use the lexicon, and, if any level of
conceptual understanding is required, would also
use the word definitions in logical format, which will usually
also require the use of the basic ontology.  (In some cases
a linguistic application may use the lexicon and associated
definitions with minimal reasoning, and the lexicon would
function in such cases as a thesaurus or simple semantic
network, such as WordNet).

   Different ontologies have already been developed by a number
of different groups for various purposes, but in general their
structures are so different that transferring information from
one system to another is very time-consuming or error-prone.
The difference between this ontological theory and others which
have been proposed thus far lies mostly in the size of the database
and the extent to which it will both include and represent a
consensus of the different theories (i.e., ontologies and lexical
semantic representations) that have been developed thus far by
independent groups.  What would be very useful for both research
and applications development is to have at least one well-developed
defining vocabulary freely available to all potential users,
constructed by representatives of most or all of the existing ontology
and lexicon groups and containing as much as possible of the
compatible information which each of these groups could contribute
to a common effort.  In addition to the core database, user interfaces
and applications programming interfaces should be developed, as an
integral part of the project, to make the database as easy as possible
to learn and use.
    The representations of the concepts, and through them the
meanings of words, will need to be specified ultimately at a logical
level that will allow automatic reasoning.  The existing Knowledge
Interchange Format (KIF) and Conceptual Graphs (CG) standards could
serve as well-defined theory-neutral formats for storing the meaning
representations.  To be useful for computational linguistics, a
considerable amount of lexical information should also be included.
This distinguishes the proposed database from that of CYC, which
placed primary emphasis on utility in reasoning.  Another important
distinction is that the database must be public domain or at least
freely and easily available over the internet for research, as
is the WordNet system.  Without free availability to any
potential research or applications group, developing the
necessary agreements between groups may be impossible, and most of
the utility will be lost.
    The ontology that will emerge from such a project will most
likely have some variant of the typical structure of a set of entities
connected by relations, since this is the basic model of meaning
representation which has been universally adopted, though with
some significant differences between implementations.  The
relationships may be thought of as semantic relations or as axioms
of the ontology, but it is understood that to be useful for reasoning
the semantic relations must be defined with sufficient precision that
the logical implications of one entity having a specific relation to
another can be calculated unambiguously.  Although in many ontologies
the hierarchy has received the most attention, it is equally important
that the semantic relations be fully agreed upon and well-defined.
The set of basic concepts and semantic relations needed will be
those which are necessary and sufficient to provide logical
definitions of any of the concepts, and by extension, words,
which will be used in applications.  In effect, what is needed is
to create a dictionary with definitions of the words, and a parallel
ontology with the same definitions expressed in a logical format
suitable for automatic reasoning.  The lexicon that labels the
concepts of the ontology should include all of the basic words that
are needed to define all of the other words of the language; the
"words" of the language must eventually include all collocations
which are to any degree non-compositional, that is, whose meanings
cannot be deduced as a predictable combination of the meanings of the
individual component lexical strings.
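    To make concrete the idea of entities connected by relations
whose logical implications can be calculated, here is a minimal
Python sketch.  It assumes one particular reading of a hierarchy
relation (called is_a here): it is transitive, and facts asserted of
a concept are inherited by everything below it.  The class, the
relation names, and the example concepts are hypothetical and are
not drawn from any existing ontology.

```python
from collections import defaultdict


class TinyOntology:
    def __init__(self):
        self.parents = defaultdict(set)  # concept -> direct is_a parents
        self.facts = defaultdict(set)    # concept -> {(relation, value)}

    def add_is_a(self, child, parent):
        self.parents[child].add(parent)

    def add_fact(self, concept, relation, value):
        self.facts[concept].add((relation, value))

    def ancestors(self, concept):
        """All concepts reachable through is_a (transitive closure)."""
        seen, stack = set(), [concept]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def holds(self, concept, relation, value):
        """True if the fact is asserted directly or inherited via is_a."""
        candidates = {concept} | self.ancestors(concept)
        return any((relation, value) in self.facts[c] for c in candidates)


# Hypothetical example concepts and relation names.
onto = TinyOntology()
onto.add_is_a("Puppy", "Dog")
onto.add_is_a("Dog", "Animal")
onto.add_fact("Animal", "capable_of", "Breathing")
print(onto.holds("Puppy", "capable_of", "Breathing"))
```

Two systems that agreed on the concepts but gave is_a different
inference rules would draw different conclusions from the same
assertions, which is why the relations need agreed definitions as
much as the hierarchy does.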
   The lexicon cannot at the initial stage be comprehensive, but it
should also contain those common collocations, such as those which
are produced by the lexical functions of Mel'cuk, which are either
essential for generation of fluent colloquial language, or so
commonly used that their inclusion will improve the speed or
accuracy of the language understanding process.
    As a practical matter, to demonstrate the potential uses of
such an ontological lexicon and to facilitate development of a user
interface that will permit widespread use, there should be a detailed
implementation of this basic defining vocabulary to define
specialized concepts in at least two different areas.  Two that come
to mind are, for example, the medical area, where the basic defining
vocabulary could be integrated with the UMLS system and its
metathesaurus; and the military area, where significant effort has
already been expended to apply the CYC ontology.  These two are
by no coincidence areas of interest to governmental agencies.
Integration with other specialized ontologies or lexicons might be
proposed and performed by individual groups as part of the project.
Enterprise models, manufacturing, electronic commerce or planning
ontologies would be additional candidates.
    The primary motivation for developing a common theory of
meaning is to allow a greater degree of re-use of research results in
computational linguistics, as well as more direct communication
between different implemented systems which have a linguistic or
conceptual component.

============================================
Why do we need a common defining vocabulary?
============================================
   Any difference between two systems in the internal representation
of words or concepts must inevitably lead to some difference in the
inferences that the two systems make from the same data.  Thus
without some common basis for defining the meanings of the different
concepts used in different systems, the transfer of knowledge
between systems will be impossible, time-consuming, or
highly error-prone.  The need for a common vocabulary of defining
concepts is felt not only in the field of natural language
understanding, where communication is the primary goal, but also in
other fields of Artificial Intelligence, wherever conceptual
information painstakingly entered into one system could be useful
in another system.

    It is clear that in some areas of research in Natural Language,
semantic representation of word meanings is less important than in
others.  Research in speech-to-text conversion, for example, and in
parsing methodologies, has progressed without the use of semantics.
Statistical methods have also been shown to be useful for some
practical purposes, though the extraction of the meanings of texts is
beyond the capabilities of such a methodology by itself.  It is also
true that groups doing research with systems which will not interact
at a conceptual level with other systems have a great degree of
freedom in choosing representations of meaning which may be
suitable for their purposes even if not usable in other systems.  We
would hope that groups whose research does not immediately
require detailed semantic representation of meanings will
nevertheless recognize its importance for the progress of research in
language understanding, and not raise objections to this project
unless the objections address the feasibility of the goal.
    The developers of an ontological lexicon will be those
groups working specifically on methods to represent word
meanings, but the need for a common representation of meanings of
words and texts is felt directly also by those whose research involves
some level of understanding, such as in information extraction,
message understanding, word sense disambiguation, text categorization,
machine translation, and database interoperability.
    The difficulties caused by a lack of common conceptual
representations affect not only NLU and the database and expert
systems that CYC has been applied to, but many areas of AI.
In a recent issue of the IEEE Intelligent Systems (January/February
2000) several commentators discussed the state of AI and some of
those comments reflect this problem indirectly:
Nils Nilsson commented that "AI shows all the signs of being in
what the late Thomas Kuhn called a pre-paradigmatic, pre-normal-
science stage.   It has many ardent investigators, arrayed in several
camps, each claiming to have the essential approach to intelligence
in machines. . . .  It might be that intelligence is the kind of
multiplex for which no single science or paradigm will ever emerge."
Donald Michie stated: "The most notable nontrend [in AI] has
resulted from consistent disregard of the closing section, Learning
Machines, of Turing's 1950 paper. A two-stage approach is there
proposed:
1.  Construct a teachable machine.
2.  Subject it to a course of education.
   Far from incorporating Turing's incremental principle, even the
most intelligent of today's knowledge-acquisition systems forget
almost everything they ever learned every time their AI masters turn
to the next small corner of this large world."
A common basis for representation of knowledge will help to
overcome these problems, and help the field move toward the normal
scientific paradigm, enabling more rapid advances by allowing
investigators to study the same phenomena and compare
details of results more directly.  In computational linguistics
research, having at least one common detailed theory of word
meanings for the defining vocabulary will provide a powerful tool
for progress toward the ultimate goal of human-level language
understanding.

===============================================================
Wouldn't it be better to develop a common ontology cumulatively
by contributions from existing research groups rather than try
to build a larger unified project?
===============================================================
   The construction of an ontological lexicon for natural
language understanding is different in several important ways from
most areas of scientific research, where ideas and results from small
independent groups provide the bulk of the individual contributions
to evaluate or elaborate the theories of each field.  The
predominance of original contributions from small groups is true in
most areas of natural language research as well, but for construction
of a large ontology and lexicon for use as a tool in research, the
usual research process is less effective.  The main problem is the size
and complexity of a realistic ontology, and the intimate and multiple
interrelations of its component parts.  To specify the meanings of
the defining vocabulary is to build a fundamental ontology of concepts
and then to construct a theory of the meanings of words using those
concepts.  This endeavor has more of the character of an engineering
project than of a research project, in that it is the construction of
an artifact which has many complex interacting parts.  It may be in
theory possible to achieve the same result eventually through small
independent contributions of ideas and elements, but such a process
is likely to be much slower than a coordinated project, and will be
less likely to achieve the goal of a widely accepted reference
standard within any foreseeable time frame.  In addition, the time
lost in pursuing the development of a common ontology through
uncoordinated effort may well prove eventually much more
expensive, through the lower efficiency both of research and of
implemented programs developed in the interim, than would the
development of the same database by a single adequately funded
coordinated effort.  Furthermore, the problems of coordination of
groups with different approaches to ontology development,
admittedly difficult even in a single properly funded project, might
well be insurmountable without the impetus of deadlines for
agreement on specific subproblems within an overall plan of
development.
   One possible alternative is the elaboration of an existing
ontology, such as WordNet, by the cumulative addition of new
functions or data.  This will, one may hope, proceed in any case
until a coordinated project is funded.  But for such additions to
accumulate into a unified system, there would still need to be a prime
coordinator - in this case presumably the WordNet group.  Their
own views would then necessarily predominate, and since these
have been driven by specific goals and objectives, which are
different from the goals of other groups, the resulting database
would not represent the best common approach to the varied
problems, as would a project initiated de novo for the specific
purpose of answering a wide range of research and practical goals.
It is also difficult to imagine that the total cost of proceeding
in that fashion would in the end be any less than that of a single
coordinated project, which would also contain input from WordNet
as well as from other existing systems.
    The worst-case scenario is one in which several commercial
concerns develop proprietary versions of a natural-language
ontology, of which the largest part is not publicly available.  That
is currently the case with the CYC project, and it appears to be the
direction in which Microsoft's "MindNet" project is heading.  If
such a situation develops, there will not be one but several
competing "standards", none of which will be easily available to
researchers, and which, even if available to some degree, cannot be
enhanced and redistributed by most of those who could improve
such a system.  Such systems will not serve the purpose of providing
a common test bed in which new ideas for representing word
meanings can be tried by many research groups in realistically large
systems, with results distributed to the research community at large.
Proprietary systems are also likely to be less reliable than a public
one and their behavior unpredictable to anyone outside the
development group.

=================================================================
Would non-U.S. groups be eligible to participate in this project?
=================================================================
    Much important work on ontologies has been performed
outside of the U.S., and I would expect that participation by non-
U.S. groups would be welcomed, indeed would be essential if the
resulting ontology, which should be language-neutral, is intended
to serve as a standard throughout the scientific community.  Since
the emphasis would be on creating a defining vocabulary of
general concepts sufficient to define all specialized concepts,
the experience of those whose native language is other than
English will be particularly valuable in recognizing when
useful basic concepts are lexicalized in one language and
not in others.  There are already several European projects
which are aimed at the construction of common ontological and
lexical resources, and it would be a great loss if those groups
did not participate in an inclusive effort.

    The language-specific elements of the lexicon will of necessity
concentrate first on English, since creating a computational lexicon
even of one language is already a very large task.  Groups from the
UK could of course work on the English lexicon.  But if at all
possible, groups with experience in automatic translation or other
multilingual applications should be requested to participate, since
some of the more subtle and difficult problems in knowledge
representation may be highlighted by the difficulties found in
accurate translation.
     It is difficult to predict to what extent the inclusion of
lexicons for other languages will be feasible; groups which
presently concentrate on translation will presumably want to
include their parallel lexicons for languages other than English.
Ideally, the European research funding agencies might fund European
groups willing to coordinate their work with this project, who
could concentrate on non-English languages.

================================================================
My notions of how to represent concepts change every few weeks.
How can we fix on a single representation at this time?  Do we
know enough at present to justify a major project?
================================================================
   It goes without saying that an ontological lexicon, like the
language it represents, will change over time, but a legitimate
question is at what point it is appropriate to undertake a first
effort to construct a standard tool that can be used and tested
by the entire research community.  There have not been any major
fundamental changes in the prevailing entity-relationship paradigm
for representing knowledge over the past ten years, and the paradigm
has been sufficiently well investigated at a fundamental level that
there seems to be no reason to delay trying to build a consensus
ontological lexicon based on the best knowledge now available.
This will provide a research tool that can help to discover the
strengths and weaknesses of different aspects of this paradigm,
and it can include all the elements deemed important by those who
have been studying meaning representation for some time.  The
database can then be thoroughly and widely tested for conformity
to the realities of language use, and for utility in reasoning
about data.  The main motive for this project is the observation,
from prior experience, that the fundamental concepts of any language
are so intimately connected with each other that no theory of the
meaning of any of its component concepts can be tested in a realistic
setting unless some consistent representation of the entire
fundamental vocabulary is available.  We therefore need some
starting point with a realistically large database representing most of
the fundamental concepts of a language, in order to make effective
tests of whether any specific individual components conform to the
way people actually use words and concepts.

================================================================
For how long will the ontology constructed be useful?  Isn't it
likely to change and need modification or replacement?
================================================================
    Based on the lifetimes of existing ontologies, we can expect
that a major effort at developing a standard ontology will result in a
database that will be useful for research and practical purposes for at
least ten years.  To avoid getting outdated, the ontological lexicon
will need a core group to provide continuing effort at maintenance,
at a minimum level of effort possibly one-fifth as intense as
that of the initial development.  It is conceivable that eventually
some fundamentally different structure for meaning representation
will be proposed and widely accepted, in which case it would be
difficult to predict how much of the structure of this proposed
ontology would be reusable.  But more likely the ontology will
continue to be useful for decades by modification, replacement, or
addition of new components, with most of the structure remaining
stable for years.  It is also unlikely that any new meaning
representation paradigm could gain wide acceptance unless some
substantial effort such as this provides a basis for thorough testing
of the entity-relation model on a realistic scale.
    As a theory of the meaning of words, this database will
doubtless be modified and elaborated, as are most scientific theories.
Theories in general are tools for organizing research; they provide a
framework in which to formulate tests to confirm or refute aspects
of the theory.  They are useful for a time to make collaborative
research on a topic possible, after which they may be modified or
abandoned.  In a theory with as many individual parts as an upper
ontology, we can assume that some parts will be found inadequate
for some purposes, while others may remain unmodified for a long
time.  The core maintenance group, or perhaps a committee with
broad representation, would be responsible for making and
publicizing the changes in each new revision.  Having this theory
easily available to the entire research community will maximize the
likelihood of finding and addressing inadequacies in its structure.

=============================================================
Ontologies have not been shown to be notably useful for NLU.
Why spend resources building a bigger one?
=============================================================
    There is apparently a widespread notion that ontologies, and
specifically the CYC ontology, have been tested for utility in
Natural Language Understanding and have not proved useful.  It is
important to address this perception.  In fact, attempts to use CYC
in natural language have been very modest in terms of time spent, and
the main virtue of CYC, its logical structure, has scarcely been
tested at all in NLU applications.  It is also important to recall
that CYC was not designed with use in NLU as a primary objective
(as the ontological lexicon suggested here would be), although Lenat
had expected it would be useful for that purpose.  CYC has two
other important flaws which would not apply to an ontology built
as suggested here -- (1) CYC was built by a single group with
a specific viewpoint, and did not include input from many other
practitioners of diverse schools of knowledge representation,
ontology and lexical semantics.  Regardless of its internal
consistency, it cannot serve as a focus to bring together a large
number of groups to use it as a common reference standard; and
(2) most of CYC is not publicly available, and use of CYC
presents difficult legal issues.  Although it can be useful
for specific industrial contractors, its lack of public
availability makes it unsuitable for use as a research tool;
even when made available to academic groups, detailed results
of research cannot be freely described, nor modified versions
redistributed to other groups.
    The study that may most directly account for the perception
of CYC's inadequacy was performed in 1996 by Nirenburg's group
at NMSU ("An assessment of Cyc for Natural Language Processing",
MCCS-96-302, available on the Web at:
http://crl.nmsu.edu/Research/Pubs/MCCS/Abstracts/mccs-96-302.htm).
This study of the utility of CYC for Natural Language research
found that several desirable features were absent.  It did
not, however, suggest that the existing structure could not be used,
rather that it needed additional components or structures to be more
useful.  It did not draw any negative conclusions about ontologies
generally, and indeed that study group has its own ontology which
it finds more directly useful for its purposes.
    Perhaps of greater relevance is the widespread use of
WordNet and EuroWordNet.  Although these semantic networks do
not qualify as logic-based upper ontologies, as would the basic
ontology which would be constructed as suggested here, they do
contain many conceptual relations which would probably be
widely accepted as part of the larger ontological lexicon
which could be constructed if adequate funding were available.
The wide use of WordNet does provide strong evidence that
when well-structured and easily usable resources are publicly
available, they will prove to be valuable tools for research.
This is scarcely surprising, as progress in many types of
research is limited by the tools available.
     Since there has not yet been an ontology constructed with
even close to the amount of detail that is needed for understanding
of language, it is far too early to draw conclusions as to how
useful a fully-developed and publicly available ontology would be.
One of the purposes of developing a comprehensive ontological
lexicon would be to discover how useful the present ideas about
knowledge representation really are, without the impediments of having
multiple small and incompatible sets of data on word meanings.
Smaller ontologies have in fact been shown to be useful to some
extent in language-understanding tasks, such as disambiguation, but
thus far those available have not been shown to dramatically
improve performance.  Nor should they necessarily.  As mentioned,
a comprehensive ontology does not by itself constitute a
language-understanding system; there are many additional aspects of
language understanding systems that must be developed as well.
     Although an ontology is not the only component of a
language understanding system, or even the main one, and its
usefulness depends directly on the systems in which it is used,
some form of common ontology is a necessary prerequisite for sharing
research results in language understanding, wherever the actual
meanings of linguistic expressions need to be represented.  Many
specialized ontologies have been constructed which are not
designed to be used in language understanding.  But until a common
representation of word meanings is used by more than one or two groups,
advancement toward human-level understanding of language will be very
difficult and is likely to be slow and inefficient.  The proposed
ontology will be one intended to be useful for NLU as well as for
other purposes, such as database interoperability.  It will therefore
need to be connected intimately with the lexicon, and as much as
possible of the type of detailed lexical information that is found
in Mel'cuk's Explanatory Combinatorial Dictionary will have to be
included.  As mentioned above, what is needed is better thought
of as an ontological lexicon.

====================================================
Would there be any images or graphical information
representation in the ontology?
====================================================
    It may be that some degree of imagery or graphical
representation is required to adequately represent certain
concepts or word meanings.  Whether it will be feasible to include
such data in the first version of an ontological lexicon will have
to be decided by those participating in the organization of the
effort.  It would be helpful if individuals who have worked on
graphical information representation were to participate in this
study.

==============================================================
Different people use different internal ontologies, and
to some extent different lexicons.  How can we include
all of those differences in a single consistent database?
==============================================================
   In order to serve as a completely accurate medium of
communication between agents, the word senses of a language must
be identical between speaker and listener, or some degree of
miscommunication or ambiguity will result.  It happens in human-
to-human communication that use of words in different senses by
different people causes errors in the communication process.  It will
also be true that in human-to-computer communication similar
differences in internal representation will lead to some
miscommunication, though this can be eliminated in computer-to-
computer communication.  Special procedures for recognizing when
variants of meaning are being used will probably have to be part of
the implementing systems, and may not be includable in the
ontological lexicon itself.  Words that are commonly used in variant
senses, or have productive polysemous meanings, can be marked as
such, and the broadest senses can be included, even though the
procedures for recognizing variants of meaning may not be
contained within the lexicon.  These are the cases where recording
collocational use may be especially helpful to disambiguate the
sense.
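     As a purely illustrative sketch of the kind of entry described
above, the following Python fragment marks a word as commonly used
in variant senses and records collocates for each sense.  All of
the names (Sense, LexicalEntry, rank_senses) and the sample data
are hypothetical, and the simple counting of shared collocates is
only a stand-in for whatever recognition procedure an implementing
system would actually use; it is not a proposed design.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Sense:
    """One broad sense of a word, with recorded collocates (hypothetical)."""
    gloss: str
    collocates: Set[str] = field(default_factory=set)


@dataclass
class LexicalEntry:
    """A lexicon entry; variant_prone marks words often used in variant senses."""
    lemma: str
    senses: List[Sense]
    variant_prone: bool = False


def rank_senses(entry: LexicalEntry, context_words: List[str]) -> List[Sense]:
    """Order senses by how many recorded collocates occur in the context."""
    context = {w.lower() for w in context_words}
    return sorted(entry.senses,
                  key=lambda s: len(s.collocates & context),
                  reverse=True)


# Sample data invented for illustration only.
bank = LexicalEntry(
    lemma="bank",
    senses=[
        Sense("financial institution", {"money", "loan", "account"}),
        Sense("sloping land beside a river", {"river", "water", "shore"}),
    ],
    variant_prone=True,
)

best = rank_senses(bank, "they walked along the river to the water".split())[0]
print(best.gloss)
```

     Note that the ranking procedure sits outside the entry itself,
reflecting the point that such procedures may belong to the
implementing systems rather than to the lexicon.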
     It is necessary to build at first a basic lexicon and ontology
of words which identifies the most common senses that are used by
almost all native speakers of a language, and from that subsequently
to build up and include less common or idiosyncratic variants,
wherever such variants have some significant level of usage.  The
differences in their internal lexical representation that people
have, if they are sufficiently widespread, may have to be treated
similarly to multiple discrete senses of words, or the semantic
plasticity of polysemous words.  In the real world, of course, widely
variant use of language can be observed; any idiot or psychotic
individual may produce a string of seemingly linguistic utterances
that are completely uninterpretable by any other person, however
skilled in the language used.  The project is intended to produce
only a basic reference vocabulary, and the recording of highly
individualistic, poetic, and idiosyncratic usage of words will be
beyond its scope.  Most specialized uses will have to be dealt with
by specialized systems built to handle such variation in usage.
It is the common defining vocabulary which would be the main concern,
though the inclusion of some standardized or common uses of specialized
technical words will be valuable, limited only by the time and
resources available for extension of the database core.

=================================================================
Will funding for construction of such an ontology reduce funding
for other areas of Computational Linguistics?
=================================================================
    In any recommendation made to Congress for funding of this
project, it must be strongly emphasized that the creation of a
standard ontology/lexicon will not substitute for other aspects of
computational linguistic research, but is only a tool for such
research.  The reduction of funding for other aspects of CL research
would be counter to the purpose of building the ontology, and would
squander the resource that would be built at significant expense.
Those who contact funding agencies or members of Congress to
recommend this project need to be sure to emphasize this point.

======================================================================
Will recommendations by an ACL committee for congressional funding
constitute lobbying and jeopardize the tax-exempt status of the ACL?
=======================================================================
     A study of public issues which includes comments on the
need for and effects of government action does not constitute
lobbying, and is performed routinely by institutions and think tanks,
such as ECRI, without affecting their tax-exempt status.  The ACL
will not as an institution make recommendations directly to
members of Congress.  Individuals who are interested in the subject
may cite an ACL study to support the need for funding.  An
unfunded and relatively informal study of this type is unlikely
by itself to carry sufficient weight to move Congress to action,
but ideally it could prompt the organization of a more formal study
of the need for funding of a standard ontology, for example by the
National Academy of Sciences, or by think tanks concerned with
technical issues, whose opinions are valued by members of
Congress.

=======================================================================
How can we expect that ontologists and lexical semanticists with
different viewpoints could ever be induced to agree on a common
approach?
=======================================================================
     It will indeed likely be difficult to forge agreements on
specific issues, but where there is a recognition of the need for
compromise, it can be accomplished.  Building research resources is
in many respects an engineering rather than a research activity, and
the mindset required for such a task is quite different from the
attitudes which are successful for basic research.  One example of
this difference was eloquently narrated in Kip Thorne's book "Black
Holes and Time Warps" in which he described the analogous
difficulty in coordinating several teams, each accustomed to basic
theoretical research, in a new effort to design and build an expensive
interferometric detector for gravity waves:

"Within each team the individual scientists had free rein to invent
new ideas and pursue them as they wished for as long as they
wished; coordination was very loose.  This is just the kind of culture
that inventive scientists love and thrive on, the culture that
Braginsky craves, a culture in which loners like me are happiest.
But it is not a culture capable of designing, constructing, debugging,
and operating large, complex scientific instruments like the several-
kilometer long interferometers required for success.
  To design in detail the many complex pieces of such
interferometers, to make them all fit together and work together
properly, and to keep costs under control and bring the
interferometers to completion within a reasonable time requires a
different culture: a culture of tight coordination, with subgroups of
each team focusing on well-defined tasks and a single director
making decisions about what tasks will be done when and by whom.
  The road from freewheeling independence to tight
coordination is a painful one.  . . ."

   He goes on to say that, with reluctance and prodding from the
funding agency, the freewheeling and independent scientists made
the necessary adjustments.  An ontological lexicon for
Computational Linguistics is of course a different type of research
tool from a gravity-wave detector (and probably of much more
immediate practical utility), but the need to build a unified structure
which is tightly coordinated and internally consistent may be even
greater than that for building physical measuring instruments,
because of the likely sensitivity in an ontology to inconsistencies
between even widely separated parts.  Given the imperative for close
coordination in ontology construction, is there a plausible way to
achieve the necessary cooperation of groups with disparate
viewpoints?  I will suggest one possible scenario.
    If the prospect of organizing development of a standard
ontology, as suggested here, reaches the stage where funding looks
like a realistic possibility, discussions or a conference should be
organized among those who would want to participate in its
construction, to determine how many of the disparate systems could
be integrated into a single consistent system.  In such discussions,
the teams will develop some appreciation of how likely their
own views are to be adopted, intact or in modified form.
Since the most important goal will be to create a database that will
be used by the largest number of research teams, at some point
disagreements about what formats or approaches to adopt will
probably have to be resolved by some form of voting among
participating groups, and the project director will need to
be able to resolve any issues not amenable to the voting approach.
Any group which recognizes that its own approach is incompatible
with the majority and is likely not to be adopted, can try to argue for
its technical superiority, but if the arguments are not accepted, such
a group will face the choice of participating and adapting its own
system to the dominant approach, or not participating, and
continuing its own independent line of research.  There will
presumably be some groups interested in exploring novel
approaches to knowledge representation that will want to continue
along lines different from that adopted by the majority.  However,
from discussions I have held with people involved in investigation
of word meanings, there appears to be a wide recognition of the
need for some common database, and many or most are likely to
participate in such a project.
    By the time that project proposals need to be submitted, there
should be some preliminary agreement as to the likely outline of the
general structure of the database that will be developed.  The
disagreements over details will need to be resolved in the course of
actual funded development, but there will need to be some mechanism,
whether by voting of an executive committee or decision of a project
chairperson, to resolve residual disagreements by fiat.  The manner
of selection of the project chairperson would ideally include
substantial input from the likely participants in the project.
    It is likely that to accommodate input from as many
existing groups as possible, the number of persons funded for this
project will approach or exceed two hundred over an initial
development stage of three to five years.  The required funding for a
project of that size will be close to two hundred million dollars
($200,000,000) over the five years.  This will almost certainly
require a special appropriation from Congress.  Other areas of
science, including highly theoretical fields with few immediate
practical applications, have succeeded in obtaining funding for
projects comparable to and often much larger than this (the
*annual* maintenance budget of the Hubble telescope is about
$200 million).  The possibility of congressional funding is
realistic, provided that an adequate justification can be agreed
upon among practitioners in the field.  That is the purpose of
forming this committee, and I hope that all of those who may have
some use for an ontological lexicon will respond with information
about potential uses that will allow us to demonstrate the
cost-effectiveness of such a project.
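    As a rough check on the figures above, the short Python
calculation below works out the average cost per person-year and
the average annual budget they imply.  It assumes exactly two
hundred funded persons, the full five-year period, and funding
spread evenly over that period; none of these assumptions is stated
above, and the result is only an order-of-magnitude illustration.

```python
# Rough check of the funding figures quoted in the text.
# Assumptions (not stated in the text): exactly 200 funded persons,
# a full five-year period, and funding spread evenly over it.
TOTAL_FUNDING_USD = 200_000_000
PERSONS = 200
YEARS = 5


def cost_per_person_year(total_usd: int, persons: int, years: int) -> float:
    """Return the average funding per person per year."""
    return total_usd / (persons * years)


per_person_year = cost_per_person_year(TOTAL_FUNDING_USD, PERSONS, YEARS)
annual_budget = TOTAL_FUNDING_USD / YEARS

print(f"Per person-year: ${per_person_year:,.0f}")
print(f"Annual budget:   ${annual_budget:,.0f}")
```

    Under these assumptions the average annual budget would be
about one fifth of the annual Hubble maintenance figure cited above.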

___________________________________________________________________
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription: http://www.biomath.jussieu.fr/LN/LN-F/
English version          : http://www.biomath.jussieu.fr/LN/LN/
Archives                 : http://web-lli.univ-paris13.fr/ln/