WordNet

Sat Feb 5 16:51:50 UTC 2000

Doug Cooper <doug at th.net> writes:

>Yes, I'm in the thick of it for Thai right now.  On the particular issue of
>treating roots as a part of speech, I faced a similar set of problems, and
>eventually decided that it mixes two distinct conceptual views, to wit:

>WN/EWN are hierarchies of semantic and physical relations between
>real-world concepts and objects, _not_ morphological relations between
>roots and derived terms -- even though that info is sometimes recorded,
>eg. between adjectives and adverbs; or inferred, as when plurals
>are stemmed.

I have been back and forth across this terrain many times, and here is
what I have found: Morphology is nothing more than a clue, and an
unreliable one.  In some languages it is more reliable than in others, but
it can never be trusted 100%.  What is it for then?  Suppose you are a
child growing up in culture x having language x.  Your brain must somehow
figure out all the various parts of speech for itself.  When you hear a
strange word, the first thing you do is to notice its morphology.  Then,
using the clues the morphology provides, you attempt to slot it within the
syntax of the sentence in which it appeared.  Most of the time, BINGO!  It
fits!  But some of the time it does not, and so you reserve judgment.  Etc.
So morphology speeds up the learning of new words, but it can never be
relied upon as a source of semantic information.

Now, getting back to the ontology, one semantic node can link to several
words, and one word can link to several semantic nodes.  Words are sounds
or graphs in the outside world.  In other words, SYMBOLS.  The semantic
nodes of the ontology are NOT.  They are simply connecting points to words
and to other semantic nodes.  The real information lies in the ways in
which they are connected.  This is a hard thing for us to understand at
first, because all of our lives we have been working with symbols, and not
with these singularities called nodes, which only have meaning when they
are connected.
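
The many-to-many linkage just described can be sketched as follows.  This
is only an illustration, not SEMLEX itself: the class name, the numeric
node ids, and the words "bank" and "shore" are all assumptions made for the
example.  Note that the nodes carry no label of their own; they are
identified only by what they connect to.

```python
from collections import defaultdict


class Lexicon:
    """Hypothetical many-to-many index between words and semantic nodes."""

    def __init__(self):
        self.word_to_nodes = defaultdict(set)
        self.node_to_words = defaultdict(set)

    def link(self, word, node_id):
        # Record the link in both directions so either side can be looked up.
        self.word_to_nodes[word].add(node_id)
        self.node_to_words[node_id].add(word)


lex = Lexicon()
lex.link("bank", 101)   # one word ...
lex.link("bank", 102)   # ... linked to two nodes
lex.link("shore", 102)  # one node linked to two words
```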

Because of the above facts, although the idea may be tempting, there
really is no shortcut in which one "stem" word can be relied upon to
generate a paradigm of derived symbols.  Each semantic node in
the ontology has to be (tediously) linked to its appropriate "derived"
word.  This process can be automated somewhat (and thus sped up) by
writing computer software capable of generating paradigms from the "stem"
word.  This saves a great deal of typing.  I have done this successfully
for English, and it should be easier for "purer", older languages like the
Austronesian languages we have been discussing.  But the results of this
generation process must then be carefully checked to make sure that each
"derived" word really links to the right semantic node, and is assigned
the right part of speech in the lexicon.
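
As a rough sketch of the generate-then-check workflow, assuming a
deliberately naive English verb rule (the function name and record fields
are hypothetical, and this is not the generator mentioned above): every
generated form starts out unlinked and unchecked, so a person still has to
assign the node and part of speech.

```python
def candidate_paradigm(stem):
    """Generate candidate verb forms from a stem using a naive rule.

    Every form is returned unlinked and unchecked: the node and part of
    speech must still be assigned and verified by hand.
    """
    if stem.endswith("e"):
        forms = [stem, stem + "s", stem + "d", stem[:-1] + "ing"]
    else:
        forms = [stem, stem + "s", stem + "ed", stem + "ing"]
    return [{"form": f, "node": None, "pos": None, "checked": False}
            for f in forms]


jump_forms = [entry["form"] for entry in candidate_paradigm("jump")]
die_forms = [entry["form"] for entry in candidate_paradigm("die")]
```

For "jump" the rule yields jump, jumps, jumped, jumping.  For "die" it
yields "diing" instead of "dying", which is exactly the kind of error the
manual check is there to catch.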

>The EWN xpos_near_syn relation (a typical cross-POS relation)
>is consistent with this view.  It is defined roughly as: "if (something)
>X's, then Y takes place" (eg. dies, death).

It is precisely because of such problems as this that in an earlier
posting I said that understanding the theoretical principles behind
semantic relationships is important.  Without a firm grasp, there are
simply too many things to go wrong.

Let us have a look at your example.  It deals with a semantic node linked
to die, dies, died, dead, dying.  Now recall this theorem:

All texts encode the maintenance of or else the assumption of various
states by various things, and the agents causing these maintenances or
assumptions of states--nothing else.

So what is being encoded by, say, "animal dies"?  "Animal" transitions to
the state of being "dead".  Thus, in my words, the state of being "dead"
flows to "animal".  Hence my phrase, "state flow".  This may seem silly at
first, but in fact it is very important, because it tells us what kinds of
syntactic relations can occur between "animal" and "dead" and between
"animal" and "die" at a surface (natural-language) level.  In other words,
this information can be used by an automated system for parsing and text
generation.

And this example of yours is an excellent one because it brings out a
further consideration.  Like "die", the verb, "kill", also results in
things being "dead".  But this fact is not explicitly encoded in "kill".
In other words, it may be there implicitly, but it is not so explicitly.
What IS there explicitly is that the thing being "killed" will end up in
the "killed" state, which is also "dead", but a little bit more.

What I am saying is that each verb has but one and only one EXPLICIT state
that it can transfer to its patient.  There may be other attendant
IMPLICIT states, but there is but one and only one EXPLICIT state, which
in the case of "kill" is "killed".

Now, the important thing for builders of ontologies is to clearly
understand these principles (among others the explicit-implicit thing just
described) in order to avoid entering erroneous information.  Thus, the
semantic nodes for "die" and "kill" must be kept separate (not
merged), and the participle, "dead" must not be linked to the same
semantic node as "kill", and the participle, "killed", must not be linked
to the same semantic node as "dead", etc.
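
A minimal sketch of how such a constraint might be checked mechanically.
The node names, table layout, and function names are hypothetical; the
"die" forms are the ones listed above, while "kills" and "killing" are
assumed here only to fill out the second node.

```python
# Surface forms linked to each semantic node (node names are placeholders).
NODE_WORDS = {
    "DIE": {"die", "dies", "died", "dead", "dying"},
    "KILL": {"kill", "kills", "killed", "killing"},  # partly assumed forms
}

# Exactly one explicit state per verb node; implicit states kept apart.
EXPLICIT_STATE = {"DIE": "dead", "KILL": "killed"}
IMPLICIT_STATES = {"DIE": set(), "KILL": {"dead"}}


def shared_forms(node_a, node_b):
    """Return the forms linked to both nodes; should be empty if kept separate."""
    return NODE_WORDS[node_a] & NODE_WORDS[node_b]


def state_flow(node, patient):
    """Return (state, patient): the explicit state that flows to the patient."""
    return (EXPLICIT_STATE[node], patient)
```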

I would be happy to provide some coaching on these principles to save
others some time in the interest of science.  Also, my previous offer of
software or whatever it was, still stands for those willing to hold up
their end of the bargain (see my previous postings).  Meantime, should any
of you get interested and ask me for this, I will also have to go back and
dig up what I wrote, because it has now been so long ago that I have
forgotten (my personal ontology is fast fading).

>While for EWN,
>"preferably there is a morphological link between the two," the
>morphological relation isn't being encoded.  Rather, these are
>two words that relate to the same event -- a relation that is useful
>for, say, information retrieval (I may not be able to find a document
>about John's death, but I may see a headline like 'John Dies').

In an ontology like SEMLEX, the following relations might be found:

animal can die
die act death

There is no reason to limit an ontology to just hypernymy (isa), as many
have done before.  In fact, a good ontology should contain ALL of the
possible semantic links between semantic nodes.  The "act" relation,
above, means that "death" is the action of the verb, "die".  Etc.
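
Relations of this kind can be held as plain (source, relation, target)
triples.  The sketch below is an assumption about one possible
representation, not the SEMLEX file format; the names Link, LINKS and
targets are made up for the example.

```python
from collections import namedtuple

# One typed link between two semantic nodes.
Link = namedtuple("Link", ["source", "relation", "target"])

LINKS = {
    Link("animal", "can", "die"),
    Link("die", "act", "death"),
}


def targets(source, relation):
    """Return every node reachable from source through the given relation."""
    return {link.target for link in LINKS
            if link.source == source and link.relation == relation}
```

With this layout, adding a new relation type is just a matter of using a
new relation string; nothing restricts the store to hypernymy.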

>If the simple fact that an adverb is derived from a particular adjective,
>or a relation like xpos_near_syn, isn't sufficient, I assume that you want
>to store information about how to generate derived forms from roots.

The generation of derived forms from roots, as I pointed out earlier, is
not the function of an ontology, but can be easily accomplished using
separate computer software.

>While this is entirely reasonable, a WordNet implemented in this way
>would be something more along the lines of a generative lexicon.
>Rather than storing (most) adverbs, the rule for making an adverb would
>be part of the adjective entry, and we would lose the programming
>advantages that come from having each node actually filled with all
>its values.  In WN, this stuff is external (eg. in the "morphy" tool).

I take it you mean because WN separates different POS into different
files, which as I pointed out before, is a serious weakness.

Suppose we take for example the adjective, "quick", and the adverb,
"quickly".  These two words MUST link to the same semantic node, as can be
seen by the following sentences:

I knew a quick girl.
She was quick.
She jumped quickly.

In all cases the state of "quick" passes to "girl", but the -ly tells us
that it is a direct dependent of the verb.  It is important to notice that
state flow is independent of grammatical dependency, and that these two
may run in either parallel or opposing directions.

>Consider the problem of intensifiers.  Do we make up a new semantic
>relation is_intensified_by (that points to the intensifier), or do we add a
>subordinate node that has the word+intensifier pair (as 'black' has hyponyms
>jet black, pitch black, coal black, etc.)?  WN (and I) take the second
>position for the simple reason that once the subordinate node is
>fully populated with additional values (like 'sable' or 'ebony'), we
>know all the values in this synset explicitly.

But the two words in "very black" must link to separate semantic nodes,
with a link like the following between them:

black cbe very

This link tells the automated system that black "can be" very, or that the
state of being "very" can flow to "black".  Does "intensely" mean the same
thing as "very"?  If so, then this state-flow potential can be seen by the
following:

black cbe intensely
blackness cbe intense

so that it is obvious that a state flow relationship always exists between
"very" and "black" in the direction of "black".  I have complicated the
issue by introducing a property noun, but I will not even attempt to touch
on that here.

>That, as you point out, "you get more semantic relations than you
>bargained for" is just a fact of life.

As it turns out, the number of semantic relations does NOT explode out of
control because of economization by hypernymy.  Although it would be
impossible to do this subject justice here, the essential principle is
just that if ducks are birds and geese are birds and sparrows are birds
and birds can fly, then ducks and geese and sparrows can all fly, so that
there is no need to forge separate semantic links for each of them.  The
same thing holds true for couples like "very black", because if "anyadj"
is a hypernym of "black", and "anyadj cbe very", then there is no need for
this relation to ever be repeated again.  Software can be written that
automatically slides relations up the hypernym tree in this fashion, thus
avoiding many thousands of unnecessary links.  And this "sliding", in
various forms, is part of machine learning.

>That we don't incorporate lots of
>language-specific info is also unavoidable, ...

On the contrary, ALL language-specific info MUST be incorporated if the
ontology is to be of any real value, and this can be done in a couple of
ways.  The first and simplest is the choice I prefer, namely to create
separate ontologies for EVERY language.  The second is to provide an extra
"language" field for each link and semantic node in the ontology, but this
quickly gets messy.  You end up with one huge ontology for several
languages with a lot of possibility for confusion and human error.

Then, if you go for my "separate ontologies" preference, in order to
relate language A to language B, it is necessary to go in and forge
linkages between the semantic nodes of the two ontologies and make sure
that these are correct.

I prefer this method because I perceive every language as a world unto
itself, and it is a valuable thing to be able to immerse oneself in just
one single language and to see how everything in THAT particular language
works.

>and  IMHO is an issue
>for separate tools.  Using WN to manage sense definitions doesn't
>rule out a 'morphNet' or 'derivNet' for other kinds of relations, and
>would be cleaner for it.

Since I have already worked all of this out most elegantly, you would only
be reinventing the wheel.  What you need to do is to start with what I
have, and move on from there.  I am also a researcher, and so my ontology
is very much a work in progress, and this is why I am interested in
working with you.  In fact, it was just in December that I came upon a
discovery that will forever change everything, and because of which I have
been scrambling to upgrade all of my software.  Fortunately this kind of
thing does not happen every day, but it does, and no matter how much it
hurts to rewrite thousands of lines of code, when there is no alternative,
it must simply be done.

>Indeed, from my point of view the advantage of working with
>WN is that I can use its tools and semantic hierarchy as a skeleton for
>navigating a relatively poorly-lexicalized (or poorly documented)
>L2 data space, with minimal coding.   All I have to do is to tie each
>Thai head/sense ID to the appropriate WN node; possibly extending
>the implicit WN links; eg

But all you will end up with using this approach is a novelty that cannot
be used for anything except personal amusement.  What I am looking for are
tools that can be used for serious analysis and for machine translation.

>"thai_1_1=>foo#1" 'is a member of the foo#1 synset';
>"thai_1_2->foo#2" 'is a hyponym of the foo#2 synset'
>"thai_1_3<-foo#3" 'is a member of foo#3 which, in Thai, should itself
>be treated as a hyponym of its current English synset (because it's
>more of a distinct, well-lexicalized concept)'

>then write a little front end that incorporates this info into the
>data sent to/returned by the vanilla command-line WN front end
>(although you might find the perl toolset at
>http://www.ai.mit.edu/~jrennie/WordNet/  useful).
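
For reference, the three link notations quoted above could be read by a
small parser along these lines.  This is a sketch only: the regular
expression, the label strings, and the function name are assumptions, and
it does not call any WordNet tool.

```python
import re

# local id, one of the three arrows, then a WordNet-style "word#sense" target.
LINK_RE = re.compile(r"^(?P<local>\w+)(?P<op>=>|->|<-)(?P<synset>\w+#\d+)$")

# Hypothetical labels for the three arrows described in the quoted text.
MEANING = {
    "=>": "member_of",
    "->": "hyponym_of",
    "<-": "member_treated_as_hyponym",
}


def parse_link(text):
    """Split a link such as 'thai_1_1=>foo#1' into (local_id, kind, synset)."""
    match = LINK_RE.match(text)
    if match is None:
        raise ValueError("unrecognized link: " + repr(text))
    return match["local"], MEANING[match["op"]], match["synset"]
```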

But using something like my SEMLEX, you can:

1. Add or delete any semantic relation type you desire while running the
program itself.

2. Do the same for any part of speech.

3. Name your parts of speech and semantic relations by any mnemonics you
desire.

The above are for my older version of SEMLEX.  In addition, a newer
version will be able to use any semantic node linked to a verb or
preposition as operator, so that one will be able to enter more specific
SVO information, such as

people eat rice
cats eat mice
girls wear dresses
sugar in coffee
boat on water
fly through air

Etc.
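
As an illustration of what operator-keyed entries might look like in
memory (an assumed layout, not the actual format of the newer version; the
names STATEMENTS and BY_OPERATOR are made up), each three-word statement
can be split and indexed by its middle term:

```python
from collections import defaultdict

STATEMENTS = [
    "people eat rice",
    "cats eat mice",
    "girls wear dresses",
    "sugar in coffee",
    "boat on water",
    "fly through air",
]

# Index (left, right) pairs under the verb or preposition acting as operator.
BY_OPERATOR = defaultdict(list)
for statement in STATEMENTS:
    left, operator, right = statement.split()
    BY_OPERATOR[operator].append((left, right))
```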

>Hope this is helpful,
>Doug

Same-same.

--CD.