[Corpora-List] Trial Dataset available for Task4 at SemEval-2007

Roxana Girju rox_fox11 at yahoo.com
Wed Jan 3 17:05:59 UTC 2007


[Apologies for duplications] 
[Please distribute widely]

The trial dataset for SemEval-2007 Task 4
(Classification of Semantic Relations between
Nominals) is now available. Interested participants
can access the three main files from the following
websites:

(1) Guide for Use of Trial Dataset,
http://docs.google.com/View?docID=w.df735kg3_5ckpbv6

(2) Relation 7: Content-Container,
http://docs.google.com/View?docID=w.df735kg3_3gnrv95

(3) Relation 7: Training Data,
http://docs.google.com/View?docID=w.df735kg3_8gt4b4c


Participants and prospective participants in Task 4
are invited to post their comments, questions, and
contributions on the Semantic Relations group at
http://groups.google.com/group/semanticrelations.

The task description is listed below and can also be
found at
http://nlp.cs.swarthmore.edu/semeval/tasks/task4/description.shtml.

The Task 4 organizers

---------------------

SemEval-2007
Task #4: Classification of Semantic Relations between
Nominals

Roxana Girju (1), Marti Hearst (2), Preslav Nakov (3),
Vivi Nastase (4), Stan Szpakowicz (5), Peter Turney (6),
Deniz Yuret (7)

July 28, 2006

1. Department of Linguistics, University of Illinois
at Urbana-Champaign, girju at cs.uiuc.edu
2. School of Information, University of California,
Berkeley, hearst at sims.berkeley.edu
3. Department of Electrical Engineering and Computer
Science, University of California, Berkeley,
nakov at cs.berkeley.edu
4. School of Information Technology and Engineering,
University of Ottawa, vnastase at site.uottawa.ca
5. School of Information Technology and Engineering,
University of Ottawa, szpak at site.uottawa.ca
6. Institute for Information Technology, National
Research Council of Canada,
peter.turney at nrc-cnrc.gc.ca
7. Department of Computer Engineering, Koc University,
dyuret at ku.edu.tr


1. Description of the Task

There is growing interest in the task of classifying
semantic relations between pairs of words. However,
many different classification schemes have been used,
which makes it difficult to compare the various
classification algorithms. We will create a benchmark
dataset and evaluation task that will enable
researchers to compare their algorithms.

Rosario and Hearst (2001) classify noun-compounds from
the medical domain, using a set of 13 classes that
describe the semantic relation between the head noun
and the modifier in a given noun-compound. Rosario et
al. (2002) classify noun-compounds using a multi-level
hierarchy of semantic relations, with 15 classes at
the top level. Nastase and Szpakowicz (2003) present a
two-level hierarchy for classifying noun-modifier
relations in general domain text, with 5 classes at
the top and 30 classes at the bottom. Their class
scheme and dataset have been used by other researchers
(Turney and Littman, 2005; Turney, 2005; Nastase et
al., 2006). Moldovan et al. (2004) use a 35-class
scheme to classify relations in noun phrases. The same
scheme has been applied to noun compounds (Girju et
al., 2005). Chklovski and Pantel (2004) use a 5-class
scheme, designed specifically for characterizing
verb-verb semantic relations. Stephens et al. (2001)
use a 17-class scheme created for relations between
genes. Lapata (2002) uses a 2-class scheme for
classifying relations in nominalizations.

Algorithms for classifying semantic relations have
potential applications in Information Retrieval,
Information Extraction, Summarization, Machine
Translation, Question Answering, Paraphrasing,
Recognizing Textual Entailment, Thesaurus
Construction, Semantic Network Construction, Word
Sense Disambiguation, and Language Modeling. As the
techniques for semantic relation classification
mature, some of these applications are being tested.
Tatu and Moldovan (2005) applied the 35-class scheme
of Moldovan et al. (2004) to the PASCAL Recognizing
Textual Entailment (RTE) challenge, obtaining
significant improvement in a state-of-the-art
algorithm.

There is no consensus on schemes for classifying
semantic relations, and it seems unlikely that any
single scheme could be useful for all applications.
For example, the gene-gene relation scheme of Stephens
et al. (2001) includes relations such as "X
phosphorylates Y", which are not very useful for
general domain text. Even if we focus on general
domain text, the verb-verb relations of Chklovski and
Pantel (2004) are unlike the noun-modifier relations
of Nastase and Szpakowicz (2003) or the noun phrase
relations of Moldovan et al. (2004).

We will create a benchmark dataset for evaluating
semantic relation classification algorithms, embracing
several different existing classification schemes,
instead of attempting the daunting chore of creating a
single unified standard classification scheme. We will
treat each semantic relation separately, as a single
two-class (positive/negative) classification task,
rather than taking a whole N-class scheme of relations
as an N-class classification task (Nastase and
Szpakowicz, 2003).

To constrain the scope of the task, we have chosen a
specific application for semantic relation
classification, relational search (Cafarella et al.,
2006). We describe this application in Section 2. The
application we envision is a kind of search engine
that can answer queries such as "list all X such that
X causes asthma" (Girju, 2001). Given this
application, we have decided to focus on semantic
relations between nominals (i.e., nouns and base noun
phrases, excluding named entities).

The dataset for the task will consist of annotated
sentences. We will select a sample of relation classes
from several different classification schemes and then
gather sentences from the Web using a search engine.
We will manually mark up the sentences, indicating the
nominals and their relations. Algorithms will be
evaluated by their average classification performance
over all of the sampled relations, but we will also be
able to see whether some relations are more difficult
to classify than others, and whether some algorithms
are best suited for certain types of relations.


2. Application Example: Relational Search

For some of the tasks that we mention in Section 1, it
might be argued that semantic relation classification
plays a supporting role, rather than a central role.
We believe that semantic relation classification is
central in relational search. Cafarella et al. (2006)
describe four types of relational search tasks.
Although they focus on relations between named
entities, the same kinds of tasks would be interesting
for nominals. For example, consider the task of making
a list of things that have a given relation with some
constant thing:

    * list all X such that X causes cancer
    * list all X such that X is part of an automobile
engine
    * list all X such that X is material for making a
ship's hull
    * list all X such that X is a type of
transportation
    * list all X such that X is produced from cork
trees

For these kinds of relational search tasks, we do not
need a complete, exhaustive, non-overlapping set of
classes of semantic relations. Each class, such as X
causes Y, can be treated as a single binary
classification problem. Any algorithm that performs
well on the dataset (Section 4) and task (Section 5)
described here should be directly applicable to
relational search applications.


3. Semantic Relations versus Semantic Roles

We should note that classifying semantic relations
between pairs of words is different in several ways
from automatic labeling of semantic roles (Gildea and
Jurafsky, 2002), which was one of the tasks in
Senseval-3. Semantic roles involve frames with many
slots, but our focus is on relations between pairs of
words; semantic roles are centered on verbs and their
arguments, but semantic relations include pairwise
relations between all parts of speech (although we
limit our attention to nominals in this task, to keep
the task manageable); FrameNet currently contains
more than 8,900 lexical units, but none of the schemes
discussed above contain more than 50 classes of
semantic relations.

Each slot in a frame might be considered as a binary
relation, but FrameNet and PropBank do not make a
consistent effort to assign the same labels to similar
slots. In FrameNet, for example, the verb "sell"
("Commerce_sell") has core slots "buyer", "goods", and
"seller", whereas the verb "give" ("Giving") has core
slots "donor", "recipient", and "theme". There is no
matching of the similar slots, although they have very
similar semantic relations:

<sell, buyer> ↔ <give, recipient>

<sell, goods> ↔ <give, theme>

<sell, seller> ↔ <give, donor>

Semantic relation classification schemes generalize
relations across wide groups of verbs (Chklovski and
Pantel, 2004) and include relations that are not
verb-centered (Nastase and Szpakowicz, 2003; Moldovan
et al., 2004). Using the same labels for similar
semantic relations facilitates supervised learning.
For example, a learner that has been trained with
examples of "sell" relations should be able to
transfer what it has learned to "give" relations.


4. Generating Training and Testing Data

Nastase and Szpakowicz (2003) manually labeled 600
noun-modifier pairs. Each pair was assigned one of
thirty possible labels. To facilitate classification,
each word in a pair was also labeled with its part of
speech and its synset number in WordNet. When
classifying relations in noun phrases, Moldovan et al.
(2004) provided their annotators with an example of
each phrase in a sentence. We will include parts of
speech, synset numbers, and a sample sentence for each
pair in our training and testing data.

Consider the noun-modifier pair "silver ship", in
which the head noun "ship" is modified by the word
"silver". Using the classification scheme of Nastase
and Szpakowicz (2003), the semantic relation in this
pair might be classified as material (the ship is made
of silver), purpose (the ship was built for carrying
silver), or content (the ship contains silver). Note
that parts of speech and WordNet synsets are not
sufficient to determine which of these three classes
is intended. If "silver" is labeled "a1" (adjective,
WordNet sense number 1, "made from or largely
consisting of silver") and "ship" is labeled "n1"
(noun, WordNet sense number 1, "a vessel that carries
passengers or freight"), then the correct class must
be material. However, if "silver" is labeled "n1"
(noun, WordNet sense number 1, "a soft white precious
univalent metallic element"), then the correct class
could be either purpose or content. We would represent
this example as follows:

"The <e1>silver</e1> <e2>ship</e2> usually carried
silver bullion bars, but sometimes the cargo was gold
or platinum." WordNet(e1) = "n1", WordNet(e2) = "n1",
Relation(e1, e2) = "content".
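
As an illustration only (not part of the task
materials), the annotation format above can be read
with a few regular expressions. The Python sketch
below assumes one example per string, with the entity
tags and label fields exactly as in the example:

    import re

    # Sketch: read one annotated example (format as in the task description).
    EXAMPLE = ('"The <e1>silver</e1> <e2>ship</e2> usually carried silver '
               'bullion bars, but sometimes the cargo was gold or platinum." '
               'WordNet(e1) = "n1", WordNet(e2) = "n1", '
               'Relation(e1, e2) = "content".')

    def parse_example(text):
        """Return the two nominals, their WordNet labels, and the relation label."""
        e1 = re.search(r"<e1>(.*?)</e1>", text).group(1)
        e2 = re.search(r"<e2>(.*?)</e2>", text).group(1)
        wn1 = re.search(r'WordNet\(e1\) = "([^"]+)"', text).group(1)
        wn2 = re.search(r'WordNet\(e2\) = "([^"]+)"', text).group(1)
        m = re.search(r'Relation\(e1, e2\) (!?=) "([^"]+)"', text)
        positive, relation = (m.group(1) == "="), m.group(2)
        return e1, e2, wn1, wn2, relation, positive

    print(parse_example(EXAMPLE))
    # ('silver', 'ship', 'n1', 'n1', 'content', True)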

We will begin by choosing seven relations from several
different schemes (e.g., Nastase and Szpakowicz, 2003;
Moldovan et al., 2004). We will focus on relations
between nominals. For the purposes of this exercise,
we define a nominal as a noun or base noun phrase,
excluding named entities. A base noun phrase is a noun
and its premodifiers (e.g., nouns, adjectives,
determiners). We do not include complex noun phrases
(e.g., noun phrases with attached prepositional
phrases). For example, "lawn" is a noun, "lawn mower"
is a base noun phrase, and "the engine of the lawn
mower" is a complex noun phrase. The markup will
explicitly identify entity boundaries, so the teams
that attempt this task will not need to worry about
finding entity boundaries (e.g., in "<e1>macadamia
nuts</e1> in the <e2>cake</e2>", we can see that the
first entity is a base noun phrase and the second
entity is a noun; there will be no need for a chunking
parser).

For each of the chosen relations, we will give a
precise definition of the relation and some
prototypical examples. The definitions and examples
will be available to the annotators and will be
included in the distribution of the training and
testing data. 

Given a specific relation (e.g., content), we will use
heuristic patterns to search in a large corpus for
sentences that illustrate the given relation. For
example, for the relation content, we may use Google
to search the Web, using queries such as "contains",
"holds", "the * in the". For each relation, we will
use several different search patterns, to ensure a
wide variety of example sentences. The search patterns
will be manually constructed, using the approach of
Hearst (1992).
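
As a rough illustration of this kind of pattern-based
retrieval (the actual pipeline will query Google; the
sketch below is an assumption that works over a local
list of sentences), wildcard patterns such as "the * in
the" can be turned into regular expressions:

    import re

    # Sketch: "*" in a pattern stands for one or more intervening words;
    # keep any sentence that matches at least one pattern.
    PATTERNS = ["contains", "holds", "the * in the"]

    def to_regex(pattern):
        body = re.escape(pattern).replace(r"\*", r"\w+(?:\s+\w+)*")
        return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

    def candidates(sentences, patterns=PATTERNS):
        compiled = [to_regex(p) for p in patterns]
        return [s for s in sentences if any(rx.search(s) for rx in compiled)]

    sentences = ["The box holds the old letters.",
                 "The macadamia nuts in the cake make it crunchy.",
                 "Summer was over and the climate would only get worse."]
    print(candidates(sentences))
    # the first two sentences match ("holds", "the * in the"); the third does not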

The collected sentences will be given to two
annotators, who will create positive and negative
training examples from the sentences. For example,
here are positive and negative examples for content
(below, "!=" means "does not equal"):

"The <e1>macadamia nuts</e1> in the <e2>cake</e2> also
make it necessary to have a very sharp knife to cut
through the cake neatly." WordNet(e1) = "n2",
WordNet(e2) = "n3", Relation(e1, e2) = "content".

"Summer was over and he knew that the <e1>climate</e1>
in the <e2>forest</e2> would only get worse."
WordNet(e1) = "n1", WordNet(e2) = "n1", Relation(e1,
e2) != "content".

The negative example above would be classified as
location in the scheme of Nastase and Szpakowicz
(2003). The use of heuristic patterns to search for
positive and negative training examples should
naturally result in negative examples that are near
misses. We believe that near misses are more useful
for supervised learning than negative examples that
are generated purely randomly.

Each example will be independently labeled by two
annotators. When the annotation is completed, the
annotators will compare their labels and make a note
of the number of cases in which they agree and
disagree. If the annotators cannot come to a consensus
on the correct labels for a particular example, that
example will not be included in the training and
testing data, although it will be recorded for
possible future analysis.
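
A minimal sketch of this bookkeeping, assuming each
annotator's labels are stored as parallel lists (an
assumed representation, not a prescribed one), might
look as follows; examples without consensus are simply
set aside:

    # Sketch: tally agreement between two annotators and keep only the
    # examples on which they assigned the same label.
    def merge_annotations(labels_a, labels_b):
        agreed = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a == b]
        disagreed = [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]
        agreement_rate = len(agreed) / len(labels_a) if labels_a else 0.0
        return agreed, disagreed, agreement_rate

    a = ["content", "content", "not-content"]
    b = ["content", "not-content", "not-content"]
    print(merge_annotations(a, b))   # ([0, 2], [1], 0.666...)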

This method of generating training and testing data is
designed with relational search in mind (Section 2). A
natural approach to relational search is to use
heuristic patterns (Hearst, 1992) with a conventional
search engine, and then use supervised learning
(Moldovan et al., 2004) to filter the resulting noisy
text.

We will follow the model of the Senseval-3 English
Lexical Sample Task, which had about 140 training and
70 testing samples per word. The following list
summarizes the main features of our dataset:

    * 7 semantic relations (not exhaustive and
possibly overlapping)
    * 140 training sentences per relation (7 × 140 =
980 training sentences)
    * 70 testing sentences per relation (7 × 70 = 490
testing sentences)
    * 210 combined testing and training sentences per
relation (7 × 210 = 1,470 sentences)
    * sentence classes will be approximately 50%
positive and 50% negative (roughly 735 positive and
735 negative, for a total of 1,470 sentences)
    * several different search patterns will be used
for each semantic relation, to avoid biasing the
sample sentences
    * negative examples of a relation will be "near
misses"

Since most of the words in the Senseval-3 English
Lexical Sample Task had more than two senses, we will
have more samples per class (positive and negative)
per relation than the average word in the English
Lexical Sample Task had samples per sense per word.

For each relation, one person will retrieve sample
sentences and two other people will annotate the
sentences. To encourage debate, the three people will
be chosen from three different institutions. A
detailed guide will be prepared, to maximize the
agreement between annotators.


5. Evaluation Methodology

As with the Senseval-3 Lexical Sample tasks, each team
participating in this task will initially have access
only to the training data. Later, the teams will have
access to unlabeled testing data (that is, there will
be WordNet labels, but no Relation labels). The teams
will enter their algorithms' guesses for the labels
for the testing data. When SemEval-1 is over, the
labels for the testing data will be released to the
public.

Algorithms will be allowed to skip examples that they
cannot classify. An algorithm's score for a given
relation will be the F score, the harmonic mean of
precision and recall. Algorithms will be ranked
according to their average F scores for the chosen set
of relations. We will also analyze the results to see
which relations are most difficult to classify. To
assess the effect of varying quantities of training
data, we will ask the teams to submit several sets of
guesses for the labels for the testing data, using
varying fractions of the training data.
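
Under one plausible reading of this scoring (precision
computed over the examples an algorithm attempts,
recall over all test examples, as in the Senseval
lexical sample evaluations), the per-relation F score
could be computed as in the sketch below; the
"positive"/"negative"/None encoding is assumed for
illustration only:

    # Sketch: F score for one relation when skipping is allowed.  A guess is
    # "positive", "negative", or None for a skipped example; precision is
    # taken over attempted examples, recall over all gold examples.
    def f_score(gold, guesses):
        attempted = [(g, s) for g, s in zip(gold, guesses) if s is not None]
        correct = sum(1 for g, s in attempted if g == s)
        precision = correct / len(attempted) if attempted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

    gold    = ["positive", "negative", "positive", "negative"]
    guesses = ["positive", "negative", None,       "positive"]
    print(round(f_score(gold, guesses), 3))   # 0.571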

Some algorithms (e.g., corpus-based algorithms) may
have no use for WordNet annotations. It might also be
argued that the WordNet annotation is not practical in
a real application. Therefore we will ask teams to
indicate, when they submit their answers, whether
their algorithms used the WordNet labels. We will
group the submitted answers into those that used the
WordNet labels and those that did not, and we will
rank the answers in each group separately. Teams will
be allowed to submit both types of answers, if their
algorithms permit it.


6. Copyright

Our collected training and testing data, including all
annotation, will be released under a Creative Commons
License.




