[Corpora-List] SemEval-2007 -- Task #11: English Lexical Sample Task via English-Chinese Parallel Text
Ng Hwee Tou
dcsnght at nus.edu.sg
Sat Nov 18 16:53:31 UTC 2006
Task #11: English Lexical Sample Task via English-Chinese Parallel Text
Updated on Nov 15, 2006 (** NEW **)
Call for Interest in Participation
http://www.comp.nus.edu.sg/~chanys/SemEval-2007.htm
http://nlp.cs.swarthmore.edu/semeval/interest.shtml
Feedback requested by Dec 1, 2006
Organizers
Hwee Tou Ng and Yee Seng Chan
National University of Singapore
Summary
We propose an English lexical sample task for word sense
disambiguation (WSD), where the sense-annotated examples are
(semi-)automatically gathered from word-aligned English-Chinese
parallel texts. After appropriate Chinese translations are assigned to
each sense of an English word, the English side of the parallel texts
can serve as training data, since each occurrence of the word is
considered to have been disambiguated and "sense-tagged" by its
aligned Chinese translation.
For more details, please refer to the full description for this task
and the references given.
Full Description
First, English-Chinese parallel texts are automatically
word-aligned. Then the correct Chinese translations corresponding to
the different WordNet 1.7.1 senses of an English word are manually
selected. Finally, the English half of the parallel texts (each
occurrence of the ambiguous English word with its 3-sentence context)
is used as the training and test material for an English lexical
sample task.
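As a rough illustration of the tagging step described above, the
following Python sketch labels each occurrence of a target word with
the sense assigned to its aligned Chinese translation. It is only a
sketch under stated assumptions: the record layout, the function name
tag_by_translation, the example word, its translations, and the sense
labels are all invented for illustration and are not the task's actual
data format or the organizers' code.

```python
from collections import namedtuple

# Hypothetical record: one occurrence of a target English word, the Chinese
# word it was aligned to, and its English context.
AlignedOccurrence = namedtuple(
    "AlignedOccurrence", ["english_word", "chinese_word", "context"]
)


def tag_by_translation(occurrences, translation_to_sense):
    """Label each occurrence with the sense assigned to its aligned translation.

    translation_to_sense maps (english_word, chinese_word) -> sense label.
    Occurrences whose translation was not assigned to any sense are skipped.
    """
    tagged = []
    for occ in occurrences:
        sense = translation_to_sense.get((occ.english_word, occ.chinese_word))
        if sense is not None:
            tagged.append((occ.english_word, sense, occ.context))
    return tagged


# Invented example: manually chosen translations for two senses of "channel".
translation_to_sense = {
    ("channel", "频道"): "channel.tv",
    ("channel", "海峡"): "channel.strait",
}

occurrences = [
    AlignedOccurrence("channel", "频道", "The news channel aired the report."),
    AlignedOccurrence("channel", "海峡", "Ships crossed the channel at dawn."),
    AlignedOccurrence("channel", "通道", "A narrow channel led to the hall."),
]

tagged = tag_by_translation(occurrences, translation_to_sense)
```

In this sketch the third occurrence is dropped because its aligned
translation was not assigned to any sense; how such cases are handled
in the actual datasets is not specified here.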
Since more than one English word sense may be translated by the same
Chinese word, two or more English senses s1, s2, ..., sk that share a
translation may be collapsed into one sense. This gives rise to a
lumped sense (coarser-grained) evaluation.
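The lumping of senses can be sketched as grouping together all senses
that share a Chinese translation. The code below is one possible
reading, not the organizers' procedure: it assumes that lumping is
transitive (if s1 and s2 share a translation and s2 and s3 share
another, all three are merged), and the function name lump_senses and
the placeholder sense and translation names are hypothetical.

```python
def lump_senses(sense_translations):
    """Group senses that share at least one Chinese translation.

    sense_translations: dict mapping sense id -> set of Chinese translations.
    Returns a sorted list of sorted sense-id lists, one per lumped sense.
    """
    # Union-find parent table: every sense starts in its own group.
    parent = {sense: sense for sense in sense_translations}

    def find(sense):
        # Follow parent links to the group representative (with path halving).
        while parent[sense] != sense:
            parent[sense] = parent[parent[sense]]
            sense = parent[sense]
        return sense

    first_sense_for = {}  # translation -> first sense seen with it
    for sense, translations in sense_translations.items():
        for zh in translations:
            if zh in first_sense_for:
                # Shared translation: merge the two groups.
                parent[find(sense)] = find(first_sense_for[zh])
            else:
                first_sense_for[zh] = sense

    groups = {}
    for sense in sense_translations:
        groups.setdefault(find(sense), []).append(sense)
    return sorted(sorted(group) for group in groups.values())


# Placeholder data: s1 and s2 share translation "zh_a", s3 stands alone.
lumped = lump_senses({
    "s1": {"zh_a"},
    "s2": {"zh_a", "zh_b"},
    "s3": {"zh_c"},
})
```

With the placeholder data, s1 and s2 end up in one lumped sense and s3
remains on its own.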
Our past work found that acquiring training examples in this way can
yield sense-tagged data of high quality (at least as good as the
sense-tagged data for nouns collected in the Senseval-3 English
lexical sample task).
The proposed task is thus similar to the multilingual lexical sample
task in Senseval-3, except that the training and test examples are
collected without manually annotating each individual occurrence of
an ambiguous word.
Datasets and Formats (** NEW **)
We have two tracks for this task, each track using a different
corpus. The first corpus is the following English-Chinese parallel
corpus available from the Linguistic Data Consortium (LDC):
LDC2005T10 Chinese English News Magazine Parallel Text
It will be used for the evaluation of 50 English words (25 nouns and
25 adjectives). Participants in this track will need the above LDC
corpus in order to use the training and test material for this track.
Institutions that are LDC members can obtain the corpus by paying
US$150. Institutions that are not LDC members can obtain the corpus by
paying US$2,000.
Since not all interested participants may have access to the above LDC
corpus, the second track of this task will make use of English-Chinese
documents gathered from the URL pairs given by the STRAND Bilingual
Databases. STRAND is a system that acquires document pairs in parallel
translation automatically from the Web. We will be using this corpus
for the evaluation of 40 English words (20 nouns and 20 adjectives).
Participants in this task can choose to participate in one or both
tracks.
Evaluation
The scorer will be the standard Senseval scorer.
Download area
This section will contain evaluation software, useful scripts,
complementary materials, baseline systems, etc. but not the datasets
proper. The datasets will be available at the main site for download.
Systems and Results
This section will be completed after the competition.
References
Chan, Yee Seng & Ng, Hwee Tou (2005). Scaling Up Word Sense
Disambiguation via Parallel Texts. Proceedings of the 20th National
Conference on Artificial Intelligence (AAAI 2005) (pp. 1037-1042).
Pittsburgh, Pennsylvania, USA.
Ng, Hwee Tou, Wang, Bin & Chan, Yee Seng (2003). Exploiting
Parallel Texts for Word Sense Disambiguation: An Empirical
Study. Proceedings of the 41st Annual Meeting of the Association for
Computational Linguistics (ACL-03) (pp. 455-462). Sapporo, Japan.
Resnik, Philip & Smith, Noah A. (2003). The Web as a Parallel
Corpus. Computational Linguistics, Volume 29, Issue 3 (pp. 349-380).