[Corpora-List] Call for participation: SemEval task #17 on WSD on specific domains

Tue Jan 19 17:50:54 UTC 2010

-- Apologies for multiple postings --

         Call for participation:

           SemEval task #17 on

         WSD on specific domains

Domain adaptation is a hot issue in Natural Language Processing,
including Word Sense Disambiguation (WSD). WSD systems trained on
general corpora are known to perform worse when moved to specific
domains. The WSD-domain task will offer a testbed for domain-specific
WSD systems and will make it possible to test domain portability
issues, in the framework of SemEval 2010 (http://semeval2.fbk.eu).

There is currently no all-words corpus available for specific domains.
Lexical-sample sense-tagged corpora do exist, but they only cover the
occurrences of a few manually selected words. The all-words corpus of
WSD-domain will make it possible to measure the performance of WSD
systems deployed on domain-specific data under realistic conditions.

The WSD-domain task will produce sizeable all-words corpora on the
environment domain. Texts from ECNC and WWF will be used to build
domain-specific test corpora. The data will be available in a
number of languages: English, Chinese, Dutch and Italian. The sense
inventories will be based on the wordnets of the respective languages.

The test data will comprise three documents (6000-word chunk with
approx. 2000 target words) for each language. The test data will be
annotated by hand using double-blind annotation plus adjudication.
Inter-Tagger Agreement will be measured. No training data will be
provided, but participants are free to use existing hand-tagged
corpora and lexical resources (e.g. SemCor). Background text from
the domain will be provided for unsupervised or semi-supervised
learning.
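
For readers unfamiliar with Inter-Tagger Agreement, here is a minimal
sketch of how agreement between two annotators could be computed. This
call does not say which agreement measure the organizers will use, so
both functions below (plain observed agreement and Cohen's kappa) are
illustrations only. The function names and the sense labels in the
example are hypothetical and are not taken from the task data.

```python
from collections import Counter


def observed_agreement(tags_a, tags_b):
    """Fraction of target words on which both annotators chose the same sense."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same target words")
    if not tags_a:
        raise ValueError("no annotations given")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def cohen_kappa(tags_a, tags_b):
    """Agreement corrected for chance (Cohen's kappa) for two annotators."""
    n = len(tags_a)
    p_o = observed_agreement(tags_a, tags_b)
    count_a = Counter(tags_a)
    count_b = Counter(tags_b)
    # Chance agreement: probability that both pick the same label at random,
    # given each annotator's own label distribution.
    p_e = sum(count_a[s] * count_b[s] for s in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical sense labels for four target words (not real task data).
annotator_a = ["river%1", "bank%2", "plant%1", "plant%2"]
annotator_b = ["river%1", "bank%2", "plant%1", "plant%1"]

ita = observed_agreement(annotator_a, annotator_b)
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"observed agreement: {ita:.2f}, kappa: {kappa:.2f}")
```

In this toy example the annotators disagree on one word out of four.
In practice the disagreements would then go to adjudication, as
described above.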

Instructions for participation:
-------------------------------

At test time, the test documents will be tokenized (segmented in the
case of Chinese), and the organizers will clearly mark the units
(single words, multiwords, components of compounds) that need to be
disambiguated. These will include nouns and verbs.

The sense inventory is based on WordNet and will include all words,
multiwords and senses to be used. Instructions to obtain the relevant
wordnets are available from the website.

The organizers will also provide some untagged background documents,
which participants can use to tune their WSD algorithms.

Steps:
------

1. join the mailing list (http://groups.google.com/group/WSD-domain)
2. register on the SemEval website (http://semeval2.fbk.eu)
3. download the trial data from the SemEval website (http://semeval2.fbk.eu)
4. download the background text from the task website (http://xmlgroup.iit.cnr.it/SemEval2010)
5. get the wordnets of the respective languages (http://xmlgroup.iit.cnr.it/SemEval2010)
6. download the test data from the SemEval website (http://semeval2.fbk.eu)
7. upload the results to the SemEval website (http://semeval2.fbk.eu)

Important Dates:
----------------

- March 26: test data available
- April 2: deadline for submission of results

Contact:
--------

Eneko Agirre e.agirre at ehu.es

Organizers:
-----------

WSD-domain is being developed in the framework of the Kyoto project
http://www.kyoto-project.eu/.

Eneko Agirre and Oier Lopez de Lacalle
University of the Basque Country

Christiane Fellbaum
Princeton University

Chu-Ren Huang
The Hong Kong Polytechnic University

Shu-Kai Hsieh
Academia Sinica

Andrea Marchetti
IIT, CNR

Monica Monachini
ILC, CNR

Piek Vossen
Vrije Universiteit Amsterdam

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora