Call: Shared task on the 'lexical access problem', CogALex-IV Workshop

Wed Mar 26 21:16:16 UTC 2014

Date: Tue, 25 Mar 2014 22:48:41 +0800
From: zock <zock at free.fr>
Message-ID: <53319749.90508 at free.fr>
X-url: http://wacky.sslmit.unibo.it/doku.php?id=corpora
X-url: http://pageperso.lif.univ-mrs.fr/~michael.zock/CogALex-IV/cogalex-webpage/index.html

SHARED TASK ON THE LEXICAL ACCESS PROBLEM (COMPUTING ASSOCIATIONS WHEN
GIVEN MULTIPLE STIMULI)

In the framework of the 4th Workshop on Cognitive Aspects of the Lexicon
(CogALex) to be held at COLING 2014, we invite participation in a shared
task devoted to the problem of lexical access in language production,
with the aim of providing a quantitative comparison between different
systems.

MOTIVATION

The quality of a dictionary depends not only on coverage, but also on
the accessibility of the information. That is, a crucial point is
dictionary access.  Access strategies vary with the task (text
understanding vs. text production) and the knowledge available at the
very moment of consultation (words, concepts, speech sounds). Unlike
readers who look for meanings, writers start from them, searching for
the corresponding words. While paper dictionaries are static, permitting
only limited strategies for accessing information, their electronic
counterparts promise dynamic, proactive search via multiple criteria
(meaning, sound, related words) and via diverse access
routes. Navigation takes place in a huge conceptual lexical space, and
the results are displayable in a multitude of forms (e.g. as trees, as
lists, as graphs, or sorted alphabetically, by topic, by frequency).

To bring some structure into this multitude of possibilities, the shared
task will concentrate on a crucial subtask, namely multiword
association.  What we mean by this in the context of this workshop is
the following. Suppose we were looking for a word expressing the
following ideas: 'superior dark coffee made of beans from Arabia', but
could not remember the intended word 'mocha' due to the
tip-of-the-tongue problem. Since people always remember something
concerning the elusive word, it would be nice to have a system that
accepts this kind of input and then proposes a number of candidates for
the target word.  Given the above example, we might enter 'dark', 'coffee',
'beans', and 'Arabia', and the system would be supposed to come up with
one or several associated words such as 'mocha', 'espresso', or
'cappuccino'.

TASK DEFINITION

The participants will receive lists of five given words (primes) such as
'circus', 'funny', 'nose', 'fool', and 'fun' and are supposed to compute
the word which is most closely associated with all of them. In this case,
the word 'clown' would be the expected response. Here are some more
examples:

 given words: gin, drink, scotch, bottle, soda
 target word: whisky

 given words: wheel, driver, bus, drive, lorry
 target word: car

 given words: neck, animal, zoo, long, tall
 target word: giraffe

 given words: holiday, work, sun, summer, abroad
 target word: vacation

 given words: home, garden, door, boat, chimney
 target word: house

 given words: blue, cloud, stars, night, high
 target word: sky
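
As a minimal illustration of the task (a sketch only, not a method
prescribed by the organizers), one could sum the association strength
between each given word and every candidate, and return the candidate
with the highest total. The association table, its weights, and the
function name below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy association strengths (given word -> {candidate: weight}).
# All numbers are invented for illustration; a real system would estimate
# them from data.
TOY_ASSOCIATIONS = {
    "circus": {"clown": 0.6, "tent": 0.3},
    "funny": {"clown": 0.3, "joke": 0.5},
    "nose": {"clown": 0.2, "face": 0.6},
    "fool": {"clown": 0.3, "idiot": 0.4},
    "fun": {"clown": 0.1, "game": 0.5},
}


def rank_candidates(primes, associations):
    """Rank candidates by summed association strength over all primes."""
    scores = defaultdict(float)
    for prime in primes:
        for candidate, weight in associations.get(prime, {}).items():
            # A given word is never proposed as its own target.
            if candidate not in primes:
                scores[candidate] += weight
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


primes = ["circus", "funny", "nose", "fool", "fun"]
ranking = rank_candidates(primes, TOY_ASSOCIATIONS)
print(ranking[0][0])
```

Summing favours candidates that are linked to many of the given words
over candidates linked strongly to only one of them; other combination
schemes (e.g. products or ranks) are equally conceivable.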

We will provide a training set of 2000 sets of five input words
(multiword stimuli), together with the expected target words
(associative responses). The participants will have about five weeks to
train their systems on this data.  After the training phase, we will
release a test set containing another 2000 sets of five input words, but
without providing the expected target words.

Participants will have five days to run their systems on the test data,
thereby predicting the target words. For each system, we will compare
the results to the expected target words and compute an accuracy. The
participants will be invited to submit a paper describing their approach
and their results.
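
The accuracy computation could look roughly like the sketch below. It
assumes one predicted word per test item and exact matching after
lowercasing and trimming whitespace; the precise matching rules used
for the official evaluation are not specified here, so this is an
assumption, as is the function name.

```python
def accuracy(predicted, expected):
    """Fraction of items whose predicted word equals the expected word.

    Matching ignores case and surrounding whitespace (an assumption;
    the official matching rules may differ).
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return hits / len(expected)


expected = ["whisky", "car", "giraffe", "sky"]
predicted = ["whisky", "car", "zebra", "Sky"]
print(accuracy(predicted, expected))
```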

For the participating systems, we will distinguish two categories:

(1) Unrestricted systems. They can use any kind of data to compute their
    results.

(2) Restricted systems: These systems are only allowed to draw on the
    freely available ukWaC corpus in order to extract information on
    word associations.  The ukWaC corpus comprises about 2 billion words
    and can be downloaded from
    http://wacky.sslmit.unibo.it/doku.php?id=corpora.

Participants are allowed to compete in either category or in both.
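
For the restricted category, one conceivable way of extracting
association information from a corpus is to count how often two words
co-occur within a small window. The sketch below does this for
already tokenized sentences. The window size, the function name, and
the two toy sentences are assumptions made for illustration; nothing
here reflects the actual ukWaC file format.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric co-occurrences of distinct words.

    Two tokens co-occur if they are at most `window` positions apart
    within the same sentence.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right; both directions are recorded below.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                other = tokens[j]
                if other != word:
                    counts[(word, other)] += 1
                    counts[(other, word)] += 1
    return counts


# Toy stand-in for a tokenized corpus.
toy_corpus = [
    ["the", "clown", "has", "a", "red", "nose"],
    ["the", "circus", "clown", "was", "funny"],
]
counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("circus", "clown")], counts[("clown", "funny")])
```

Raw counts favour frequent words, so in practice they would usually
be turned into an association measure before being used for ranking.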

VENUE

The shared task will take place as part of the CogALex workshop which is
co-located with COLING 2014 (Dublin). The workshop date is August 23,
2014.  Shared task participants who wish to have a paper published in
the workshop proceedings will be required to present their work at the
workshop.

SHARED TASK SCHEDULE

Training data release: March 27, 2014
Test data release: May 5, 2014
Final results due: May 9, 2014
Deadline for paper submission: May 25, 2014
Reviewers' feedback: June 15, 2014
Camera-ready version: July 7, 2014
Workshop date: August 23, 2014

FURTHER INFORMATION

CogALex workshop website: http://pageperso.lif.univ-mrs.fr/~michael.zock/CogALex-IV/cogalex-webpage/index.html
Data releases: To be found on the above workshop website from the dates
given in the schedule.
Registration for the shared task: Send e-mail to Michael Zock, with
Reinhard Rapp in copy.

WORKSHOP ORGANIZERS

Michael Zock (LIF-CNRS, Marseille, France), michael.zock AT
lif.univ-mrs.fr
Reinhard Rapp (University of Aix Marseille, France, and Mainz, Germany),
reinhardrapp AT gmx.de
Chu-Ren Huang (The Hong Kong Polytechnic University, Hong Kong),
churen.huang AT inet.polyu.edu.hk

-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription: http://www.atala.org/article.php3?id_article=48
English version       : 
Archives                 : http://listserv.linguistlist.org/archives/ln.html
                                http://liste.cines.fr/info/ln

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership: http://www.atala.org/

ATALA accepts no responsibility for the content of
messages distributed on the LN list
-------------------------------------------------------------------------