Corpora: Cooperation needed to develop Dutch IR test collection
Djoerd Hiemstra
hiemstra at cs.utwente.nl
Tue May 9 15:42:18 UTC 2000
APOLOGIES for multiple copies of this message.
Dear NLP/IR researcher,
We are planning to set up a TREC-style information retrieval (IR) test
collection with Dutch data. The collection will consist of newspaper
articles, 40 queries and relevance judgements by real users, and will
be available as a benchmark for future evaluations of Dutch NLP and IR
techniques for tasks like retrieval or filtering. TREC-style IR
collections are created by judging only a part (the pool) of the
document collection for relevance (see: http://trec.nist.gov/ for more
information). A reasonable IR test collection therefore requires that
many different IR systems, following different approaches to IR,
contribute to this document pool. Your cooperation is
needed to make this a success. We plan to include this evaluation in
the Cross-Language Evaluation Forum
(CLEF, http://galileo.iei.pi.cnr.it/DELOS/CLEF/). CLEF is the
follow-up of the successful Cross-Language track of TREC and will
start this year with a document collection consisting of French,
English, German and Italian documents. The CLEF organisation intends
to extend the collection next year with new languages. We hope Dutch
will be one of those. If you are interested in research on a Dutch
collection we encourage you to participate, thereby improving the
quality of the test collection. Next year's evaluation will start in
May 2001 and results have to be submitted in July 2001. Participation
in the monolingual Dutch task is relatively simple and could be done
by students as a design project.
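
To illustrate the pooling idea mentioned above, here is a minimal sketch
in Python. It is not the actual TREC or CLEF procedure: the function name
build_pool, the system names, the document identifiers and the cut-off
depth are all hypothetical and chosen only for illustration. The sketch
takes the ranked result lists that several systems return for one query
and forms the pool as the union of the top-ranked documents of each list.
Only documents in the pool would then be judged for relevance.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Return the union of the top-`depth` documents of every ranked list."""
    pool: Set[str] = set()
    for ranking in runs.values():
        # Only the first `depth` documents of each system enter the pool.
        pool.update(ranking[:depth])
    return pool


# Hypothetical ranked lists for a single query from three systems.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d2", "d8", "d1"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))
```

In this toy example, system_c is the only one that contributes d7 to the
pool. This is why the pool benefits from systems that follow different
approaches: a relevant document that no participating system ranks highly
is never judged at all.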
Chances for success are heavily dependent on the number of groups
interested in Dutch. The following people already informally expressed
their interest in working with a Dutch test collection:
- Keith van Rijsbergen (University of Glasgow)
- Arjen de Vries (CWI, Amsterdam)
- Wessel Kraaij (TNO-TPD, Delft)
- Djoerd Hiemstra (University of Twente)
Expressing your interest at this point will not commit you to anything,
but it will help us in showing that there is enough interest from research
institutes and companies in Dutch as a language to develop and evaluate IR
systems for. Also, if you express your interest, we will keep you informed
of any new developments.
Note: For this year's CLEF (2000), Dutch translations of the topics
will be included in the official topics release of 8 May 2000. So,
already for this year, participants are able to study basic problems
with handling Dutch (like compound analysis) and use or develop
resources like stop lists, stemmers, taggers, parsers, translation
dictionaries, etc. as a preparation for CLEF 2001. We encourage
interested groups to participate already this year.
The easiest way to work on Dutch in CLEF 2000 is by doing a
bilingual task, e.g. using Dutch queries to retrieve English
documents, but the full task (i.e. Dutch --> X) is also possible. If
necessary, we are willing to provide and point out resources and/or
software to groups that are lacking those. For more information,
please contact the people below.
Best regards,
Wessel Kraaij              Djoerd Hiemstra
<kraaij at tpd.tno.nl>     <hiemstra at cs.utwente.nl>
TNO-TPD                    University of Twente