[Corpora-List] Availability of Resources for Concept-based Cross-Lingual Information Retrieval
Paul Buitelaar
paulb at dfki.de
Thu Oct 2 13:51:17 UTC 2003
Dear colleague,
evaluation resources that were developed within the
EU/NSF-funded project MuchMore on Concept-based Cross-Lingual
Information Retrieval in the Medical Domain are now freely available
from the project web site at:
http://muchmore.dfki.de/resources_index.htm
Available resources include: a German-English parallel medical
document collection, corresponding queries and relevance assessments,
evaluation sets of disambiguated terms, and evaluation lists for
morphological decomposition of German medical terms.
The project developed a cross-lingual information retrieval system that
enables users to retrieve documents in English and/or German, given a
query document in English or German. In the current version of the
system, query documents are assumed to be German electronic patient
records and documents to be retrieved are medical scientific abstracts
in both German and English. The cross-lingual information retrieval task
has been approached through a mix of methods: semantic annotation, a
similarity thesaurus, example-based translation, pseudo-relevance
feedback, and a vector-space model. Along these lines, three retrieval
systems have been developed and integrated into a meta-search
engine with a common user interface (including extensive query
construction functionality) and results presentation (including
interactive, multi-document summarization functionality).
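As a rough illustration of how a meta-search engine can merge the output of several retrieval systems, here is a minimal Python sketch. It is not the MuchMore implementation: the fusion rule (min-max normalization followed by summing scores, often called CombSUM), the function names, and the document ids and scores are all assumptions made for the example.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Scale one system's scores to [0, 1] so systems are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All documents scored equally; treat them as equally good.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs):
    """Sum normalized scores per document across all systems."""
    fused = defaultdict(float)
    for scores in runs:
        for doc, s in min_max_normalize(scores).items():
            fused[doc] += s
    # Highest fused score first; ties broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from three retrieval systems for one query.
runs = [
    {"doc_a": 2.0, "doc_b": 1.0, "doc_c": 0.0},
    {"doc_b": 0.9, "doc_c": 0.1},
    {"doc_a": 5.0, "doc_b": 5.0},
]
ranking = comb_sum(runs)
for doc, score in ranking:
    print(doc, round(score, 3))
```

A document returned by several systems accumulates score from each of them, so agreement between systems pushes it up the merged list.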
The MuchMore prototype (as well as the individual retrieval systems and
some additional demos on semantic annotation and term clustering) is
available at:
http://muchmore.dfki.de/demos_index_new.htm
At the core of the MuchMore project has been a comparative evaluation of
the different approaches used for the cross-lingual information
retrieval task. Overall results show that the best performance may be
obtained by a combination of corpus-based and concept-based information,
i.e. using a combination of manually constructed and automatically
extracted (semantic) resources. Adding manually constructed knowledge
(through semantic annotation or classification) improves performance,
although disambiguation has not been shown to further improve
performance significantly.
All results are available as project reports and/or as published papers at:
http://muchmore.dfki.de/pub.htm
Please contact us if you have further questions.
Thanks for your time,
Paul Buitelaar
Coordinator MuchMore
DFKI - Language Technology
Saarbruecken, Germany
http://muchmore.dfki.de/
http://dfki.de/~paulb/