[Corpora-List] Availability of Resources for Concept-based Cross-Lingual Information Retrieval

Paul Buitelaar paulb at dfki.de
Thu Oct 2 13:51:17 UTC 2003


Dear colleague, evaluation resources that were developed within the
EU/NSF funded project MuchMore on Concept-based Cross-Lingual
Information Retrieval in the Medical Domain are now freely available
from the project web site at:

http://muchmore.dfki.de/resources_index.htm

Available resources include: a German-English parallel medical
document collection, corresponding queries and relevance assessments,
evaluation sets of disambiguated terms, and evaluation lists for
morphological decomposition of German medical terms.

The project developed a cross-lingual information retrieval system that
enables users to retrieve documents in English and/or German, given a
query document in English or German. In the current version of the
system, query documents are assumed to be German electronic patient
records and documents to be retrieved are medical scientific abstracts
in both German and English. The cross-lingual information retrieval task
has been approached through a mix of methods: semantic annotation, a
similarity thesaurus, example-based translation, pseudo-relevance
feedback and a vector-space model. Along these lines, three retrieval
systems have been developed that were integrated into a meta-search
engine with a common user interface (including an extensive query
construction functionality) and results presentation (including an
interactive, multidocument summarization functionality).

The MuchMore prototype (as well as the individual retrieval systems and
some additional demos on semantic annotation and term clustering) is
available at:

http://muchmore.dfki.de/demos_index_new.htm

At the core of the MuchMore project has been a comparative evaluation of
the different approaches used for the cross-lingual information
retrieval task. Overall results show that the best performance may be
obtained by a combination of corpus-based and concept-based information,
i.e. using a combination of manually constructed and automatically
extracted (semantic) resources. Adding manually constructed knowledge
(through semantic annotation or classification) improves performance,
although disambiguation has not been shown to further improve
performance significantly.

All results are available as project reports and/or as published papers at:

http://muchmore.dfki.de/pub.htm

Please contact us in case of further questions. Thanks for your time,


   Paul Buitelaar

   Coordinator MuchMore
   DFKI - Language Technology
   Saarbruecken, Germany

   http://muchmore.dfki.de/
   http://dfki.de/~paulb/


