Corpora: ELRA news 2/2

Fri Oct 5 07:38:54 UTC 2001

[Our apologies if you receive multiple copies of this announcement]

************************************************************
ELRA - European Language Resources Association
************************************************************

A new resource is available in our catalogue of
Language Resources:

ELRA-W0029      Amaryllis Corpus

A description of this new resource is given
below:

Launched at the end of 1995, the AMARYLLIS project
aimed at evaluating information retrieval software for
French text corpora in order to provide a methodology
for the evaluation of other similar tools. AMARYLLIS
was organised by the Institut de l'Information Scientifique
et Technique (INIST) with the support of the Agence
francophone pour l'enseignement supérieur et la
recherche (AUPELF-UREF) and the French Ministère de
l'Education Nationale, de la Recherche et de la Technologie
(MERT).
More specifically, the objective was to create document
corpora, questions and answers, in the framework of the
Action de Recherche Concertée (ARC A1, renamed
Amaryllis: Access to text information in French), in order
to produce work comparable to the United States TREC project.
For more information about the AMARYLLIS project,
please visit the following web site:
http://www.inist.fr/accueil/profran.htm

All corpora are structured as SGML files with ISO Latin character
encoding.
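
As a minimal sketch of what the encoding implies for users of the files,
the Python snippet below decodes raw bytes as ISO-8859-1. This is an
assumption: the announcement only says "ISO Latin", and the function name
and sample bytes are hypothetical, not taken from the corpus.

```python
def decode_corpus_bytes(raw: bytes, encoding: str = "iso-8859-1") -> str:
    """Decode raw corpus bytes into text.

    ISO-8859-1 (Latin-1) is an assumed reading of "ISO Latin";
    pass another encoding name if the files turn out to differ.
    """
    return raw.decode(encoding)


# Hypothetical sample: the byte 0xE9 is "é" in ISO-8859-1.
sample = b"recherche et enseignement sup\xe9rieur"
text = decode_corpus_bytes(sample)
print(text)
```

Reading a file with open(path, encoding="iso-8859-1") gives the same
result without a separate decoding step.
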
The available corpora were provided by:
-       INIST (Institut de l'Information Scientifique et Technique)
-       OFIL (Observatoire Français et International des Industries de
la Langue)
-       ELRA (European Language Resources Association)

Each provider supplied three types of corpora: text documents,
search topics and answers to these topics in the corresponding
text corpora (with frames of reference for the answers).

1- Text documents in French
The text documents in French comprise:
-       Articles (titles and texts) extracted from the newspaper
"Le Monde"; each batch contains three months of documents,
provided by OFIL (01-01-93/31-03-93, 01-04-93/30-06-93),
-       Titles and summaries of scientific articles covering every
domain from the Pascal bibliographical databases (from 1984
to 1995) and Francis (from 1992 to 1995), provided by INIST.
The tagging of the documents conforms to a simplified version
of a TEI DTD, which makes it possible to represent
the logical structure.

2- Multilingual text documents
The multilingual text documents were provided by ELRA
and comprise documents in 6 languages (French, English,
Italian, Spanish, German and Portuguese), extracted from the
MLCC parallel corpus, which contains documents translated into
official European languages (from 1992 to 1994). The corpus is
divided into two sub-corpora: written questions (10 million words)
and debates of the European Parliament (5 to 8 million words
per language).

3- Search topics
The topics derive from questions asked by end users, and should
contain all the information necessary to understand
the issue they deal with and to estimate relevance. They comprise
the following items:
-       A domain: the field of knowledge the topic belongs to,
-       A topic: a title defining the subject,
-       A question: the question the user may ask,
-       Complementary information: details on further documents
that should be selected from the corpus,
-       Concepts: a set of descriptors used to set the limits of the
search.
The topics were built by OFIL, by documentalists working for
Le Monde who used requests from journalists, and by engineers responsible
for documentation at INIST (experts in their domain) who used requests from
end users. The topics were intended to cover numerous application fields and
to yield a large number of relevant results in each corpus. They were tested
on the corpora to check their relevance; where necessary, the query was
modified or further details were added.
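
The items that make up a topic can be pictured as a simple record. The
Python sketch below is an illustration only: the class, its field names
and the sample values are hypothetical, and it does not reproduce the
SGML markup actually used on the CDs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SearchTopic:
    """One search topic, with the five items listed in the announcement."""
    number: int  # assumed identifier; the answer files refer to numbered topics
    domain: str  # field of knowledge
    topic: str   # title defining the subject
    question: str
    complementary_information: str = ""
    concepts: List[str] = field(default_factory=list)

    def missing_items(self) -> List[str]:
        """Return the names of the items that are still empty."""
        items = {
            "domain": self.domain,
            "topic": self.topic,
            "question": self.question,
            "complementary_information": self.complementary_information,
            "concepts": self.concepts,
        }
        return [name for name, value in items.items() if not value]


# Invented sample values, for illustration only.
example = SearchTopic(
    number=1,
    domain="Environment",
    topic="River pollution",
    question="Which documents discuss the pollution of rivers?",
    concepts=["pollution", "river"],
)
print(example.missing_items())
```

A check such as missing_items() mirrors the requirement that a topic carry
all the information needed to judge relevance.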

4- Frames of reference for the answers
The answer files contain, for each numbered topic, the numbers of all relevant
documents. Initial frames of reference for the answers were established before
the participants carried out the tests. The answers were selected by the
providers (OFIL and INIST) with an appropriate methodology and adequate tools:
they made a pre-selection of documents that was as broad as possible, based not
only on titles and summaries but also on the key words and classification codes
used in the Pascal and Francis databases. These key words and classification
codes cannot be accessed by the participants. The results (a set of documents)
were then sorted manually so that they match the query as closely as possible.
The initial frames of reference were checked manually by the providers (INIST
and OFIL) against the answers given by the participants, which were collected
once the tests were finished. This made it possible to review and correct the
frames of reference for the answers and to make their content more detailed.
The illustration below shows how the review was performed.

The 4 CDs each contain a corpus for the two phases of the two campaigns
that took place.
TrecEval is also provided.
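
To illustrate how frames of reference are typically used in evaluation,
the Python sketch below computes precision and recall for one topic. It
is not the TrecEval implementation and does not read the answer files
(their format is not described here); the function name, the in-memory
structures and the document identifiers are all hypothetical.

```python
from typing import List, Set, Tuple


def precision_recall(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall for a single topic.

    retrieved: document numbers returned by a system (duplicates ignored)
    relevant:  document numbers listed in the frame of reference
    """
    returned = set(retrieved)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented data: topic number -> relevant documents / system output.
reference = {1: {"D3", "D7", "D9", "D12"}}
run = {1: ["D3", "D4", "D7", "D20"]}

p, r = precision_recall(run[1], reference[1])
print(f"topic 1: precision={p:.2f} recall={r:.2f}")
```

In this invented case two of the four returned documents are relevant and
two of the four relevant documents are found, so both measures are 0.50.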

=====================================
For further information, please contact:
ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France
Tel.: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30
Email: mapelli at elda.fr
or consult our catalogue at the following address:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================