Resources: ELRA new resources

Alexis Nasr alexis.nasr at linguist.jussieu.fr
Mon Oct 8 17:28:47 UTC 2001


************************************************************
ELRA - European Language Resources Association
************************************************************

New resources are available in our catalogue of
Language Resources:

ELRA-S0114      Strange Corpus 10 - SC10 ('Accents II')
ELRA-S0115      SpeechDat-Car
ELRA-W0015      Text Corpus "Le Monde" / year 2000

A description of these three new resources is given
below:

ELRA-S0114      Strange Corpus 10 - SC10 ('Accents II')
A collection of speech in a variety of styles from native and
non-native speakers of German (67 non-native and 3 native
speakers), who produced read texts, digits,
phonetically balanced sentences, stories, spontaneous
speech and dialogue. Transliteration, orthography and canonical
transcription are provided.

ELRA-S0115      SpeechDat-Car
The American English SpeechDat-Car database comprises
314 American English speakers (150 males and 164
females) recorded over the mobile phone network. Each
speaker uttered about 120 items (read and spontaneous).

ELRA-W0015      Text Corpus "Le Monde" / year 2000
The texts for the year 2000 have been added to the collection.

************************************************************
ELRA - European Language Resources Association
************************************************************

A new resource is available in our catalogue of
Language Resources:

ELRA-W0029      Amaryllis Corpus

A description of this new resource is given
below:

Launched at the end of 1995, the AMARYLLIS project
aimed at evaluating information retrieval software for
French text corpora in order to provide a methodology
for the evaluation of other similar tools. AMARYLLIS
was organised by the Institut de l'Information Scientifique
et Technique (INIST) with the support of the Agence
francophone pour l'enseignement supérieur et la
recherche (AUPELF-UREF) and the French Ministère de
l'Education Nationale, de la Recherche et de la Technologie
(MERT).
More specifically, the objective was to create document
corpora, questions and answers, in the framework of the
Action de Recherche Concertée (ARC A1, renamed
Amaryllis: Access to text information in French), in order
to produce work comparable to the United States TREC project.
For more information about the AMARYLLIS project,
please visit the following web site:
http://www.inist.fr/accueil/profran.htm

All corpora are structured as SGML files with ISO Latin character
encoding.
The available corpora were provided by:
-       INIST (Institut de l'Information Scientifique et Technique)
-       OFIL (Observatoire Français et International des Industries de
la Langue)
-       ELRA (European Language Resources Association)

Each provider supplied three types of corpora: text documents,
search topics and answers to these topics in the corresponding
text corpora (with frames of reference for the answers).

1- Text documents in French
The text documents in French comprise:
-       Articles (titles and texts) extracted from the newspaper
"Le Monde"; each batch contains three months of documents,
provided by OFIL (01-01-93/31-03-93, 01-04-93/30-06-93),
-       Titles and summaries of scientific articles covering every
domain from the Pascal bibliographical databases (from 1984
to 1995) and Francis (from 1992 to 1995), provided by INIST.
The tagging of the documents conforms to a simplified version
of a DTD from the TEI, which makes it possible to mark up
the logical structure of the documents.
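
As a rough illustration of working with SGML files in an ISO Latin
encoding, the sketch below decodes raw bytes and pulls out the content
of one element. It assumes the encoding is ISO-8859-1 (Latin-1), and the
element names (DOC, DOCNO, TITLE, TEXT) and the sample document are
hypothetical; the actual tags are those defined by the DTD shipped with
the corpus.

```python
import re

# Hypothetical sample document. The element names are assumptions made
# for illustration, not the tags defined by the AMARYLLIS DTD.
SAMPLE = (
    "<DOC>\n<DOCNO>1</DOCNO>\n<TITLE>Journée d'étude</TITLE>\n"
    "<TEXT>Résumé de l'article.</TEXT>\n</DOC>\n"
).encode("iso-8859-1")


def extract_elements(raw: bytes, tag: str, encoding: str = "iso-8859-1") -> list[str]:
    """Decode raw SGML bytes and return the text of every <tag> element."""
    text = raw.decode(encoding)
    name = re.escape(tag)
    # Non-greedy match so that consecutive elements are kept separate.
    pattern = re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL | re.IGNORECASE)
    return [content.strip() for content in pattern.findall(text)]


print(extract_elements(SAMPLE, "TITLE"))
```

A regular expression is only adequate for flat, explicitly closed
elements such as these; SGML with omitted end tags or nested elements of
the same name would need a real SGML parser.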

2- Multilingual text documents
The multilingual text documents have been provided by ELRA,
and comprise documents in 6 languages (French, English,
Italian, Spanish, German and Portuguese), extracted from the
MLCC parallel corpus, which contains documents translated into
official European languages (from 1992 to 1994). The corpus was
divided into two sub-corpora: written questions (10 million words)
and debates of the European Parliament (5 to 8 million words
per language).

3- Search topics
The topics derive from questions asked by end users, and should
contain all the information necessary to understand
the issue they deal with and to estimate relevance. They comprise
the following items:
-       A domain, which identifies the field of knowledge they belong to,
-       A topic, which is a title defining the subject,
-       A question, which matches the question the user may ask,
-       Complementary information, which gives details on further
documents that should be selected from the corpus,
-       Concepts, which are a set of descriptors used to set the limits
of the search.
The topics were built by OFIL, by documentalists working for
Le Monde who used requests from journalists, and by engineers
responsible for documentation at INIST (experts in their domain)
who used requests from end users. These topics were meant to cover
numerous application fields and to yield a large number of relevant
results in each corpus. The topics were tested on the corpora to
check their relevance; the query may then have had to be modified,
or some further details may have been needed.
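
The topic items listed above can be pictured as a simple record. The
sketch below is one possible in-memory representation; the class name,
the field names and the example values are all hypothetical and are not
taken from the corpus or its SGML tags.

```python
from dataclasses import dataclass, field


@dataclass
class SearchTopic:
    """One search topic. Field names are illustrative, not the corpus tags."""
    number: int                          # topic number, as used in the answer files
    domain: str                          # field of knowledge
    topic: str                           # title defining the subject
    question: str                        # question the user may ask
    complementary_information: str = ""  # details on further documents to select
    concepts: list[str] = field(default_factory=list)  # descriptors limiting the search


# Hypothetical example, invented for illustration only.
example = SearchTopic(
    number=1,
    domain="Environment",
    topic="Water pollution",
    question="What are the main sources of river pollution?",
    complementary_information="Documents on sea pollution are also relevant.",
    concepts=["pollution", "river", "water quality"],
)
print(example.topic, example.concepts)
```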

4- Frames of reference for the answers
The answer files contain, for each numbered topic, the numbers of all
relevant documents. Initial frames of reference for the answers were
established before the participants proceeded to the tests. The answers
were selected by the providers (OFIL and INIST) with appropriate
methodology and tools: they carried out as broad a pre-selection of
documents as possible, based not only on titles and summaries but also
on the key words and classification codes used in the Pascal and
Francis databases. These key words and classification codes cannot be
accessed by the participants. The results (a set of documents) are
sorted manually, so that they match the query as closely as possible.
The initial frames of reference were checked manually by the providers
(INIST and OFIL), using the answers given by the participants. These
answers were collected when the tests were finished. This allowed us to
review and correct the frames of reference for the answers so that
their content is more detailed. The illustration below
shows how the review was performed.

Each of the 4 CDs contains a corpus for the two phases of the two
campaigns which took place.
TrecEval is also provided.
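
To show how a frame of reference is used to score a system's answers,
here is a minimal sketch that computes precision and recall for one
topic. It is not TrecEval and does not reproduce its file formats or
its measures; the function name, the document identifiers and the data
are hypothetical.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of a retrieved list against the relevant set.

    Assumes `retrieved` contains no duplicate document numbers.
    """
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical frame of reference: topic number -> relevant document numbers.
reference = {1: {"d3", "d7", "d9", "d12"}}
# Hypothetical answers returned by one participant for the same topic.
run = {1: ["d3", "d5", "d7", "d8"]}

p, r = precision_recall(run[1], reference[1])
print(p, r)  # 0.5 0.5
```

Two of the four returned documents are relevant and two of the four
relevant documents are found, hence the value 0.5 for both measures.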

=====================================
For further information, please contact:
ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France
Tel.: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30
Email: mapelli at elda.fr
or consult our catalogue at the following address:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================
-------------------------------------------------------------------------
Message distributed by the Langage Naturel list <LN at cines.fr>
Information, subscription: http://www.biomath.jussieu.fr/LN/LN-F/
English version          : http://www.biomath.jussieu.fr/LN/LN/
Archives                 : http://listserv.linguistlist.org/archives/ln.html

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership: http://www.atala.org/
-------------------------------------------------------------------------


