Corpora: ELRA News

Fri Feb 1 16:17:19 UTC 2002

[Our apologies if you receive multiple copies of this announcement]

************************************************************
ELRA - European Language Resources Association
************************************************************

We are pleased to announce some new resources
available in our catalogue of language resources:

S0119 Spanish SpeechDat Database for the Mobile Telephone Network
W0032 Modern French Corpus including Anaphors Tagging
W0033 CRATER 2

A short description of these three new resources is given
below. Please visit the online catalogue to get further details:
http://www.elda.fr/catalog.html

S0119 Spanish SpeechDat Database for the Mobile Telephone Network
***********************************************************************************
The Spanish SpeechDat database for the mobile telephone network
comprises 1066 Spanish speakers (526 males, 540 females) calling
from GSM telephones and recorded over the fixed PSTN using an
ISDN-BRI interface. The database was produced by Applied Technologies
in Language and Speech S.L. (Spain). The MDB-1000 database is
partitioned into 6 CDs in ISO 9660 format. This database follows the
specifications given in the framework of the SpeechDat(II) project.
Speech samples are stored as sequences of 8-bit 8 kHz A-law.
Each prompted utterance is stored in a separate file. Each signal file
is accompanied by an ASCII SAM label file which contains the relevant
descriptive information.
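
As an illustration of the signal format, the sketch below expands 8-bit
A-law codes into 16-bit linear samples following the standard G.711
A-law expansion. It assumes the signal files are raw, headerless byte
streams with one A-law code per sample; this is an assumption, not
something stated in this announcement. The function names are
hypothetical and are not part of any SpeechDat tooling.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz sampling rate, as stated in the description


def alaw_to_linear(code):
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                    # undo the alternate-bit inversion of A-law
    mantissa = (code & 0x0F) << 4   # 4-bit quantisation value, scaled
    segment = (code & 0x70) >> 4    # 3-bit segment number
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return value if code & 0x80 else -value


# Precompute all 256 codes once so that decoding is a table lookup.
ALAW_TABLE = [alaw_to_linear(c) for c in range(256)]


def decode_alaw(data):
    """Decode raw A-law bytes (assumed headerless) to a list of samples."""
    return [ALAW_TABLE[b] for b in data]


def duration_seconds(data):
    """Duration of a raw signal: one byte per sample at 8 kHz."""
    return len(data) / SAMPLE_RATE_HZ


if __name__ == "__main__":
    demo = bytes([0xD5, 0x55, 0xAA, 0x2A])
    print(decode_alaw(demo))  # [8, -8, 32256, -32256]
```

With one byte per sample at 8 kHz, a file of 16000 bytes would hold
2 seconds of speech under the same assumption.
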
Each speaker uttered the following items:
·       2 isolated digits.
·       1 sequence of 10 isolated digits.
·       4 connected digits: 1 sheet number (6 digits), 1 telephone number
(9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits).
·       3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date
(word style), 1 relative and general date expression.
·       1 word spotting phrase using an application word (embedded).
·       6 application words.
·       3 spelled words: 1 spontaneous name (own forename), 1 city
name, 1 real / artificial word for coverage.
·       1 currency money amount.
·       1 natural number.
·       6 directory assistance names: 1 surname (set of 500), 1 city of
birth / growing up, 1 most frequent cities (set of 500), 1 most frequent
company / agency (set of 500), 1 ‘forename surname’ (set of 150), 1
spontaneous forename.
·       2 questions including ‘fuzzy’ yes / no: 1 predominantly ‘Yes’ question,
1 predominantly ‘No’ question.
·       9 phonetically rich sentences.
·       2 time phrases: 1 time of day (spontaneous), 1 time phrase
(word style).
·       4 phonetically rich words.
·       Call environment.
The following age distribution has been obtained: 5 speakers are below 16
years old, 543 speakers are between 16 and 30, 307 speakers are
between 31 and 45, 202 speakers are between 46 and 60, 9 speakers are
over 60. A pronunciation lexicon with a phonemic transcription in SAMPA is
also included.
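
The speaker figures quoted above can be cross-checked with a few lines
of arithmetic. The sketch below only restates the counts given in this
description; the dictionary keys are labels chosen for illustration.

```python
TOTAL_SPEAKERS = 1066

# Counts as quoted in the description above.
by_sex = {"male": 526, "female": 540}
by_age = {
    "under 16": 5,
    "16-30": 543,
    "31-45": 307,
    "46-60": 202,
    "over 60": 9,
}

# Both breakdowns should account for every speaker.
sex_total = sum(by_sex.values())
age_total = sum(by_age.values())

# Share of each age band, in percent of all speakers.
age_share = {band: 100 * n / TOTAL_SPEAKERS for band, n in by_age.items()}

if __name__ == "__main__":
    print(sex_total == TOTAL_SPEAKERS, age_total == TOTAL_SPEAKERS)
    for band, share in age_share.items():
        print(f"{band:>8}: {share:.1f}%")
```

Both breakdowns sum to 1066, so the sex and age figures are consistent
with the stated number of speakers.
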

W0032 Modern French Corpus including Anaphors Tagging
********************************************************************
The corpus, which includes the tagging of anaphors, was created by
the CRISTAL-GRESEC team (Stendhal-Grenoble 3 University, France)
and XRCE (Xerox Research Centre Europe, France) in the framework of
the call launched by the DGLF-LF (national institution for the French
language and the languages spoken in France) for the creation of modern
French corpora.
Over 1 million words have been annotated. The corpora have been selected
so that they represent a wide sampling of the French language (scientific
and human science articles, extracts from newspapers and magazines,
legal texts, etc.) and according to the points of interest of the teams working
on the project. The processed corpora supplied by ELRA are listed below:
-       Two books edited by the CNRS: La protection des oeuvres scientifiques
en droit d'auteur français, Xavier Strubel. Paris, CNRS Editions, 1997 (77 591
words) and Cinquante ans de traction à la SNCF. Enjeux politiques, économiques
et réponses techniques, Clive Lamming. Paris, CNRS Editions, 1997 (124 990
words).
-       204 articles extracted from CNRS Info, a magazine which contains short
popular scientific articles from the CNRS laboratories (201 280 words).
-       14 articles dealing with Hermès Human Sciences (111 886 words).
-       136 articles extracted from "Le Monde", dealing with economics (roughly
180 760 words).
-       13 booklets of the Official Journal of the European Communities
(roughly 337 000 words).

The tagged anaphoric elements are listed below:
-       Person pronouns: 3rd person pronoun, anaphoric.
-       Possessive determiners: 3rd person possessive determiner.
-       Demonstrative pronouns: anaphoric pronouns (celui, celle, ceux,
celles-ci, celles-là).
-       Indefinite pronouns: Aucun(e), chacun(e), certain(e)s, l'un(e),
les un(e)s, tout(es), etc., when they are anaphoric.
-       "Proverbs": "le" + "faire".
-       Anaphoric and cataphoric adverbs: Dessus, dedans, dessous, when
they have an anaphoric function.
-       Ellipsis of head nouns: Ellipsis with nominal adjectives or
quantifier determiners.
-       Textual headers like "ce dernier": Ce dernier, le premier, etc.
The annotation scheme was defined in XML format. The texts were divided
into sections, paragraphs (<p>) and sentences (<s>). The sentence
segmentation was carried out with NLP tools developed by XRCE; the
annotation was done manually by two qualified linguists. A large subset
of anaphoric phrases was automatically pre-annotated. The antecedents and
the tagging of the anaphoric relations were processed manually, with
editing tools (emacs, macros from Author/Editor software) used to make
the work easier. 5% of the corpora were evaluated to check the annotation
reliability.
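
To show how the paragraph (<p>) and sentence (<s>) segmentation might be
read, here is a minimal sketch using Python's standard
xml.etree.ElementTree module. Only the <p> and <s> element names come
from the description above. The <text> root element and the sample
sentences are made up for illustration and are not taken from the
corpus. The element names used for sections and for anaphor tags are not
given in this announcement, so they are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <p> and <s> are named in the description.
SAMPLE = """<text>
  <p>
    <s>Le rapport est paru hier.</s>
    <s>Il décrit le corpus.</s>
  </p>
  <p>
    <s>Ce dernier compte un million de mots.</s>
  </p>
</text>"""


def count_units(xml_text):
    """Return (number of paragraphs, number of sentences) in the text."""
    root = ET.fromstring(xml_text)
    paragraphs = root.findall(".//p")
    sentences = root.findall(".//s")
    return len(paragraphs), len(sentences)


def sentences_per_paragraph(xml_text):
    """Return the number of <s> children of each <p>, in document order."""
    root = ET.fromstring(xml_text)
    return [len(p.findall("s")) for p in root.iter("p")]


if __name__ == "__main__":
    print(count_units(SAMPLE))              # (2, 3)
    print(sentences_per_paragraph(SAMPLE))  # [2, 1]
```
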

W0033 CRATER 2
**********************
The CRATER corpus was built upon the foundations of an earlier project,
ET10/63, which was funded in the final phase of the Eurotra programme.
The Corpus Resources and Terminology Extraction project (MLAP-93 20)
extended the bilingual annotated English-French International
Telecommunications Union corpus produced within ET10/63 to include Spanish.
The CRATER 2 corpus was produced by the Department of Linguistics & Modern
English Language, Lancaster University (United Kingdom) with funding from
ELRA. The ELRA funding in turn was provided by the European Commission
project LRsP&P (Language Resources Production & Packaging - LE4-8335).
This project has enhanced the CRATER corpus, available under the reference
ELRA-W0003 in the ELRA catalogue. CRATER 2 significantly expands
the English/French component of the parallel corpus, from 1,000,000
words per language to approximately 1,500,000 tokens per language.
CRATER 2 is sold with CRATER in a single package.

=====================================
For further information, please contact:

ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France

Tel: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30

E-mail mapelli at elda.fr

or visit our Web site:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================