Corpora: ELRA News
Valerie Mapelli
mapelli at elda.fr
Mon Jul 31 13:22:11 UTC 2000
[ We apologise for the duplicate posting of this announcement ]
___________________________________________________________
ELRA
European Language Resources Association
ELRA News
___________________________________________________________
*** ELRA NEW RESOURCES ***
We are happy to announce new resources available via ELRA:
ELRA-S0034 New Verbmobil databases
ELRA-S0084 SALA Spanish Colombian Database
ELRA-L0042 PAROLE Spanish lexicon
A description of each database is given below.
_______________________________________
ELRA-S0034 Verbmobil
_______________________________________
This resource consists of spontaneous speech recorded in
a dialog task (appointment scheduling). The BAS edition of
the German part is fully labelled and segmented into
phonemic/phonetic SAM-PA by the MAUS system and partly
segmented manually.
New corpora available via ELRA (for the complete list, please
contact ELRA or visit the ELRA or BAS Web sites):
VM CD 30.1 - VM30.1 (BAS edition)
Verbmobil II - German, 58 spontaneous dialogues (33 close mic,
0 room mic, 25 phone line (GSM) recordings), 3024 turns,
transliteration (Verbmobil II Format)
VM CD 31.1 - VM31.1 (BAS edition)
Verbmobil II - American English, 32 spontaneous dialogues
(32 close mic, 0 room mic, 0 phone line (GSM) recordings),
2512 turns, transliteration (Verbmobil II Format)
VM CD 32.1 - VM32.1 (BAS edition)
Verbmobil II - Multilingual, 17 spontaneous dialogues (17
close mic, 0 room mic, 0 phone line (GSM) recordings),
992 turns, transliteration (Verbmobil II Format)
_______________________________________
ELRA-S0084 SALA Spanish Colombian Database
_______________________________________
The SALA Spanish Colombian Database comprises 1000
Colombian speakers (475 males, 525 females) recorded
over the Colombian fixed telephone network. Corpus design,
recruiting of speakers, annotation and formatting were done by
the Universitat Politècnica de Catalunya (UPC). Collection was
performed at Siemens Colombia. Six speakers repeated the
same prompt sheet in different calls. This database is
partitioned into 4 CDs, each of which comprises 300 speaker
sessions (except for CD 4, with 100 speaker sessions). The
speech databases made within the SALA project were
validated by SPEX, the Netherlands, to assess their
compliance with the SALA format and content specifications.
The speech files are stored as sequences of 8-bit, 8 kHz A-law
speech samples and are not compressed, according to the
specifications of SALA. Each prompt utterance is stored within
a separate file and has an accompanying ASCII SAM label file.
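As an illustration of the storage format described above, the following
Python sketch expands 8-bit A-law codes into 16-bit linear samples using
the standard ITU-T G.711 expansion. It is not part of the SALA
distribution. The function names are hypothetical, and the sketch assumes
that a speech file is a headerless byte stream with one A-law code per
sample, which the announcement does not state explicitly.

```python
SAMPLE_RATE_HZ = 8000  # 8 kHz sampling rate, as stated in the announcement


def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit A-law code (ITU-T G.711) to a signed 16-bit sample."""
    code ^= 0x55                    # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4   # 4-bit mantissa
    segment = (code & 0x70) >> 4    # 3-bit segment number
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # After the inversion, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


# Precompute all 256 expansions once; decoding is then a table lookup.
_ALAW_TABLE = [alaw_to_linear(c) for c in range(256)]


def decode_alaw(data: bytes) -> list[int]:
    """Decode raw A-law bytes (assumed headerless) to linear samples."""
    return [_ALAW_TABLE[b] for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law byte stream, at one byte per sample."""
    return len(data) / SAMPLE_RATE_HZ
```

A file's contents could be passed to decode_alaw() after reading them in
binary mode. If the files turn out to carry a header, it would have to be
skipped first.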
Corpus contents:
· 6 application words;
· 1 sequence of 10 isolated digits;
· 4 connected digits: 1 sheet number (6 digits), 1 telephone
number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN
code (6 digits);
· 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date
(word style), 1 relative and general date expression;
· 1 spotting phrase using an application word (embedded);
· 1 isolated digit;
· 3 spelled-out words (letter sequences): 1 spelling of surname;
1 spelling of directory assistance city name; 1 real/artificial
name for coverage;
· 1 currency money amount;
· 1 natural number;
· 5 directory assistance names: 1 surname (out of 500); 1 city
of birth / growing up (spontaneous); 1 most frequent city (out of
500); 1 most frequent company/agency (out of 500); 1 "forename
surname" (set of 150);
· 2 questions, including "fuzzy" yes/no: 1 predominantly "yes"
question, 1 predominantly "no" question;
· 9 phonetically rich sentences;
· 2 time phrases: 1 time of day (spontaneous), 1 time phrase
(word style);
· 4 phonetically rich words.
The following age distribution has been obtained: 11 speakers
are below 16 years old, 486 speakers are between 16 and 30,
305 speakers are between 31 and 45, 163 speakers are between
46 and 60, and 35 speakers are over 60.
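The speaker figures quoted in this description can be cross-checked with
a few lines of Python. This is only a sketch: the variable and function
names are hypothetical, and the numbers are copied from the text above.

```python
TOTAL_SPEAKERS = 1000

# Figures copied from the database description above.
gender = {"male": 475, "female": 525}
age_bands = {
    "under 16": 11,
    "16-30": 486,
    "31-45": 305,
    "46-60": 163,
    "over 60": 35,
}
sessions_per_cd = [300, 300, 300, 100]  # CDs 1 to 4


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each count as a percentage of the total, to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Each breakdown should account for every speaker exactly once.
assert sum(gender.values()) == TOTAL_SPEAKERS
assert sum(age_bands.values()) == TOTAL_SPEAKERS
assert sum(sessions_per_cd) == TOTAL_SPEAKERS

for band, pct in shares(age_bands).items():
    print(f"{band:>8}: {pct}%")
```

All three breakdowns add up to the 1000 speakers stated for the database.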
A pronunciation lexicon with a phonemic transcription in SAMPA
is also included.
_______________________________________
ELRA-L0042 PAROLE Spanish lexicon
_______________________________________
The PAROLE Spanish lexicon follows the standard PAROLE
architecture, which includes morphological and syntactic layers.
It includes the most frequent words found in a 1 million word
corpus, coded according to the PAROLE specifications.
The lexicon contains about 22,000 morphological units, of which
12,209 are common nouns, 3,367 are verbs and 4,996 are adjectives.
Closed-class categories are fully covered.
The information associated with each morphological unit covers
part-of-speech and subtype, inflection paradigm (with
morphosyntactic information for the endings organised in about
132 models), possible stems in relation to the relevant endings,
and a link to the syntactic layer. In the syntactic layer, information
regarding subcategorisation for verbs and insertion context for
nouns is encoded following the PAROLE model.
=====================================
For further information, please contact:
ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France
Tel: +33 01 43 13 33 33
Fax: +33 01 43 13 33 30
E-mail: mapelli at elda.fr
or visit the online catalogue on our Web site:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================