Corpora: ELRA News

Valerie Mapelli mapelli at elda.fr
Mon Jul 31 13:22:11 UTC 2000


[ We apologise for the duplicate posting of this announcement ]
___________________________________________________________
				ELRA
		European Language Resources Association
			       ELRA News 
___________________________________________________________

		     *** ELRA NEW RESOURCES ***

We are happy to announce new resources available via ELRA:

ELRA-S0034 New Verbmobil databases
ELRA-S0084 SALA Spanish Colombian Database
ELRA-L0042 PAROLE Spanish lexicon

A description of each database is given below.

_______________________________________
ELRA-S0034 Verbmobil 
_______________________________________

This resource consists of spontaneous speech recorded in 
a dialog task (appointment scheduling). The BAS edition of 
the German part is fully labelled and segmented into 
phonemic/phonetic SAM-PA by the MAUS system and partly 
segmented manually.
New corpora available via ELRA (for the complete list, please 
contact ELRA or visit the ELRA or BAS Web sites):
VM CD 30.1 - VM30.1 (BAS edition)
Verbmobil II - German, 58 spontaneous dialogues (33 close mic, 
0 room mic, 25 phone line (GSM) recordings), 3024 turns, 
transliteration (Verbmobil II Format)  

VM CD 31.1 - VM31.1 (BAS edition)
Verbmobil II - American English, 32 spontaneous dialogues 
(32 close mic, 0 room mic, 0 phone line (GSM) recordings), 
2512 turns, transliteration (Verbmobil II Format)  

VM CD 32.1 - VM32.1 (BAS edition)
Verbmobil II - Multilingual, 17 spontaneous dialogues (17 
close mic, 0 room mic, 0 phone line (GSM) recordings), 
992 turns, transliteration (Verbmobil II Format)

_______________________________________
ELRA-S0084 SALA Spanish Colombian Database
_______________________________________

The SALA Spanish Colombian Database comprises 1000 
Colombian speakers (475 males, 525 females) recorded 
over the Colombian fixed telephone network. Corpus design, 
recruiting of speakers, annotation and formatting were done by 
the Universitat Politècnica de Catalunya (UPC). Collection was 
performed at Siemens Colombia. Six speakers repeated the 
same prompt sheet in different calls. This database is 
partitioned into 4 CDs, each of which comprises 300 speaker 
sessions (except for CD 4, with 100 speaker sessions). The 
speech databases made within the SALA project were 
validated by SPEX, the Netherlands, to assess their 
compliance with the SALA format and content specifications.

The speech files are stored as sequences of 8-bit, 8 kHz A-law 
samples and are not compressed, according to the 
specifications of SALA. Each prompt utterance is stored within 
a separate file and has an accompanying ASCII SAM label file.
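
For readers unfamiliar with A-law coding, the following is a minimal 
sketch of how one 8-bit A-law sample expands to a 16-bit linear value, 
following the standard ITU-T G.711 decoding rule. It is not part of 
the SALA distribution. It assumes the speech files are header-less 
streams of one byte per sample, which should be checked against the 
database documentation; the function names are hypothetical.

```python
def alaw_to_linear(byte: int) -> int:
    """Decode one G.711 A-law byte to a signed 16-bit linear PCM value."""
    byte ^= 0x55                       # undo the alternate-bit inversion
    mantissa = (byte & 0x0F) << 4      # 4-bit mantissa
    segment = (byte & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        value = mantissa + 8
    else:
        value = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit marks a positive sample.
    return value if byte & 0x80 else -value


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte stream (assumed header-less) to PCM values."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes, sample_rate: int = 8000) -> float:
    """Duration of a one-byte-per-sample stream at the given sampling rate."""
    return len(data) / sample_rate
```

At 8 kHz and one byte per sample, one second of speech occupies 
8000 bytes, so a file's duration can be estimated from its size alone 
under the same assumption.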

Corpus contents: 
· 6 application words; 
· 1 sequence of 10 isolated digits; 
· 4 connected digits: 1 sheet number (6 digits), 1 telephone 
number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN 
code (6 digits); 
· 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date 
(word style), 1 relative and general date expression; 
· 1 spotting phrase using an application word (embedded); 
· 1 isolated digit; 
· 3 spelled-out words (letter sequences): 1 spelling of surname; 
1 spelling of directory assistance city name; 1 real/artificial 
name for coverage; 
· 1 currency money amount; 
· 1 natural number; 
· 5 directory assistance names: 1 surname (out of 500); 1 city 
of birth / growing up (spontaneous); 1 most frequent city (out of 
500); 1 most frequent company/agency (out of 500); 1 "forename 
surname" (set of 150); 
· 2 questions, including "fuzzy" yes/no: 1 predominantly "yes" 
question, 1 predominantly "no" question; 
· 9 phonetically rich sentences; 
· 2 time phrases: 1 time of day (spontaneous), 1 time phrase 
(word style); 
· 4 phonetically rich words.

The following age distribution has been obtained: 11 speakers 
are below 16 years old, 486 speakers are between 16 and 30, 
305 speakers are between 31 and 45, 163 speakers are between 
46 and 60, and 35 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in SAMPA 
is also included.

_______________________________________
ELRA-L0042 PAROLE Spanish lexicon
_______________________________________

The PAROLE Spanish lexicon follows the standard PAROLE 
architecture, which includes morphological and syntactic layers. 
It includes the most frequent words found in a 1 million word 
corpus, coded according to the PAROLE specifications.

The lexicon contains about 22,000 morphological units, of which 
12,209 are common nouns, 3,367 are verbs and 4,996 are adjectives. 
Closed-class categories are fully covered.

The information associated with each morphological unit concerns 
part-of-speech and subtype, inflection paradigm (with 
morphosyntactic information for the endings organised in about 
132 models), possible stems in relation to the relevant endings, 
and links to the syntactic layer. In the syntactic layer, information 
regarding subcategorisation for verbs and insertion context for 
nouns is encoded following the PAROLE model.

=====================================
For further information, please contact:

     ELRA/ELDA	               Tel  +33 01 43 13 33 33
     55-57 rue Brillat-Savarin         Fax  +33 01 43 13 33 30
     F-75013 Paris, France           E-mail  mapelli at elda.fr

or visit the online catalogue on our Web site:

     http://www.icp.grenet.fr/ELRA/home.html
     or http://www.elda.fr
===================================== 


