[Corpora-List] ELRA News 1/2

Magali Jeanmaire duclaux at elda.fr
Wed Apr 16 15:02:33 UTC 2003


****************************************************************
ELRA is happy to announce that new resources are
available in its catalogue of language resources
****************************************************************
You will find below the short descriptions of these new
resources. We invite you to visit the on-line catalogue
on our web site, at http://www.elda.fr or http://www.elra.info,
to get more detailed descriptions.

Please contact us if you would like to get more information.
****************************************************************
Spoken Language Resources:

- S0144 Italian SpeechDat-Car
- S0113 Spoken Dutch Corpus: Release 6

AURORA Databases

- Subset of Italian SpeechDat-Car database (AURORA/CD0003-05)
- Aurora 4a and Aurora 4b databases

****************************************************************
*** S0144 Italian SpeechDat-Car ***
The Italian SpeechDat-Car database contains in-car recordings of 300
Italian speakers, each of whom uttered around 120 read and spontaneous
items. Recordings were made through 5 different channels: 4 in-car
microphones (1 close-talk microphone, 3 far-talk microphones) and
1 channel over the GSM network.

*** S0113 Spoken Dutch Corpus: Release 6 ***
Release 6 of the Spoken Dutch Corpus has been published.
This release includes sound files together with their orthographic
transcripts, as well as various annotations such as POS tagging,
lemmatization and word segmentation.

*** Subset of Italian SpeechDat-Car database (AURORA/CD0003-05) ***
The Aurora project was originally set up to establish a worldwide standard
for the feature extraction software which forms the core of the front-end of
a DSR (Distributed Speech Recognition) system. ETSI formally adopted this
activity as work items 007 and 008. The two work items within ETSI are:
-       ETSI DES/STQ WI007: Distributed Speech Recognition - Front-End Feature
Extraction Algorithm & Compression Algorithm
-       ETSI DES/STQ WI008: Distributed Speech Recognition - Advanced Feature
Extraction Algorithm.

This database is a subset of the Italian SpeechDat-Car database, which was
collected as part of the European Union-funded SpeechDat-Car project.
It contains 2200 Italian connected-digit utterances, divided into training
and testing utterances, in the following noise and driving conditions inside
a car:
- High speed good road
- Low speed rough road
- Stopped with motor running
- Town traffic

*** Aurora 4a & 4b ***
The Aurora project is now releasing a number of list files for performing
training and testing on the Wall Street Journal (WSJ0) data at two sampling
rates: 8 kHz and 16 kHz. The Aurora 4a database is based on the WSJ0 data,
with noise added artificially over a range of signal-to-noise ratios.
It contains both clean and multicondition training sets, and 14 evaluation
sets with different noise types and microphones.
An additional database, Aurora 4b, will be released later; it will contain
noisy versions of the Nov'92 WSJ0 development set.


