Corpora: ELRA News

Wed Mar 1 13:30:26 UTC 2000

[ We apologise for the duplicate posting of this announcement ]

___________________________________________________________
				ELRA
		European Language Resources Association
			       ELRA News
___________________________________________________________

		     *** ELRA NEW RESOURCES ***

We are happy to announce new resources available via ELRA:

ELRA-W0020 ICE-GB (British English component of the
International Corpus of English)
ELRA-S0077 Telephone Speech Data Collection for Czech
ELRA-S0078 Finnish Speechdat(II) FDB-1000
ELRA-S0079 Finnish Speechdat(II) FDB-4000
ELRA-S0080 Finnish-Swedish Speechdat(II) FDB-1000

A description of each database is given below.

_______________________________________
ELRA-W0020 ICE-GB (British English component of
the International Corpus of English)
_______________________________________

ICE-GB is the British component of the International Corpus
of English (ICE). ICE began in 1990 with the primary aim
of providing material for comparative studies of varieties of
English throughout the world. Twenty centres around the
world are preparing corpora of their own national or regional
variety of English.

ICE-GB is fully grammatically analysed. Like all the ICE
corpora, ICE-GB consists of a million words of spoken and
written English and adheres to the common corpus design.
200 written and 300 spoken texts make up the million words.
Every text is grammatically annotated, allowing complex and
detailed searches across the whole corpus.

ICE-GB contains 83,394 parse trees, including 59,640 in
the spoken part of the corpus.

ICE-GB has been fully checked. Linguists checked it at several
stages of its completion, using both a traditional
‘post-checking’ strategy and cross-sectional error-based
searches.

ICE-GB is distributed with the retrieval software ICECUP
(the International Corpus of English Corpus Utility Program).
ICECUP supports a variety of query types, including the use
of the parse analyses to construct Fuzzy Tree Fragments to
search the corpus.

_______________________________________
ELRA-S0077 Telephone Speech Data Collection for Czech
_______________________________________

This database contains speech collected in the Czech Republic
during summer 1999. The collection was performed at the
Institute of Radioelectronics of Brno University of
Technology, Faculty of Electrical Engineering and Computer
Sciences (VUT Brno) and at the Department of Circuit
Theory of Czech Technical University in Prague, Faculty of
Electrical Engineering (CVUT Prague) at the request of
Siemens AG, Corporate Technology, Munich. The database
comprises telephone recordings from 1227 speakers (590
males and 637 females) recorded directly over the fixed
telephone network using an ISDN interface.

Speech files are stored as sequences of 8-bit 8 kHz A-law
uncompressed speech samples. Each prompted utterance
is stored within a separate file. Each speech file has an
accompanying ASCII SAM label file according to the
specifications of the SpeechDat project
(URL: http://www.speechdat.com).
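
For readers who want to inspect the audio, the following is a minimal Python sketch of expanding 8-bit A-law codes to 16-bit linear samples with the standard G.711 expansion. It assumes the speech files are raw byte streams with no header, which this announcement does not state; the function names are illustrative and do not belong to any SpeechDat tool.

```python
def alaw_to_linear(code):
    """Expand one 8-bit A-law code to a signed 16-bit linear sample (G.711)."""
    code ^= 0x55                      # undo the alternate-bit inversion
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment number
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # After the XOR, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data):
    """Decode a bytes object of A-law codes into a list of linear samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data, sample_rate=8000):
    """Duration of a raw stream, assuming one byte per sample (no header)."""
    return len(data) / sample_rate
```

Under the headerless assumption, the contents of one speech file read in binary mode can be passed directly to decode_alaw.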

Corpus contents: connected digits (prompt sheet number,
telephone number, credit card number); sequences of
isolated digits (5 digits); answers to yes/no questions;
common application words and phrases.

The following age distribution has been obtained: 36
speakers are below 16 years old, 537 speakers are between
16 and 30, 306 speakers are between 31 and 45, 259
speakers are between 46 and 60, 88 speakers are over 60,
and the age of 1 speaker is unknown.
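
As a quick consistency check, the figures quoted in this description can be tallied in a few lines of Python. The counts are copied from the text above; the variable and function names are illustrative only.

```python
# Speaker counts as quoted in the description of ELRA-S0077.
TOTAL_SPEAKERS = 1227
BY_SEX = {"male": 590, "female": 637}
BY_AGE = {
    "under 16": 36,
    "16-30": 537,
    "31-45": 306,
    "46-60": 259,
    "over 60": 88,
    "unknown": 1,
}


def shares(counts):
    """Return each group's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# Both breakdowns should account for every speaker.
assert sum(BY_SEX.values()) == TOTAL_SPEAKERS
assert sum(BY_AGE.values()) == TOTAL_SPEAKERS

for group, pct in shares(BY_AGE).items():
    print(f"{group:>8}: {pct:5.1f}%")
```

Both the sex and the age breakdowns add up to the stated 1227 speakers.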

The transcription included in this database is an
orthographic, lexical transcription with a few details that
represent audible acoustic events (speech and non-speech)
present in the corresponding waveform files. SpeechDat
conventions were used in this database.

______________________________________
ELRA-S0078 Finnish Speechdat(II) FDB-1000
ELRA-S0079 Finnish Speechdat(II) FDB-4000
_______________________________________

The Finnish SpeechDat(II) FDB-1000 and FDB-4000
databases comprise 1000 and 4000 Finnish speakers,
respectively, recorded over the Finnish fixed telephone
network. The databases were collected and annotated
by the Tampere University of Technology's Digital Media
Institute. The speech databases made within the
SpeechDat(II) project were validated by SPEX, the
Netherlands, to assess their compliance with the
SpeechDat format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz
A-law. Each prompted utterance is stored in a separate file.
Each signal file is accompanied by an ASCII SAM label file
which contains the relevant descriptive information.

Each speaker uttered the following items: 1 isolated digit; 1
sequence of 10 isolated digits; 4 numbers: 1 sheet number
(5 digits), 1 telephone number (9-10 digits), 1 credit card
number (16 digits), 1 PIN code (6 digits); 1 currency money
amount; 1 natural number; 3 dates: 1 spontaneous date
(birthdate), 1 prompted date, 1 relative or general date
expression; 2 time phrases: 1 time of day (spontaneous), 1
time phrase; 3 spelled words: 1 spontaneous own forename,
1 city name, 1 phonetically rich word; 5 directory assistance
names: 1 spontaneous own forename, 1 spontaneous city of
growing up, 1 frequent city name, 1 frequent company name,
1 common forename surname; 2 yes/no questions: 1
predominantly “yes” question, 1 predominantly “no” question;
3 application words; 1 word spotting phrase using an
embedded application word; 4 phonetically rich words; 9
phonetically rich sentences.
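
The item list above can be tallied to estimate how many utterance files to expect, since each prompted utterance is stored in a separate file. The sketch below is only an estimate: it assumes every speaker recorded every item exactly once, which the announcement does not guarantee, and the names used are illustrative.

```python
# Items per speaker, as listed for the Finnish FDB-1000 and FDB-4000.
ITEMS_PER_SPEAKER = {
    "isolated digit": 1,
    "sequence of 10 isolated digits": 1,
    "numbers": 4,
    "currency money amount": 1,
    "natural number": 1,
    "dates": 3,
    "time phrases": 2,
    "spelled words": 3,
    "directory assistance names": 5,
    "yes/no questions": 2,
    "application words": 3,
    "word spotting phrase": 1,
    "phonetically rich words": 4,
    "phonetically rich sentences": 9,
}


def total_items(items):
    """Number of prompted items per speaker."""
    return sum(items.values())


def expected_files(items, speakers):
    """Upper-bound estimate: one speech file per item per speaker."""
    return total_items(items) * speakers
```

The list adds up to 40 items per speaker, so expected_files(ITEMS_PER_SPEAKER, 1000) gives 40000 as a rough ceiling for FDB-1000; the actual number of files may be lower if some utterances were missing or discarded.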

A pronunciation lexicon with a phonemic transcription in
SAMPA is also included.

______________________________________
ELRA-S0080 Finnish-Swedish Speechdat(II) FDB-1000
______________________________________

The Finnish-Swedish SpeechDat(II) FDB-1000 comprises
1000 Finnish speakers uttering SpeechDat items in the variant
of Swedish spoken in Finland, recorded over the Finnish
fixed telephone network. The database was collected and
annotated by the Tampere University of Technology's Digital
Media Institute. It is distributed on 4 CDs: 3 CDs hold 300
speaker sessions each, and the 4th holds 100.
The speech databases made within the SpeechDat(II)
project were validated by SPEX, the Netherlands, to assess
their compliance with the SpeechDat format and content
specifications.

Speech samples are stored as sequences of 8-bit 8 kHz
A-law. Each prompted utterance is stored in a separate file.
Each signal file is accompanied by an ASCII SAM label file
which contains the relevant descriptive information.

Each speaker uttered the following items: 1 isolated digit; 1
sequence of 10 isolated digits; 4 numbers: 1 sheet number
(5 digits), 1 telephone number (9-10 digits), 1 credit card
number (16 digits), 1 PIN code (6 digits); 1 currency money
amount; 1 natural number; 3 dates: 1 spontaneous date
(birthdate), 1 prompted date, 1 relative or general date
expression; 2 time phrases: 1 time of day (spontaneous), 1
time phrase; 3 spelled words: 1 spontaneous own forename,
1 city name, 1 phonetically rich word; 5 directory assistance
names: 1 spontaneous own forename, 1 spontaneous city of
growing up, 1 frequent city name, 1 frequent company name,
1 common forename surname; 2 yes/no questions: 1
predominantly “yes” question, 1 predominantly “no” question;
6 application words; 1 word spotting phrase using an
embedded application word; 4 phonetically rich words; 9
phonetically rich sentences.

The following age distribution has been obtained: 178
speakers are below 16 years old, 412 speakers are between
16 and 30, 216 speakers are between 31 and 45, 160
speakers are between 46 and 60, and 34 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in
SAMPA is also included.

=====================================
For further information, please contact:

     ELRA/ELDA	               Tel  +33 01 43 13 33 33
     55-57 rue Brillat-Savarin         Fax  +33 01 43 13 33 30
     F-75013 Paris, France           E-mail  mapelli at elda.fr

or visit our Web site:

     http://www.icp.grenet.fr/ELRA/home.html
     or http://www.elda.fr
=====================================