11.472, FYI: New Resources/ELRA, Summer: Spoken Lang/Context

The LINGUIST Network linguist at linguistlist.org
Mon Mar 6 04:21:29 UTC 2000


LINGUIST List:  Vol-11-472. Sun Mar 5 2000. ISSN: 1068-4875.

Subject: 11.472, FYI: New Resources/ELRA, Summer: Spoken Lang/Context

Moderators: Anthony Rodrigues Aristar: Wayne State U.<aristar at linguistlist.org>
            Helen Dry: Eastern Michigan U. <hdry at linguistlist.org>
            Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Reviews: Andrew Carnie: U. of Arizona <carnie at linguistlist.org>

Associate Editors:  Martin Jacobsen <marty at linguistlist.org>
                    Ljuba Veselinova <ljuba at linguistlist.org>
                    Scott Fults <scott at linguistlist.org>
                    Jody Huellmantel <jody at linguistlist.org>
                    Karen Milligan <karen at linguistlist.org>

Assistant Editors:  Lydia Grebenyova <lydia at linguistlist.org>
                    Naomi Ogasawara <naomi at linguistlist.org>
                    James Yuells <james at linguistlist.org>

Software development: John H. Remmers <remmers at emunix.emich.edu>
                      Sudheendra Adiga <sudhi at linguistlist.org>
                      Qian Liao <qian at linguistlist.org>

Home Page:  http://linguistlist.org/


Editor for this issue: Lydia Grebenyova <lydia at linguistlist.org>

=================================Directory=================================

1)
Date:  March 2, 2000 14:30:26 +0100
From:  Valerie Mapelli <mapelli at elda.fr>
Subject:  New Resources/ European Lang Resources Association (ELRA)

2)
Date:  Sat, 4 Mar 2000 14:22:33 -0500
From:  Keith Johnson <kjohnson at ling.ohio-state.edu>
Subject:  Summer Session: Spoken Language in Context/ Ohio State University

-------------------------------- Message 1 -------------------------------

Date:  March 2, 2000 14:30:26 +0100
From:  Valerie Mapelli <mapelli at elda.fr>
Subject:  New Resources/ European Lang Resources Association (ELRA)


___________________________________________________________
				ELRA
		European Language Resources Association
			       ELRA News
___________________________________________________________


		     *** ELRA NEW RESOURCES ***


We are happy to announce new resources available via ELRA:

ELRA-W0020 ICE-GB (British English component of the
International Corpus of English)
ELRA-S0077 Telephone Speech Data Collection for Czech
ELRA-S0078 Finnish Speechdat(II) FDB-1000
ELRA-S0079 Finnish Speechdat(II) FDB-4000
ELRA-S0080 Finnish-Swedish Speechdat(II) FDB-1000

A description of each database is given below.

_______________________________________
ELRA-W0020 ICE-GB (British English component of
the International Corpus of English)
_______________________________________

ICE-GB is the British component of the International Corpus
of English (ICE). ICE began in 1990 with the primary aim
of providing material for comparative studies of varieties of
English throughout the world. Twenty centres around the
world are preparing corpora of their own national or regional
variety of English.

ICE-GB is fully grammatically analysed. Like all the ICE
corpora, ICE-GB consists of a million words of spoken and
written English and adheres to the common corpus design.
200 written and 300 spoken texts make up the million words.
Every text is grammatically annotated, allowing complex and
detailed searches across the whole corpus.

ICE-GB contains 83,394 parse trees, including 59,640 in
the spoken part of the corpus.

ICE-GB has been fully checked. Linguists checked it at
several stages of its completion, using both a traditional
`post-checking' strategy and cross-sectional error-based
searches.

ICE-GB is distributed with the retrieval software ICECUP
(the International Corpus of English Corpus Utility Program).
ICECUP supports a variety of query types, including the use
of the parse analyses to construct Fuzzy Tree Fragments to
search the corpus.

_______________________________________
ELRA-S0077 Telephone Speech Data Collection for Czech
_______________________________________

This database contains speech collected in the Czech Republic
during summer 1999. The collection was performed at the
Institute of Radioelectronics of Brno University of
Technology, Faculty of Electrical Engineering and Computer
Sciences (VUT Brno) and at the Department of Circuit
Theory of Czech Technical University in Prague, Faculty of
Electrical Engineering (CVUT Prague) at the request of
Siemens AG, Corporate Technology, Munich. This database
comprises telephone recordings from 1227 speakers (590
males and 637 females) recorded directly over the fixed
telephone network using an ISDN interface.

Speech files are stored as sequences of 8-bit 8 kHz A-law
uncompressed speech samples. Each prompted utterance
is stored within a separate file. Each speech file has an
accompanying ASCII SAM label file according to the
specifications of the SpeechDat project
(URL: http://www.speechdat.com ).
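
As an illustration of the sample format described above, the following
Python sketch expands 8-bit A-law codes to linear samples using the
standard G.711 A-law expansion. It is not part of the ELRA distribution.
It assumes that a speech file is a headerless stream of one A-law byte
per sample at 8000 samples per second; the function names and the file
name mentioned in the comment are hypothetical.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed linear sample."""
    code ^= 0x55                    # undo the even-bit inversion used on the line
    mantissa = (code & 0x0F) << 4   # 4-bit step within the segment
    segment = (code & 0x70) >> 4    # 3-bit segment number
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law a set sign bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(raw: bytes) -> list[int]:
    """Expand a whole buffer of A-law bytes."""
    return [alaw_to_linear(b) for b in raw]


def duration_seconds(raw: bytes, sample_rate: int = 8000) -> float:
    """One byte per sample, so duration is byte count over sample rate."""
    return len(raw) / sample_rate


# Assumed usage: raw = open("utterance.raw", "rb").read()  (hypothetical file name)
samples = decode_alaw(bytes([0xD5, 0x55]))  # the two smallest-magnitude codes
```

The expanded values fit in a signed 16-bit integer, so they can be
written out as ordinary 16-bit linear PCM if a tool requires it.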

Corpus contents: connected digits (prompt sheet number,
telephone number, credit card number); sequences of
isolated digits (5 digits); answers to yes/no questions;
common application words and phrases.

The following age distribution has been obtained: 36
speakers are below 16 years old, 537 speakers are between
16 and 30, 306 speakers are between 31 and 45, 259
speakers are between 46 and 60, 88 speakers are over 60,
and the age of 1 speaker is unknown.
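
The speaker counts quoted for the Czech collection can be cross-checked
with a few lines of Python. The sketch below only restates the figures
given in this announcement; the variable names are illustrative.

```python
# Figures copied from the ELRA-S0077 description.
total_speakers = 1227
sex_counts = {"male": 590, "female": 637}
age_bins = {
    "under 16": 36,
    "16-30": 537,
    "31-45": 306,
    "46-60": 259,
    "over 60": 88,
    "unknown": 1,
}

# Both breakdowns should account for every speaker.
assert sum(sex_counts.values()) == total_speakers
assert sum(age_bins.values()) == total_speakers

# Share of each age group, in percent of all speakers.
shares = {label: round(100 * n / total_speakers, 1) for label, n in age_bins.items()}
for label, pct in shares.items():
    print(f"{label:>9}: {pct}%")
```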

The transcription included in this database is an
orthographic, lexical transcription with a few details that
represent audible acoustic events (speech and non-speech)
present in the corresponding waveform files. SpeechDat
conventions were used in this database.

______________________________________
ELRA-S0078 Finnish Speechdat(II) FDB-1000
ELRA-S0079 Finnish Speechdat(II) FDB-4000
_______________________________________

The Finnish SpeechDat(II) FDB-1000 and FDB-4000
databases comprise respectively 1000 and 4000 Finnish
speakers recorded over the Finnish fixed telephone network.
The SpeechDat database has been collected and annotated
by the Tampere University of Technology's Digital Media
Institute. The speech databases made within the
SpeechDat(II) project were validated by SPEX, the
Netherlands, to assess their compliance with the
SpeechDat format and content specifications.

Speech samples are stored as sequences of 8-bit 8 kHz
A-law. Each prompted utterance is stored in a separate file.
Each signal file is accompanied by an ASCII SAM label file
which contains the relevant descriptive information.
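
For readers who want to inspect the label files programmatically, here
is a minimal Python sketch. It assumes that a SAM label file consists
of text lines of the form "MNEMONIC: value", which should be checked
against the SpeechDat specifications. The sample text is invented for
illustration and is not taken from any ELRA database; the function name
is hypothetical.

```python
def parse_sam_label(text: str) -> dict[str, list[str]]:
    """Collect "MNEMONIC: value" lines into a dict of value lists.

    A mnemonic may occur more than once, so values are kept in a list.
    Lines without a colon are skipped.
    """
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            continue
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# Invented sample; mnemonics and values are illustrative assumptions.
sample = """\
LHD: SAM, 6.0
SAM: 8000
QNT: A-LAW
LBO: 0,17920,35839,yes
"""

fields = parse_sam_label(sample)
print(fields["SAM"])  # sampling-rate field of the invented sample
```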

Each speaker uttered the following items: 1 isolated digit; 1
sequence of 10 isolated digits; 4 numbers: 1 sheet number
(5 digits), 1 telephone number (9-10 digits), 1 credit card
number (16 digits), 1 PIN code (6 digits); 1 currency money
amount; 1 natural number; 3 dates: 1 spontaneous date
(birthdate), 1 prompted date, 1 relative or general date
expression; 2 time phrases: 1 time of day (spontaneous), 1
time phrase; 3 spelled words: 1 spontaneous own forename,
1 city name, 1 phonetically rich word; 5 directory assistance
names: 1 spontaneous own forename, 1 spontaneous city of
growing up, 1 frequent city name, 1 frequent company name,
1 common forename surname; 2 yes/no questions: 1
predominantly "yes" question, 1 predominantly "no" question;
3 application words; 1 word spotting phrase using an
embedded application word; 4 phonetically rich words; 9
phonetically rich sentences.

A pronunciation lexicon with a phonemic transcription in
SAMPA is also included.

______________________________________
ELRA-S0080 Finnish-Swedish Speechdat(II) FDB-1000
______________________________________

The Finnish-Swedish SpeechDat(II) FDB-1000 comprises
1000 Finnish speakers uttering SpeechDat items in the variant
of Swedish spoken in Finland, recorded over the Finnish
fixed telephone network. The SpeechDat database has been
collected and annotated by the Tampere University of
Technology's Digital Media Institute. The FDB-1000
database is partitioned into 4 CDs: 3 CDs each comprise 300
speaker sessions, and the 4th comprises 100 speaker sessions.
The speech databases made within the SpeechDat(II)
project were validated by SPEX, the Netherlands, to assess
their compliance with the SpeechDat format and content
specifications.

Speech samples are stored as sequences of 8-bit 8 kHz
A-law. Each prompted utterance is stored in a separate file.
Each signal file is accompanied by an ASCII SAM label file
which contains the relevant descriptive information.

Each speaker uttered the following items: 1 isolated digit; 1
sequence of 10 isolated digits; 4 numbers: 1 sheet number
(5 digits), 1 telephone number (9-10 digits), 1 credit card
number (16 digits), 1 PIN code (6 digits); 1 currency money
amount; 1 natural number; 3 dates: 1 spontaneous date
(birthdate), 1 prompted date, 1 relative or general date
expression; 2 time phrases: 1 time of day (spontaneous), 1
time phrase; 3 spelled words: 1 spontaneous own forename,
1 city name, 1 phonetically rich word; 5 directory assistance
names: 1 spontaneous own forename, 1 spontaneous city of
growing up, 1 frequent city name, 1 frequent company name,
1 common forename surname; 2 yes/no questions: 1
predominantly "yes" question, 1 predominantly "no" question;
6 application words; 1 word spotting phrase using an
embedded application word; 4 phonetically rich words; 9
phonetically rich sentences.

The following age distribution has been obtained: 178
speakers are below 16 years old, 412 speakers are between
16 and 30, 216 speakers are between 31 and 45, 160
speakers are between 46 and 60, and 34 speakers are over 60.

A pronunciation lexicon with a phonemic transcription in
SAMPA is also included.

=====================================
For further information, please contact:

     ELRA/ELDA                         Tel: +33 01 43 13 33 33
     55-57 rue Brillat-Savarin         Fax: +33 01 43 13 33 30
     F-75013 Paris, France             E-mail: mapelli at elda.fr

or visit our Web site:

     http://www.icp.grenet.fr/ELRA/home.html
     or http://www.elda.fr
=====================================






-------------------------------- Message 2 -------------------------------

Date:  Sat, 4 Mar 2000 14:22:33 -0500
From:  Keith Johnson <kjohnson at ling.ohio-state.edu>
Subject:  Summer Session: Spoken Language in Context/ Ohio State University

Summer 2000 at Ohio State University

Spoken Language in Context: Methods and Models

During July of 2000, the Department of Linguistics at the
Ohio State University will be offering a unique combination
of short courses aimed at exploring spoken language, with a
particular focus on the empirical study of naturally-occurring
speech through various instrumental, quantitative, and analytic
means.  Scholars, researchers (industry or academic), and
students are invited to join us for an intense and rewarding
summer session.

Course offerings:
     Laboratory Phonology - Mary Beckman
     Quantitative Methods - Michael Broe
     Field Phonetics - Keith Johnson
     Historical Phonology - Brian Joseph & Richard Janda
     Practicum in English Intonation - Julia McGory
     The Pragmatics of Focus - Craige Roberts

For more information see the website:
http://ling.ohio-state.edu/SU2000

---------------------------------------------------------------------------
LINGUIST List: Vol-11-472


