Corpora: ELRA News

Valerie Mapelli mapelli at elda.fr
Thu Dec 28 10:54:20 UTC 2000


[ We apologise for the duplicate posting of this announcement ]
___________________________________________________________
				ELRA
		European Language Resources Association
			       ELRA News
___________________________________________________________

		     *** ELRA NEW RESOURCES ***

We are happy to announce new resources available via ELRA:

- Telephone Speech Resources
      ELRA-S0090 Polish SpeechDat(E) Database
      ELRA-S0092 Portuguese SpeechDat(II) FDB-4000
- Desktop/Microphone Speech Resources
      ELRA-S0087 BABEL Hungarian Database
      ELRA-S0088 Twin database - TWINDB1
      ELRA-S0089 Albayzin corpus
      ELRA-S0093 IBNC - An Italian Broadcast News Corpus
- Speech Related Resources
      ELRA-S0091 Pronunciation lexicon of British place names,
      surnames and first names
- Written Corpus
      ELRA-W0025 A "scientific" corpus of modern French
      (La Recherche magazine)
- Multilingual Lexicons
      ELRA-M0025 Bilingual English-Russian Russian-English Dictionaries


A short description of each database is given below.
_______________________________________
TELEPHONE SPEECH RESOURCES
_______________________________________
- ELRA-S0090 Polish SpeechDat(E) Database
This database comprises 1000 Polish speakers (488 males,
512 females) recorded over the Polish fixed telephone network.
- ELRA-S0092 Portuguese SpeechDat(II) FDB-4000
This database comprises 4027 Portuguese speakers (1861 males,
2166 females) recorded over the Portuguese fixed telephone network.
_______________________________________
DESKTOP/MICROPHONE SPEECH RESOURCES
_______________________________________
- ELRA-S0087 BABEL Hungarian Database
The BABEL Database is a speech database that was produced by
a research consortium funded by the European Union under the
COPERNICUS programme (COPERNICUS Project 1304).
The Hungarian database consists of:
- the basic "common" set which contains the Many Talker Set (30 males,
30 females), Few Talker Set (4 males, 4 females), Very Few Talker Set
(1 male, 1 female);
- and the extension part: a short description of the Hungarian sound system.
- ELRA-S0088 Twin database - TWINDB1
The Twin database named TWINDB1 includes recordings of 45 French
speakers, consisting of 9 pairs of identical twins (8 males and 10 females)
with similar voices, and 27 other speakers (13 males and 14 females)
including 4 non-twin siblings.
- ELRA-S0089 Albayzin corpus
This corpus consists of 3 sub-corpora of 16 kHz, 16-bit signals,
recorded by 304 Castilian speakers: Phonetic corpus, Geographic corpus,
and "Lombard" corpus.
- ELRA-S0093 IBNC - An Italian Broadcast News Corpus
Produced within the European Commission funded project LRsP&P
(Language Resources Production & Packaging - LE4-8335), the collection
consists of 150 broadcast programs from the RAI, for a total time of about
30 hours, broadcast on 36 different days between 1992 and 1999.
The recordings are down-sampled to 16 kHz, 16 bit, and encoded into the
NIST Sphere PCM format.
_______________________________________
SPEECH RELATED RESOURCES
_______________________________________
- ELRA-S0091 Pronunciation lexicon of British place names, surnames and
first names
This pronunciation lexicon produced within the European Commission funded
project LRsP&P (Language Resources Production & Packaging - LE4-8335)
is an SGML-encoded database. It contains 160,000 entries of British
place-names, surnames and first names. All phonemic transcriptions in the
database are based on the SAMPA phonetic alphabet.
_______________________________________
WRITTEN CORPUS
_______________________________________
- ELRA-W0025 A "scientific" corpus of modern French (La Recherche magazine)
Produced within the European Commission funded project LRsP&P (Language
Resources Production & Packaging - LE4-8335), the corpus contains all articles
published in La Recherche magazine in 1998, including issues 305 (January) to
315 (December), which amounts to 447,244 tokens and 30,238 types. Two
versions are available: the raw data (XML format) and the complete version
(XML and SGML formats).
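
For readers less familiar with the token/type distinction used in the
figures above, the following is a minimal sketch in Python. The tokeniser
(lowercased runs of letters) and the function names are assumptions made
for illustration only; the announcement does not say how the corpus was
actually tokenised, so counts obtained this way may differ from the
published ones.

```python
import re
from collections import Counter


def count_tokens_and_types(text: str) -> tuple[int, int]:
    """Return (number of tokens, number of distinct types) in text.

    Assumption: a token is a lowercased run of letters (accented letters
    included). This is NOT the tokenisation used for ELRA-W0025.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return len(tokens), len(Counter(tokens))


def type_token_ratio(n_tokens: int, n_types: int) -> float:
    """Types divided by tokens; 0.0 for an empty text."""
    return n_types / n_tokens if n_tokens else 0.0


# Illustrative sentence, not taken from the corpus.
sample = "La recherche avance et la science avance"
n_tokens, n_types = count_tokens_and_types(sample)
print(n_tokens, n_types)  # 7 tokens, 5 types

# Ratio implied by the figures quoted in the announcement.
print(round(type_token_ratio(447244, 30238), 3))
```

With the figures quoted above (447,244 tokens and 30,238 types), the
type/token ratio comes out at roughly 0.068.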
_______________________________________
MULTILINGUAL LEXICONS
_______________________________________
- ELRA-M0025 Bilingual English-Russian Russian-English Dictionaries
Produced within the European Commission funded project LRsP&P (Language
Resources Production & Packaging - LE4-8335), these bilingual dictionaries
contain more than 350,000 pairs of words (in tabular form) in XML format:
     1) Russian-English dictionary - more than 130,000 entries
     2) English-Russian dictionary - more than 95,000 entries
Each entry contains: source word (lemma); part of speech of source word;
target word(s) (lemma(s)), grouped by same meaning; part of speech of target
word(s); domain(s).
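
For readers planning to load the dictionaries, the entry layout described
above might be modelled roughly as follows. This is a sketch only: the
class names, field names and the sample entry are hypothetical and do not
reflect the actual XML element names, which are not given in this
announcement.

```python
from dataclasses import dataclass, field


@dataclass
class SenseGroup:
    """Target lemmas that share one meaning, with their part of speech."""
    target_lemmas: list[str]
    target_pos: str


@dataclass
class DictionaryEntry:
    """One entry: source lemma, its part of speech, senses, domains."""
    source_lemma: str
    source_pos: str
    senses: list[SenseGroup] = field(default_factory=list)
    domains: list[str] = field(default_factory=list)


def word_pairs(entry: DictionaryEntry) -> list[tuple[str, str]]:
    """Flatten an entry into (source word, target word) pairs."""
    return [
        (entry.source_lemma, target)
        for sense in entry.senses
        for target in sense.target_lemmas
    ]


# Hypothetical sample entry, not taken from ELRA-M0025.
example = DictionaryEntry(
    source_lemma="bank",
    source_pos="noun",
    senses=[
        SenseGroup(target_lemmas=["банк"], target_pos="noun"),
        SenseGroup(target_lemmas=["берег"], target_pos="noun"),
    ],
    domains=["finance", "geography"],
)
print(word_pairs(example))
```

Grouping targets by meaning, as the description suggests, means that a
single entry can yield several word pairs.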

=====================================
For further information, please contact:

      ELRA/ELDA	               Tel  +33 01 43 13 33 33
      55-57 rue Brillat-Savarin         Fax  +33 01 43 13 33 30
      F-75013 Paris, France           E-mail  mapelli at elda.fr

or visit our Web site:

      http://www.icp.grenet.fr/ELRA/home.html
      or http://www.elda.fr
=====================================


