Corpora: ELRA News
Valerie Mapelli
mapelli at elda.fr
Thu Dec 28 10:54:20 UTC 2000
[ We apologise for the duplicate posting of this announcement ]
___________________________________________________________
ELRA
European Language Resources Association
ELRA News
___________________________________________________________
*** ELRA NEW RESOURCES ***
We are happy to announce new resources available via ELRA:
- Telephone Speech Resources
ELRA-S0090 Polish SpeechDat(E) Database
ELRA-S0092 Portuguese SpeechDat(II) FDB-4000
- Desktop/Microphone Speech Resources
ELRA-S0087 BABEL Hungarian Database
ELRA-S0088 Twin database - TWINDB1
ELRA-S0089 Albayzin corpus
ELRA-S0093 IBNC - An Italian Broadcast News Corpus
- Speech Related Resources
ELRA-S0091 Pronunciation lexicon of British place names,
surnames and first names
- Written Corpus
ELRA-W0025 A "scientific" corpus of modern French
(La Recherche magazine)
- Multilingual Lexicons
ELRA-M0025 Bilingual English-Russian Russian-English Dictionaries
A short description of each database is given below.
_______________________________________
TELEPHONE SPEECH RESOURCES
_______________________________________
- ELRA-S0090 Polish SpeechDat(E) Database
This database comprises 1000 Polish speakers (488 males,
512 females) recorded over the Polish fixed telephone network.
- ELRA-S0092 Portuguese SpeechDat(II) FDB-4000
This database comprises 4027 Portuguese speakers (1861 males,
2166 females) recorded over the Portuguese fixed telephone network.
_______________________________________
DESKTOP/MICROPHONE SPEECH RESOURCES
_______________________________________
- ELRA-S0087 BABEL Hungarian Database
The BABEL Database is a speech database that was produced by
a research consortium funded by the European Union under the
COPERNICUS programme (COPERNICUS Project 1304).
The Hungarian database consists of:
- the basic "common" set, which contains the Many Talker Set (30 males,
30 females), the Few Talker Set (4 males, 4 females) and the Very Few
Talker Set (1 male, 1 female);
- and the extension part: a short description of the Hungarian sound system.
- ELRA-S0088 Twin database - TWINDB1
The Twin database, named TWINDB1, includes recordings of 45 French
speakers: 9 pairs of identical twins (8 males and 10 females)
with similar voices, and 27 other speakers (13 males and 14 females),
including 4 non-twin siblings.
- ELRA-S0089 Albayzin corpus
This corpus consists of 3 sub-corpora of 16 kHz, 16-bit signals,
recorded by 304 Castilian speakers: the Phonetic corpus, the Geographic
corpus and the "Lombard" corpus.
- ELRA-S0093 IBNC - An Italian Broadcast News Corpus
Produced within the European Commission funded project LRsP&P
(Language Resources Production & Packaging - LE4-8335), the collection
consists of 150 broadcast programs from the RAI, for a total time of about
30 hours, broadcast on 36 different days between 1992 and 1999.
The audio was down-sampled to 16 kHz, 16 bit, and encoded into the NIST
Sphere PCM format.
_______________________________________
SPEECH RELATED RESOURCES
_______________________________________
- ELRA-S0091 Pronunciation lexicon of British place names, surnames and
first names
This pronunciation lexicon produced within the European Commission funded
project LRsP&P (Language Resources Production & Packaging - LE4-8335)
is an SGML-encoded database. It contains 160,000 entries of British
place names, surnames and first names. All phonemic transcriptions in the
database are based on the SAMPA phonetic alphabet.
_______________________________________
WRITTEN CORPUS
_______________________________________
- ELRA-W0025 A "scientific" corpus of modern French (La Recherche magazine)
Produced within the European Commission funded project LRsP&P (Language
Resources Production & Packaging - LE4-8335), the corpus contains all articles
published in La Recherche magazine in 1998, including issues 305 (January) to
315 (December), which amounts to 447,244 tokens and 30,238 types. Two
versions are available: the raw data (XML format) and the complete version
(XML and SGML formats).
_______________________________________
MULTILINGUAL LEXICONS
_______________________________________
- ELRA-M0025 Bilingual English-Russian Russian-English Dictionaries
Produced within the European Commission funded project LRsP&P (Language
Resources Production & Packaging - LE4-8335), these bilingual dictionaries
contain more than 350,000 pairs of words (in tabular form) in XML format:
1) Russian-English dictionary - more than 130,000 entries
2) English-Russian dictionary - more than 95,000 entries
Each entry contains: source word (lemma); part of speech of source word;
target word(s) (lemma(s)), grouped by same meaning; part of speech of target
word(s); domain(s).
=====================================
For further information, please contact:
ELRA/ELDA                    Tel     +33 01 43 13 33 33
55-57 rue Brillat-Savarin    Fax     +33 01 43 13 33 30
F-75013 Paris, France        E-mail  mapelli at elda.fr
or visit our Web site:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================