Corpora: ELRA News
Valerie Mapelli
mapelli at elda.fr
Tue Apr 11 09:19:47 UTC 2000
[ We apologise for the duplicate posting of this announcement ]
___________________________________________________________
ELRA
European Language Resources Association
ELRA News
___________________________________________________________
*** ELRA NEW RESOURCES ***
We are happy to announce new resources available via ELRA:
ELRA-S0058 RVG1 (Regional Variants of German 1)
ELRA-S0081 Norwegian SpeechDat(II) FDB-1000
ELRA-S0082 Siemens Synthesis Corpus - SI1000P
ELRA-W0020 PAROLE French Corpus
ELRA-W0022 ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
ELRA-L0033 LusoLEX European Portuguese Lexicon
ELRA-L0034 BrasiLEX Brazilian Portuguese lexicon
A short description of each database is given below.
_______________________________________
ELRA-S0058 RVG1 (Regional Variants of German 1)
_______________________________________
ELRA-S0058 RVG1 has been extended by 421 speakers, recorded
with high-quality microphones. More information about this
database is available on the ELRA Web site.
_______________________________________
ELRA-S0081 Norwegian SpeechDat(II) FDB-1000
_______________________________________
The Norwegian SpeechDat(II) FDB-1000 comprises 1016
Norwegian speakers (517 males, 499 females) recorded over
the Norwegian fixed telephone network. The SpeechDat database
was collected and annotated by Telenor Research and
Development. The FDB-1000 database is partitioned into 4 CDs.
The speech databases made within the SpeechDat(II) project were
validated by SPEX, the Netherlands, to assess their compliance
with the SpeechDat format and content specifications.
Speech is stored as 8-bit A-law samples at a sampling rate of 8 kHz.
Each prompted utterance is stored in a separate file. Each signal
file is accompanied by an ASCII SAM label file which contains the
relevant descriptive information. A pronunciation lexicon with a
phonemic transcription in SAMPA is also included.
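
For readers who want to inspect the signal files, the sketch below
expands 8-bit A-law bytes to 16-bit linear PCM using the standard
ITU-T G.711 expansion. It is an illustration only and not part of the
distribution: the function names are hypothetical, and it assumes the
signal files are headerless, with all descriptive information held in
the accompanying label file. The 8 kHz rate is taken from the
description above.

```python
# Minimal sketch: expand G.711 A-law bytes to 16-bit linear PCM.
# Assumption: the input is raw, headerless A-law data (one byte per sample).

SAMPLE_RATE_HZ = 8000  # sampling rate stated in the database description


def alaw_to_linear(code: int) -> int:
    """Expand one A-law byte (0..255) to a signed 16-bit linear sample."""
    code ^= 0x55                      # undo the even-bit inversion used by A-law
    magnitude = (code & 0x0F) << 4    # 4-bit mantissa
    segment = (code & 0x70) >> 4      # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude += 0x108
        magnitude <<= segment - 1
    # In A-law, a set sign bit denotes a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Expand a whole buffer of A-law bytes to linear samples."""
    return [alaw_to_linear(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """Duration of a raw A-law buffer at 8 kHz, one byte per sample."""
    return len(data) / SAMPLE_RATE_HZ
```

A buffer read from a signal file in binary mode can be passed directly
to decode_alaw; the resulting integers fit in a signed 16-bit range.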
_______________________________________
ELRA-S0082 Siemens Synthesis Corpus - SI1000P
_______________________________________
The SI1000P recordings were made to provide material for high-quality
concatenative speech synthesis. The corpus contains 1000 newspaper
sentences read by two professional German broadcast announcers
in studio quality, together with the laryngographic signal and the glottal
pulse stream. Parts of the corpus were labelled and segmented
phonemically (SAMPA) and prosodically (boundaries + accents).
_______________________________________
ELRA-W0020 PAROLE French Corpus
_______________________________________
The PAROLE French corpus contains a total of 20 093 099 words, drawn
from the following sources:
Miscellaneous (CRATER, MLCC Multilingual and Parallel Corpora): 2 025 964 words
Books (CNRS Editions): 3 267 409 words
Periodicals (CNRS Info, Hermès): 942 963 words
Newspapers (Le Monde, provided by ELRA): 13 856 763 words
The resulting resources conform to the PAROLE format.
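
As a quick check of the figures above, the short sketch below sums the
four component word counts and computes each component's share of the
corpus. The variable names are illustrative; the numbers are copied
from the announcement.

```python
# Word counts per component, as listed in the announcement.
components = {
    "Miscellaneous": 2_025_964,
    "Books": 3_267_409,
    "Periodicals": 942_963,
    "Newspapers": 13_856_763,
}

# Sum of the components; this equals the stated total of 20 093 099 words.
total = sum(components.values())

# Percentage share of each component, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in components.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```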
_____________________________________
ELRA-W0022 ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
_______________________________________
This corpus contains approximately 3 million words from the daily
newspaper ELEFTHEROTYPIA, classified and annotated according to
the common core PAROLE encoding standard. The corpus is distributed
as SGML files. A subset of the corpus (250,000 words) is
morpho-syntactically tagged; all the words are also lemmatised and checked.
_______________________________________
ELRA-L0033 LusoLEX European Portuguese Lexicon
_______________________________________
Multifunctional monolingual lexicon of the European variety of Portuguese,
consisting of about 61,000 entries (lemmas) and 1,600 corresponding
inflexion paradigms. The set of entries includes compound words and
the inflexion paradigms include information regarding enclitics,
augmentatives and diminutives. Morphological information is encoded
with maximum granularity and is conformant with the EAGLES recommendations.
_______________________________________
ELRA-L0034 BrasiLEX Brazilian Portuguese lexicon
_______________________________________
Multifunctional monolingual lexicon of the Brazilian variety of Portuguese,
consisting of about 65,000 entries (lemmas) and 1,600 corresponding
inflexion paradigms. The set of entries includes compound words and
the inflexion paradigms include information regarding enclitics and
augmentative/diminutive degree. Morphological information is encoded
with maximum granularity and is conformant with the EAGLES recommendations.
=====================================
For further information, please contact:
ELRA/ELDA
55-57 rue Brillat-Savarin
F-75013 Paris, France
Tel +33 01 43 13 33 33
Fax +33 01 43 13 33 30
E-mail mapelli at elda.fr
or visit the online catalogue on our Web site:
http://www.icp.grenet.fr/ELRA/home.html
or http://www.elda.fr
=====================================