Corpora: ELRA News

Valerie Mapelli mapelli at elda.fr
Tue Apr 11 09:19:47 UTC 2000


[ We apologise for the duplicate posting of this announcement ]
___________________________________________________________
				ELRA
		European Language Resources Association
			       ELRA News 
___________________________________________________________

		     *** ELRA NEW RESOURCES ***

We are happy to announce new resources available via ELRA:

ELRA-S0058 RVG1 (Regional Variants of German 1)
ELRA-S0081 Norwegian SpeechDat(II) FDB-1000
ELRA-S0082 Siemens Synthesis Corpus - SI1000P
ELRA-W0020 PAROLE French Corpus
ELRA-W0022 ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
ELRA-L0033 LusoLEX European Portuguese Lexicon
ELRA-L0034 BrasiLEX Brazilian Portuguese lexicon

A short description of each database is given below.

_______________________________________
ELRA-S0058 RVG1 (Regional Variants of German 1)
_______________________________________

We would like to inform you that the ELRA-S0058 RVG1
has been extended by 421 speakers, recorded through
high-quality microphones. More information about this
database is available on the ELRA Web site.

_______________________________________
ELRA-S0081 Norwegian SpeechDat(II) FDB-1000
_______________________________________

The Norwegian SpeechDat(II) FDB-1000 comprises 1016 
Norwegian speakers (517 males, 499 females) recorded over 
the Norwegian fixed telephone network. The SpeechDat database 
has been collected and annotated by Telenor Research and 
Development. The FDB-1000 database is partitioned into 4 CDs.
The speech databases made within the SpeechDat(II) project were 
validated by SPEX, the Netherlands, to assess their compliance 
with the SpeechDat format and content specifications.
Speech is stored as sequences of 8-bit A-law samples at 8 kHz. 
Each prompted utterance is stored in a separate file. Each signal 
file is accompanied by an ASCII SAM label file which contains the 
relevant descriptive information. A pronunciation lexicon with a 
phonemic transcription in SAMPA is also included.
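
As a rough illustration of the 8-bit, 8 kHz A-law storage described above,
the sketch below expands A-law bytes to linear PCM values using the standard
ITU-T G.711 expansion. It assumes the signal files are raw, headerless A-law
with one byte per sample; the function names are hypothetical and are not
part of the database distribution.

```python
# Sketch: expand 8-bit A-law bytes (ITU-T G.711) to signed linear PCM values.
# Assumption: the input is raw, headerless A-law, one byte per sample.

def alaw_byte_to_pcm(code: int) -> int:
    """Decode one A-law byte to a linear value in [-32256, 32256]."""
    code ^= 0x55                       # undo the bit inversion applied by the encoder
    mantissa = (code & 0x0F) << 4      # 4-bit mantissa, shifted into position
    segment = (code & 0x70) >> 4       # 3-bit segment number (exponent)
    if segment == 0:
        value = mantissa + 8           # linear segment closest to zero
    else:
        value = (mantissa + 0x108) << (segment - 1)
    return value if code & 0x80 else -value   # sign bit set means positive


def decode_alaw(data: bytes) -> list[int]:
    """Decode a whole buffer of A-law bytes."""
    return [alaw_byte_to_pcm(b) for b in data]


def duration_seconds(data: bytes, sample_rate: int = 8000) -> float:
    """Duration of an utterance: one byte per sample at 8 kHz."""
    return len(data) / sample_rate
```

Given the contents of one signal file read in binary mode, decode_alaw
returns the samples as integers and duration_seconds gives the length of
the utterance.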

_______________________________________
ELRA-S0082 Siemens Synthesis Corpus - SI1000P
_______________________________________

The SI1000P recordings were made to provide material for 
high-quality concatenative speech synthesis. The corpus contains 1000 
newspaper sentences read by two professional German broadcasting 
announcers in studio quality, together with the laryngographic signal 
and the glottal pulse stream. Parts of the corpus were labelled and 
segmented phonemically (SAMPA) and prosodically (boundaries + accents).

_______________________________________
ELRA-W0020 PAROLE French Corpus
_______________________________________

The PAROLE French corpus contains a total of 20 093 099 words, drawn
from the following sources:
Miscellaneous (CRATER, MLCC Multilingual and Parallel Corpora): 2 025 964 words
Books: CNRS Editions: 3 267 409 words
Periodicals: CNRS Info, Hermès: 942 963 words
Newspapers: Le Monde, provided by ELRA: 13 856 763 words
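
The breakdown can be checked against the stated total with a short tally.
This is only a sketch; the dictionary keys are informal labels chosen
here, not identifiers used in the corpus.

```python
# Word counts per source, as listed in the announcement.
components = {
    "miscellaneous": 2_025_964,   # CRATER, MLCC Multilingual and Parallel Corpora
    "books": 3_267_409,           # CNRS Editions
    "periodicals": 942_963,       # CNRS Info, Hermès
    "newspapers": 13_856_763,     # Le Monde
}

stated_total = 20_093_099
computed_total = sum(components.values())

# The four components add up exactly to the announced corpus size.
assert computed_total == stated_total
```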

The resulting resources are conformant to the PAROLE format.

_______________________________________
ELRA-W0022 ILSP/ELEFTHEROTYPIA Corpus (Greek corpus)
_______________________________________

This corpus contains approximately 3 million words from the daily 
newspaper ELEFTHEROTYPIA, classified and annotated according to 
the common core PAROLE encoding standard. The corpus is stored 
as SGML files. A subset of the corpus (250,000 words) is 
morpho-syntactically tagged; all the words are also lemmatised and checked. 

_______________________________________
ELRA-L0033 LusoLEX European Portuguese Lexicon
_______________________________________

Multifunctional monolingual lexicon of the European variety of Portuguese, 
consisting of about 61,000 entries (lemmas) and 1,600 corresponding 
inflexion paradigms. The set of entries includes compound words and 
the inflexion paradigms include information regarding enclitics, 
augmentatives and diminutives. Morphological information is encoded 
with maximum granularity and is conformant with the EAGLES recommendations. 

_______________________________________
ELRA-L0034 BrasiLEX Brazilian Portuguese lexicon
_______________________________________

Multifunctional monolingual lexicon of the Brazilian variety of Portuguese, 
consisting of about 65,000 entries (lemmas) and 1,600 corresponding 
inflexion paradigms. The set of entries includes compound words and 
the inflexion paradigms include information regarding enclitics and 
augmentative/diminutive degree. Morphological information is encoded 
with maximum granularity and is conformant with the EAGLES recommendations.

=====================================
For further information, please contact:

     ELRA/ELDA	               Tel  +33 01 43 13 33 33
     55-57 rue Brillat-Savarin         Fax  +33 01 43 13 33 30
     F-75013 Paris, France           E-mail  mapelli at elda.fr

or visit the online catalogue on our Web site:

     http://www.icp.grenet.fr/ELRA/home.html
     or http://www.elda.fr
===================================== 


