Resources: ELRA - Language Resources Catalogue - Update - NEMLAR resources

Thierry Hamon thierry.hamon at LIPN.UNIV-PARIS13.FR
Mon Aug 28 08:04:26 UTC 2006

Date: Fri, 11 Aug 2006 12:34:52 +0200
From: ELDA <info at>

Our apologies if you have received multiple copies of this announcement.

ELRA - Language Resources Catalogue - Update

We are happy to announce the following Arabic resources, produced
within the NEMLAR project. All 3 resources are owned and copyrighted
by the Nemlar Consortium. They are available in our catalogue. To view
all the Language Resources available, you can visit our on-line
catalogue.

*** ELRA-W0042 NEMLAR Written Corpus ***

This corpus consists of about 500,000 words of Arabic text from 13
different categories. The text is provided in 4 different versions:

· Raw text
· Fully vowelized text
· Text with Arabic lexical analysis
· Text with Arabic POS-tags

The database is distributed on 1 ISO 9660 CD-ROM volume.

For more information, see 

*** ELRA-S0219 NEMLAR Broadcast News Speech Corpus ***

The corpus consists of about 40 hours of Arabic broadcast news speech
provided by ELDA (mainly Standard Arabic, from a number of broadcast
companies). Transcriptions follow the Transcriber conventions as used
by ELDA and cover the orthographic, named-entity and speaker/turn
segmentation levels. No phonetic transcription or segmentation is
planned.

The database is distributed on 1 ISO 9660 DVD-ROM volume.

For more information, see

*** ELRA-S0220 NEMLAR Speech Synthesis Corpus ***

The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native
Egyptian speakers (male and female, 35 years old) recorded in a studio
over 2 channels (voice + laryngograph). The data collection and
transcription were performed by RDI (Egypt).

Speech samples are stored at 96 kHz, 24 bits per sample, as signed
integers with the least significant byte first ("lohi" or Intel
format).
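
For readers who want to inspect such samples, here is a minimal Python
sketch of how packed 24-bit, signed, least-significant-byte-first
values can be decoded. It is an illustration only: the function names
are hypothetical, and it assumes the bytes are a headerless stream of
3-byte samples. The announcement does not specify the file layout or
how the two channels are arranged, so the data-rate helper simply
assumes both channels use the same sample format.

```python
def decode_pcm24_le(raw: bytes) -> list[int]:
    """Decode packed 24-bit signed little-endian samples into ints.

    Assumption: `raw` is a headerless stream of 3-byte samples.
    """
    if len(raw) % 3 != 0:
        raise ValueError("byte count is not a multiple of 3")
    return [
        # The least significant byte comes first ("lohi" / Intel order).
        int.from_bytes(raw[i:i + 3], byteorder="little", signed=True)
        for i in range(0, len(raw), 3)
    ]


def bytes_per_second(sample_rate_hz: int = 96_000,
                     bytes_per_sample: int = 3,
                     channels: int = 2) -> int:
    """Uncompressed data rate, assuming every channel uses the same format."""
    return sample_rate_hz * bytes_per_sample * channels


# Three samples: +1, -1, and the most negative 24-bit value.
example = bytes([0x01, 0x00, 0x00, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x80])
samples = decode_pcm24_le(example)
```

Under these assumptions, 96 kHz at 3 bytes per sample over 2 channels
comes to 576,000 bytes per second of uncompressed audio.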

The speakers read 2,032 prompted sentences covering approx. 42,000
words in three categories: transcribed speech (20%), written text
(50%), and constructed phrases (30%).

The database is provided with orthographic, prosodic and phonetic
transcriptions in SAMPA.  All transcriptions were segmented at the
utterance (sentence/command word) level, annotated at the word level
and checked manually. A pronunciation lexicon including 3,589
headwords with phonetics in SAMPA is also available.

The database is distributed on 3 ISO 9660 DVD-ROM volumes.

For more information, see 

For more information on the catalogue, please contact Valérie
Mapelli mailto:mapelli at

Message distributed by the Langage Naturel list <LN at>

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
