Resources: ELRA - Language Resources Catalogue - Update

Thierry Hamon thierry.hamon at UNIV-PARIS13.FR
Sat Sep 24 19:03:38 UTC 2011

Date: Wed, 21 Sep 2011 15:23:11 +0200
From: Info <info at>
Message-ID: <4E79E53F.8020800 at>

Our apologies if you have received multiple copies of this announcement.

ELRA - Language Resources Catalogue - Update

ELRA is happy to announce that four new speech resources from the
GlobalPhone corpus are now available in its catalogue.
Moreover, an updated version of the Venice Italian Treebank (VIT) has
also been released.

1) New Language Resources:

The GlobalPhone Corpus: The GlobalPhone corpus was designed to provide
read speech data for the development and evaluation of large continuous
speech recognition systems in the most widespread languages of the
world, and to provide a uniform, multilingual speech and text database
for language independent and language adaptive speech recognition as
well as for language identification tasks. The entire GlobalPhone corpus
enables the acquisition of acoustic-phonetic knowledge of the following
19 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319),
Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian
(ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German
(ELRA-S0198), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish
(ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202),
Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil
(ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese
(ELRA-S0322). In each language, about 100 sentences were read by each
of the 100 speakers. The read texts were selected from national
newspapers available via the Internet to provide a large vocabulary (up to
65,000 words). The read articles cover national and international
political news as well as economic news.

Special prices are offered for a combined purchase of several
GlobalPhone languages (5 languages, 10 languages, 15 languages or 19
languages).

Four new languages are available from the GlobalPhone corpus:
ELRA-S0319 GlobalPhone Bulgarian
  For more information, see:
ELRA-S0320 GlobalPhone Polish
  For more information, see:
ELRA-S0321 GlobalPhone Thai
  For more information, see:
ELRA-S0322 GlobalPhone Vietnamese
  For more information, see:

2) Update of ELRA-W0040 Venice Italian Treebank (VIT):
The new version of VIT has a totally revised constituent-based
representation and a completely new dependency-based representation
which has been achieved by semi-automatic procedures.

The Venice Italian Treebank (VIT) contains about 272,000 words
distributed over six different domains: bureaucratic, political,
economic and financial, literary, scientific, and news. In addition,
some 60,000 tokens of spoken dialogues in different Italian varieties
were annotated.

The annotation follows general X-bar criteria with 29 constituency
labels and 102 PoS tags. VIT is also made available in a broad
annotation version with 10 constituency labels and 22 PoS tags for
machine learning purposes. The format is plain text with square
bracketing. However, a UPenn-style version, readable by the open-source
query language CorpusSearch, is also provided.
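
As a rough illustration of what reading a square-bracketed constituency
tree can look like, here is a minimal Python sketch. It is not the VIT
reader and does not reproduce the VIT file layout: it assumes a
hypothetical bracketing of the form [LABEL child child ...], and the
labels used in the example (S, NP, ART, N, VP, V) are illustrative only,
not taken from the VIT tag set. The names Node, tokenize, parse and
leaves are likewise invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """One constituent: a label plus child constituents or word tokens."""
    label: str
    children: List[Union["Node", str]] = field(default_factory=list)


def tokenize(text: str) -> List[str]:
    # Pad brackets with spaces so a plain split() separates them from words.
    return text.replace("[", " [ ").replace("]", " ] ").split()


def parse(text: str) -> Node:
    """Parse one tree written in the ASSUMED form [LABEL child child ...]."""
    tokens = tokenize(text)
    pos = 0

    def read_node() -> Node:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "[":
            raise ValueError("expected '['")
        pos += 1
        if pos >= len(tokens) or tokens[pos] in ("[", "]"):
            raise ValueError("expected a label after '['")
        node = Node(tokens[pos])
        pos += 1
        # Read children until the bracket that closes this constituent.
        while pos < len(tokens) and tokens[pos] != "]":
            if tokens[pos] == "[":
                node.children.append(read_node())
            else:
                node.children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("missing ']'")
        pos += 1  # consume the closing ']'
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(node: Node) -> List[str]:
    """Return the word tokens of a tree in left-to-right order."""
    words: List[str] = []
    for child in node.children:
        if isinstance(child, Node):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical example tree; labels are illustrative, not VIT's.
tree = parse("[S [NP [ART il] [N governo]] [VP [V decide]]]")
```

With a tree in hand, leaves(tree) recovers the sentence tokens and
tree.children gives the top-level constituents. A real reader for the
distributed files would need to follow the format documented with the
resource itself.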

For more information, see:

Message distributed by the Langage Naturel list <LN at>
Information, subscription:
English version       : 
Archives                 :

The LN list is sponsored by ATALA (Association pour le Traitement
Automatique des Langues)
Information and membership:
