[Corpora-List] C-ORAL-ROM spoken corpus

Thu Jan 20 20:47:37 UTC 2005

The C-ORAL-ROM corpus is available from ELRA/ELDA.

C-ORAL-ROM is a multilingual corpus of spontaneous speech for four
Romance languages (French, Italian, Portuguese, Spanish) of around
1,200,000 words (IST 2000-26228). The corpus consists of four
comparable collections of recorded spontaneous speech sessions in
Italian, French, Portuguese and Spanish (around 300,000 words for each
language). The collections are provided, respectively, by:

    * Università di Firenze (Dipartimento di Italianistica, LABLITA);
    * Université de Provence (DELIC team, Description Linguistique
      Informatisée sur Corpus);
    * Fundação da Universidade de Lisboa/Centro de Linguística da
      Universidade de Lisboa;
    * Universidad Autónoma de Madrid (Departamento de Lingüística,
      Lenguas Modernas, Lógica y F. de la Ciencia, Laboratorio de
      Lingüística Informática).

The C-ORAL-ROM corpus provides the acoustic source of each session
together with the following main annotations:

    * The orthographic transcription, in CHAT format, enriched with the
      tagging of terminal and non-terminal prosodic breaks
    * Session metadata
    * The text-to-speech synchronization, in WIN PITCH CORPUS format,
      based on the alignment of each transcribed utterance. The WIN
      PITCH CORPUS software is provided with the resource.

More details are available in the ELRA/ELDA Catalogue:

http://www.elda.org/catalogue/en/speech/S0172.html

--
Jean Véronis
  Home: http://www.up.univ-mrs.fr/veronis
  Blog: http://aixtal.blogspot.com