[Corpora-List] ELRA News

Wed Sep 22 15:17:19 UTC 2004

*********************************************************
ELRA - Language Resources Catalogue - Update
*********************************************************

We are happy to announce that new Written Language
Resources are available in our catalogue.

You will find below their short descriptions. Please
visit our on-line catalogue to get more detailed
information: www.elda.fr and www.elra.info.

*********************************************************
*** ELRA-W0037 The EMILLE/CIIL Corpus ***

The EMILLE/CIIL Corpus consists of monolingual corpora
containing approximately 92,799,000 words for 14 South Asian
languages (Assamese, Bengali, Gujarati, Hindi, Kannada,
Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil,
Telugu and Urdu), including 2,627,000 words of transcribed spoken
data for Bengali, Gujarati, Hindi, Punjabi and Urdu, and a parallel
corpus of 200,000 words in English with translations in Hindi,
Bengali, Punjabi, Gujarati and Urdu. The Urdu monolingual and
parallel corpora are annotated for parts of speech, and 20 written
Hindi corpus files are annotated to show the nature of demonstrative
use. All other components are annotated at the sentence level. The
corpus is marked up in CES-compliant SGML and encoded in Unicode.

*** ELRA-W0038 The EMILLE Lancaster Corpus ***

The EMILLE Lancaster Corpus consists of monolingual corpora
containing approximately 58,880,000 words for seven South Asian
languages (Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu),
including 2,627,000 words of transcribed spoken data for Bengali,
Gujarati, Hindi, Punjabi and Urdu, and a parallel corpus of 200,000
words in English with translations in Hindi, Bengali, Punjabi, Gujarati
and Urdu. The Urdu monolingual and parallel corpora are annotated for
parts of speech, and 20 written Hindi corpus files are annotated to show
the nature of demonstrative use. All other components are annotated at
the sentence level. The corpus is marked up in CES-compliant SGML and
encoded in Unicode.

*** ELRA-W0039 The Lancaster Corpus of Mandarin Chinese (LCMC) ***

The Lancaster Corpus of Mandarin Chinese (LCMC) samples 15 written
text categories, including news, literary texts, academic prose and
official documents, published in P. R. China in the early 1990s, for a
total of approximately 1 million words. LCMC uses the same sampling
frame and period as FLOB/FROWN. The corpus is encoded in Unicode (UTF-8)
and marked up in XML.

*********************************************************

---------------------------------------------------------------------------
ELRA / ELDA

55-57, rue Brillat-Savarin
75013 Paris FRANCE
Tel: (+33) 1 43 13 33 33 / Fax: (+33) 1 43 13 33 30
URL: http://www.elra.info or http://www.elda.fr

LREC 2004 conference: www.lrec-conf.org/lrec2004/
LangTech forum: http://www.lang-tech.org
---------------------------------------------------------------------------