[Corpora-List] A Christmas Present from Lancaster (Part Two)

Mcenery, Tony eiaamme at exchange.lancs.ac.uk
Tue Dec 23 12:20:45 UTC 2003


Dear All,

I am delighted to be able to announce the release of the EMILLE/CIIL
corpus. The corpus contains monolingual written corpus data for 14 South
Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri,
Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). It
also contains orthographically transcribed spoken data and parallel
corpus data for five South Asian languages (Bengali, Gujarati, Hindi,
Punjabi and Urdu). In addition, the parallel corpus contains the English
originals from which the translations stored in the corpus were derived.
All data in the corpus is CES and Unicode compliant. The EMILLE corpus
totals some 94 million words. 

The corpora were built as part of a collaboration between Lancaster
University and the Central Institute of Indian Languages, Mysore. 

In addition to the corpora, the following materials are available for
download from the web-site:

i.) documentation relating to the corpus;
ii.) POS-tagged Urdu corpus data;
iii.) Hindi corpus data in which demonstrative use has been annotated;
iv.) a prototype POS tagger for Urdu.

The corpus can be downloaded from:

http://www.ling.lancs.ac.uk/corplang/emille

More details of the EMILLE project can be found at:

http://www.emille.lancs.ac.uk

The GATE language engineering architecture has also been developed
further by the University of Sheffield to enable language processing
tasks using the EMILLE data. For more details on GATE see:

http://www.gate.ac.uk/

A new release of the EMILLE corpus will be made, indexed for use with
Xara, towards spring 2004.

Apologies if you receive this message more than once.

Regards,

Tony McEnery,
Professor of English Language and Linguistics,
Dept. Linguistics and Modern English Language,
Lancaster University,
Bailrigg,
Lancaster,
LA1 4YT.




