[Corpora-List] MULTEXT-East language resources V3

Tomaz Erjavec tomaz.erjavec at ijs.si
Wed Jun 30 15:08:56 UTC 2004

MULTEXT-East V3: http://nl.ijs.si/ME/V3/

MULTEXT-East resources are a multilingual dataset for language
engineering research and development. This dataset contains, for
Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian,
Resian, Romanian, Russian, Serbian, and Slovene, some or all of the
following resources:
- MULTEXT-East morphosyntactic specifications (free)
- MULTEXT-East morphosyntactic lexica (licence)
- MULTEXT-East morphosyntactically annotated "1984" corpus (licence)
- MULTEXT-East comparable corpus (licence)
- MULTEXT-East parallel speech corpus (free)
- and associated documentation (free).

The resources comply with the EAGLES and TEI recommendations and are
freely available for research use. To get access to the licenced
resources, you need to fill out and submit the on-line licence.

What's new in this edition?
- all corpora now encoded in XML TEI P4
- joins together the resources from Version 1 (1998) and Version 2 (2002)
- adds Serbian annotated "1984" and Resian morphosyntactic specifications
- an updated bibliography
- many errors from previous versions corrected
- and probably some new ones introduced...

Hope you find them useful!

Tomaž Erjavec                   | Dept. of Knowledge Technologies
email: tomaz.erjavec at ijs.si  | Jozef Stefan Institute
www:   http://nl.ijs.si/et/     | Jamova 39, SI-1000, Ljubljana
fax:   (+386 1) 4251 038        | Slovenia
