[Corpora-List] 10 Years of OPUS
Jörg Tiedemann
Jorg.Tiedemann at lingfil.uu.se
Sat Nov 2 18:55:39 UTC 2013
OPUS is a growing collection of parallel corpora for many languages and various domains. The collection becomes pretty big and includes a variety of data sets and tools that are not only useful for statistical machine translation. OPUS has been extended a lot since its first appearance in 2003. Actually the best birthday present would be if anyone would decide to start a mirror of OPUS. Let me know if you are interested.
Here some of the highlights:
- over 150 languages and language variants
- over 5 billion aligned translation units
- downloads in XML/XCES, plain text (Moses/SMT) and TMX
- raw, tokenized and machine-annotated data
- monolingual data sets (for language modeling)
- search interfaces
Some recent news and data sets:
- EUbookshop: a large but noisy corpus (converted from PDF)
- Tatoeba: a small but clean corpus with many languages
- OpenSubtitles2012: an improved version of the 2011 version
- coming soon: OpenSubtitles2013 - an extension of OpenSubtitles2012
- UN, MultiUN, Europarl v7: aligned for all language combinations
- word alignments and phrase tables for the majority of bitexts
The Web Site: http://opus.lingfil.uu.se
More information: http://opus.lingfil.uu.se/trac/wiki
Feedback is very welcome!
And, be nice to our server!
Jörg Tiedemann
**********************************************************************************
Jörg Tiedemann jorg.tiedemann at lingfil.uu.se<mailto:jorg.tiedemann at lingfil.uu.se>
Dep. of Linguistics and Philology http://stp.lingfil.uu.se/~joerg/
Uppsala University tel: +46 (0)18 - 471 1412
Box 635, SE-751 26 Uppsala/Sweden fax: +46 (0)18 - 471 1094
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://listserv.linguistlist.org/pipermail/corpora/attachments/20131102/1718902a/attachment.htm>
-------------- next part --------------
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
More information about the Corpora
mailing list