[Corpora-List] OPUS v0.2 is available
Jörg Tiedemann
joerg at stp.ling.uu.se
Sat Jul 12 10:06:13 UTC 2003
OPUS is an open-source parallel corpus, available from
http://logos.uio.no/opus/
Version 0.2 of the corpus contains roughly 30 million tokens
in 60 languages. OPUS is sentence-aligned (1830 language pairs),
tokenized, and partly tagged.
The following subcorpora are included:
  OpenOffice.org   ca.  2.5 million words    6 languages
  PHP manuals      ca.  3.2 million words   21 languages
  KDE messages     ca. 20.5 million words   60 languages
  KDE manuals      ca.  3.8 million words   24 languages
More information can be found on the OPUS home page.
---------------------------
Jörg Tiedemann (http://stp.ling.uu.se/~joerg/)
Lars Nygaard (http://folk.uio.no/larsnyg/)
=======================================================================
The following tools have been used (not including standard GNU tools):
* align - sentence aligner (based on Gale & Church, 1993)
* OpenNLP & Grok, Jason Baldridge and Gann Bierner
  http://grok.sourceforge.net/
* TnT - Statistical Part-of-Speech Tagging, Thorsten Brants
  http://www.coli.uni-sb.de/~thorsten/tnt/
* TreeTagger - Decision Tree Tagger, Helmut Schmid
  http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
* ChaSen - Japanese tokenizer + tagger
  http://chasen.aist-nara.ac.jp/
* recode - convert between various character encodings
  http://www.iro.umontreal.ca/contrib/recode/HTML/
* tidy - validate, correct, and pretty-print XML files
  http://www.w3.org/People/Raggett/tidy/
* Uplug - tokenizer, sentence-splitter, XML tools
  http://stp.ling.uu.se/plug/