[Corpora-List] OPUS v0.2 is available

Jörg Tiedemann joerg at stp.ling.uu.se
Sat Jul 12 10:06:13 UTC 2003


OPUS is an open-source parallel corpus, available from
http://logos.uio.no/opus/

Version 0.2 of the corpus contains roughly 30 million tokens
in 60 languages. OPUS is sentence-aligned (1830 language pairs),
tokenized, and partly tagged.

The following subcorpora are included:
   OpenOffice.org    ca.  2.5 million words     6 languages
   PHP manuals       ca.  3.2 million words    21 languages
   KDE messages      ca. 20.5 million words    60 languages
   KDE manuals       ca.  3.8 million words    24 languages
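
As a quick sanity check, the subcorpus sizes listed above can be added up
to see how they relate to the overall figure of roughly 30 million. The
following is only a minimal sketch: the dictionary and the function names
are made up for illustration and are not part of OPUS or its tools, and
the sizes are the approximate values from the list, in millions of words.

```python
# Approximate subcorpus sizes from the list above:
# name -> (millions of words, number of languages).
# The structure and names here are illustrative assumptions only.
SUBCORPORA = {
    "OpenOffice.org": (2.5, 6),
    "PHP manuals": (3.2, 21),
    "KDE messages": (20.5, 60),
    "KDE manuals": (3.8, 24),
}


def total_million_words(subcorpora):
    """Sum the approximate sizes, rounded to one decimal place."""
    return round(sum(size for size, _ in subcorpora.values()), 1)


def largest_subcorpus(subcorpora):
    """Return the name of the subcorpus with the most words."""
    return max(subcorpora, key=lambda name: subcorpora[name][0])


if __name__ == "__main__":
    total = total_million_words(SUBCORPORA)
    print(f"total: ca. {total} million words")
    print(f"largest subcorpus: {largest_subcorpus(SUBCORPORA)}")
```

With the figures as listed, the sum comes to 30.0 million words, in line
with the overall size given above, and the KDE messages make up the
largest share.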

More information can be found on the OPUS home page.


                      ---------------------------
                      Jörg Tiedemann (http://stp.ling.uu.se/~joerg/)
                      Lars Nygaard (http://folk.uio.no/larsnyg/)



=======================================================================

The following tools were used (in addition to standard GNU tools):

* align - sentence aligner (based on Gale & Church, 1993)
* OpenNLP & Grok, Jason Baldridge and Gann Bierner
  http://grok.sourceforge.net/
* TnT - Statistical Part-of-Speech Tagging, Thorsten Brants
  http://www.coli.uni-sb.de/~thorsten/tnt/
* TreeTagger - Decision Tree Tagger, Helmut Schmid
  http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
* ChaSen - Japanese tokenizer + tagger
  http://chasen.aist-nara.ac.jp/
* recode - convert between various character encodings
  http://www.iro.umontreal.ca/contrib/recode/HTML/
* tidy - validate, correct, and pretty-print XML files
  http://www.w3.org/People/Raggett/tidy/
* Uplug - tokenizer, sentence-splitter, XML-tools
  http://stp.ling.uu.se/plug/


