[Corpora-List] Two new small aligned corpora

Francis Tyers spectre at ivixor.net
Thu Oct 4 09:25:11 UTC 2007


I'm working on a couple of small corpora for under-resourced languages.
They can be found here:

Southeastern European Times
http://xixona.dlsi.ua.es/~fran/setimes/
(9 Balkan languages + English)
Approx. 9,500 aligned paragraphs, ~100,000 words.
This corpus is public domain and has been automatically generated.

South African Government Services
http://xixona.dlsi.ua.es/~fran/afrikaans/index.html#corpora
(English + Afrikaans)
Approx. 2,500 aligned sentences, ~80,000 words.
This corpus is CC-BY-SA and GPL. It has been automatically generated and
then manually checked. Approximately 2,000-4,000 more sentence alignments
remain to be checked.

Please let me know if you find them useful. 

Regards,

Francis Tyers


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


