[Corpora-List] unencumbered corpora

Francis Bond bond at cslab.kecl.ntt.co.jp
Mon Jan 24 08:12:23 UTC 2005


G'Day,

Lou Burnard <lou.burnard at computing-services.oxford.ac.uk> writes:

> Can anyone point me to any annotated language corpora which are freely
> available under something like the GNU Public Licence? All the ones I
> have thought of so far seem to be available only under some kind of
> complicated licensing scheme which precludes (e.g) commercial
> exploitation, unrestricted copying, etc. And cost money.

OPUS <http://logos.uio.no/opus/> sounds ideal. It includes many
European (and even non-European) texts, is freely available (GPL or
similar licenses), and is even POS-tagged and marked up in XML.

>
> I'd like to have a corpus of a reasonable size (1 million+ words) in any
> European language (tho English or French are preferable) with some
> kind of word-level annotation, which I can hack about, use in teaching,
> and put on a freely-distributable CD, without worrying about copyright
> lawyers. There *must* be some somewhere!

OPUS is already distributed on the Knorpora CD
<http://sslmit.unibo.it/%7ebaroni/welcome_to_knorpora.html>, a
modified version of the Knoppix 3.3 Live CD for students of
corpus-based computational linguistics.

> It doesn't even have to be in XML -- though it will be when I've
> finished with it.

--
Francis Bond  <www.kecl.ntt.co.jp/icl/mtg/members/bond/>
NTT Communication Science Laboratories | Machine Translation Research Group


