[Corpora-List] unencumbered corpora
Francis Bond
bond at cslab.kecl.ntt.co.jp
Mon Jan 24 08:12:23 UTC 2005
G'Day,
Lou Burnard <lou.burnard at computing-services.oxford.ac.uk> writes:
> Can anyone point me to any annotated language corpora which are freely
> available under something like the GNU Public Licence? All the ones I
> have thought of so far seem to be available only under some kind of
> complicated licensing scheme which precludes (e.g) commercial
> exploitation, unrestricted copying, etc. And cost money.
OPUS <http://logos.uio.no/opus/> sounds ideal. It includes many
European (and even non-European) texts, is freely available (GPL or
similar licenses), and is even POS-tagged and marked up in XML.
>
> I'd like to have a corpus of a reasonable size (1 million+ words) in any
> European language (tho English or French are preferable) with some
> kind of word-level annotation, which I can hack about, use in teaching,
> and put on a freely-distributable CD, without worrying about copyright
> lawyers. There *must* be some somewhere!
OPUS is already distributed on the Knorpora CD
<http://sslmit.unibo.it/%7ebaroni/welcome_to_knorpora.html>, a
modified version of the Knoppix 3.3 Live CD for students of
corpus-based computational linguistics.
> It doesn't even have to be in XML -- though it will be when I've
> finished with it.
--
Francis Bond <www.kecl.ntt.co.jp/icl/mtg/members/bond/>
NTT Communication Science Laboratories | Machine Translation Research Group