[Corpora-List] license question

Andy Roberts andyr at comp.leeds.ac.uk
Fri Aug 18 14:53:41 UTC 2006


Alexander,

Lars was spot on regarding the rights of the copyright holders. You
simply can't redistribute without consent, regardless of whether the
content was freely available or the owners seemed "not bothered".

For example, BBC News online is freely available, but I couldn't crawl
their site, compile a corpus and then redistribute under a license of my
choosing!

Whilst you may argue that a license like LGPL would ensure that the
corpus remained Free (that is, redistribution must stay under LGPL and
any modifications, if distributed, must also be released under LGPL), it
doesn't prevent people from charging for the corpus, nor does it prevent
its inclusion within a commercial product. This may not be acceptable to
the copyright holder who originally intended their materials to be used
within (not-for-profit?) research only.

I'm afraid you'll have to go back to the copyright owners and ask for
consent.

Regards,
Andy

On Fri, 18 Aug 2006, Alexander Paile wrote:

> Hi Lars,
> The corpus consists mainly of Finnish legislation texts and public annual 
> reports from different companies in Finland. I guess that could, 
> theoretically speaking, be a problem if somebody wants to be nasty. The 
> languages in question are Finnish and Swedish. When I was calling around 
> asking for material many companies just shrugged and sent me what they had to 
> get rid of me. I'm afraid of scaring them away if I start asking them to sign 
> papers. Most people don't know what a corpus is and they couldn't care less. 
> And they don't want to sign papers they don't understand. On the other hand 
> both the legislation texts and the company reports are freely available and 
> nobody probably ever thought of licensing them in any way.
>
> What kind of corpus is it? Well, it's a sentence-aligned Finnish-Swedish 
> parallel corpus of some 4 million words. The markup is CES XML. No 
> morphosyntactic tagging yet.
>
> Oh, by the way. The sentences in the corpus files don't even necessarily come 
> in the same order that they did in the original texts. I'm not sure that has 
> any legal implications. We are thinking LGPL.
>
> cheers
>
> Alexander Paile
>
>
> Lars Borin wrote:
>> Hi Alexander,
>> 
>>> Could you recommend a nice license to publish a parallel corpus under?
>> 
>> Are you (the institute) the copyright holder for all texts in the corpus?
>> If not, your hands are tied, as far as I understand, as only the
>> copyright holder (the author of original texts, or [possibly] whoever
>> recorded them if they are transcriptions of spoken language; the translators of
>> the translations; or their employers if their job contracts specify this)
>> has the legal power to determine how the texts are to be distributed.
>> 
>> What kind of corpus is this? Which language pair?
>> 
>> Best
>> Lars Borin
>> Språkbanken/Swedish Language Bank
>> Dept. of Swedish Language
>> Göteborg University
>> 
>> 
>

