[Corpora-List] license question

Fri Aug 18 12:45:40 UTC 2006

Hej Lars,
The corpus consists mainly of Finnish legislation texts and public 
annual reports from different companies in Finland. I guess that could, 
theoretically speaking, be a problem if somebody wants to be nasty. The 
languages in question are Finnish and Swedish. When I was calling around 
asking for material, many companies just shrugged and sent me what they 
had to get rid of me. I'm afraid of scaring them away if I start asking 
them to sign papers. Most people don't know what a corpus is and they 
couldn't care less. And they don't want to sign papers they don't 
understand. On the other hand, both the legislation texts and the company 
reports are freely available, and probably nobody ever thought of 
licensing them in any way.

What kind of corpus is it? Well, it's a sentence-aligned Finnish-Swedish 
parallel corpus of some 4 million words. The markup is CES XML. No 
morphosyntactic tagging yet.

Oh, by the way. The sentences in the corpus files don't even necessarily 
come in the same order that they did in the original texts. I'm not sure 
that has any legal implications. We are thinking LGPL.

cheers

Alexander Paile

Lars Borin wrote:
> Hi Alexander,
> 
>> Could you recommend a nice license to publish a parallel corpus under?
> 
> Are you (the institute) the copyright holder for all texts in the corpus?
> If not, your hands are tied, as far as I understand, as only the
> copyright holder (the author of original texts, or [possibly] whoever
> recorded them if they are transcriptions of spoken lg; the translators of
> the translations; or their employers if their job contracts specify this)
> has the legal power to determine how the texts are to be distributed.
> 
> What kind of corpus is this? Which language pair?
> 
> Best
> Lars Borin
> Språkbanken/Swedish Language Bank
> Dept. of Swedish Language
> Göteborg University
> 
>