[Corpora-List] license question
Alexander Paile
alexander.paile at kotus.fi
Fri Aug 18 12:45:40 UTC 2006
Hi Lars,
The corpus consists mainly of Finnish legislation texts and public
annual reports from different companies in Finland. I guess that could,
theoretically speaking, be a problem if somebody wants to be nasty. The
languages in question are Finnish and Swedish. When I was calling around
asking for material, many companies just shrugged and sent me what they
had to get rid of me. I'm afraid of scaring them away if I start asking
them to sign papers. Most people don't know what a corpus is and they
couldn't care less. And they don't want to sign papers they don't
understand. On the other hand, both the legislation texts and the company
reports are freely available, and probably nobody ever thought of
licensing them in any way.
What kind of corpus is it? Well, it's a sentence-aligned Finnish-Swedish
parallel corpus of some 4 million words. The markup is CES XML. No
morphosyntactic tagging yet.
Oh, by the way. The sentences in the corpus files don't even necessarily
come in the same order that they did in the original texts. I'm not sure
that has any legal implications. We are thinking of the LGPL.
cheers
Alexander Paile
Lars Borin wrote:
> Hi Alexander,
>
>> Could you recommend a nice license to publish a parallel corpus under?
>
> Are you (the institute) the copyright holder for all texts in the corpus?
> If not, your hands are tied, as far as I understand, as only the
> copyright holder (the author of the original texts, or [possibly] whoever
> recorded them if they are transcriptions of spoken language; the translators
> of the translations; or their employers if their job contracts specify this)
> has the legal power to determine how the texts are to be distributed.
>
> What kind of corpus is this? Which language pair?
>
> Best
> Lars Borin
> Språkbanken/Swedish Language Bank
> Dept. of Swedish Language
> Göteborg University
>
>