[Corpora-List] Re: license question

Sat Aug 19 07:39:41 UTC 2006

Alexander Paile wrote:
> Hej Lars,
> The corpus consists mainly of Finnish legislation texts and public 
> annual reports from different companies in Finland. I guess that could, 
> theoretically speaking, be a problem if somebody wants to be nasty. The 
> languages in question are Finnish and Swedish. When I was calling around 
> asking for material many companies just shrugged and sent me what they 
> had to get rid of me. I'm afraid of scaring them away if I start asking 
> them to sign papers. Most people don't know what a corpus is and they 
> couldn't care less. And they don't want to sign papers they don't 
> understand. On the other hand both the legislation texts and the company 
> reports are freely available and nobody probably ever thought of 
> licensing them in any way.
> 
> What kind of corpus is it? Well, it's a sentence-aligned Finnish-Swedish 
> parallel corpus of some 4 million words. The markup is CES XML. No 
> morphosyntactic tagging yet.
> 
> Oh, by the way. The sentences in the corpus files don't even necessarily 
> come in the same order that they did in the original texts. I'm not sure 
> that has any legal implications. We are thinking LGPL.
> 
> cheers
> 
> Alexander Paile
> 

Hi Alexander!

(I sent a similar post to this list some months ago.)

We distribute our parallel corpus under the CC Attribution license. The
LGPL is meant for software (for example, the license text refers to
source code, which does not make sense for a text corpus).

I think sentence shuffling solves your problem. As I see it, that is fair use.

Our copyright notice is:

Some raw materials used for the Hunglish corpus are under copyright
(literature, film subtitles, magazines). We prevented the illegal use
of copyrighted material by shuffling the texts at sentence level. This
form is still useful for research purposes, while it does not infringe
upon the rightholders' interests. If you are a copyright holder, and
you consider the shuffled files infringing, please send email and we
will remove the material in question from the corpus.

The Hunglish corpus is open for use (with the above restrictions) under
a Creative Commons Attribution licence; please refer to our publication.

This method can be used for web corpora as well. No URL lists are needed.
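
A minimal sketch of sentence-level shuffling for a sentence-aligned
parallel corpus, assuming the corpus is already loaded as a list of
(source, target) pairs. The function name shuffle_aligned and the sample
sentences are hypothetical and only for illustration; this is not the
actual Hunglish tooling. Each aligned pair is moved as one unit, so the
alignment survives while the original sentence order is lost.

```python
import random


def shuffle_aligned(pairs, seed=None):
    """Return a shuffled copy of a list of (source, target) sentence pairs.

    The input list is left untouched. Passing a seed makes the shuffle
    repeatable; leave it as None for a different order on every run.
    """
    rng = random.Random(seed)   # private generator, does not touch global state
    shuffled = list(pairs)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)       # in-place shuffle of the copy
    return shuffled


# Hypothetical Finnish-Swedish sample pairs (assumption, not real corpus data).
pairs = [
    ("Kiitos.", "Tack."),
    ("Ei.", "Nej."),
    ("Hei!", "Hej!"),
    ("Suomi", "Finland"),
]

shuffled = shuffle_aligned(pairs)
for source, target in shuffled:
    print(source, "\t", target)
```

Shuffling pairs rather than the two sides separately is the point of the
design: the result is still usable as a parallel corpus, but the running
text of the originals cannot be read from it.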

peter