[Corpora-List] copyright issues

Fri Feb 27 12:21:00 UTC 2009

Dear Andreas Kornai,

> getting hit with all kinds of notices, being held liable for vast
> amounts of damages, and in the end getting tarred, feathered, and ran
> out of town. I would be interested in hearing about any such cases.

Talking about newspaper corpora, such things actually happen, but they
seem to be normally resolved quietly after the corpus distributors get an
official warning in written form. The problem is that, since the
outsourcing boom in the 1990s, the rights to use published newspaper
articles have often been handed over to specialized content
redistribution companies. And these are quite sensitive about copyright,
as it is fundamental to their business model. Examples of German content
redistribution companies are http://www.pressemonitor.de and
http://www.vgwort.de.

I know of at least one such case, where a corpus was built and made
accessible without written permission. I think they even got an oral
confirmation to use the data when they started their work, but later the
people responsible at the publishing house couldn't recall it, maybe
because responsibilities had changed in an internal reorganization. So,
years later, they were confronted with a huge compensation fee (and
publication restrictions as well, I think).

On another occasion (different people, same publisher), the publisher was
contacted in advance. They explicitly allowed the creation of a corpus,
but only for the duration of the project. So this corpus (which already
exists) may be neither redistributed nor even stored beyond the end of
the project. However, as the corpus will receive only partial
annotations, this is not so problematic: only the annotated parts are
made available in the end, and in total they cover less than 15% of the
original text. According to (our interpretation of) German copyright law,
this is comparable to illustrative examples such as those quoted in
scientific papers and thus legally unproblematic (if the analogy holds).

> sees scholars sued for publishing
> their corpus, the risk seems to be bearable.

The problem is that we can never know what the economic models of the
future will be. So, if one day someone in the management gets the
impression (even a mistaken one) that this data has become economically
relevant, their lawyers will certainly find you. This actually happened
to the people mentioned above.

The problem is even worse, because it is not entirely clear what counts
as a derived work (annotations? statistical models trained on them?), and
to what degree the copyright owner of the original text also holds a
copyright on the derived work. If the copyright status of the corpus data
is problematic, then derived works may be problematic as well.

For this reason alone, it's safer to ask for a written agreement from the
publisher stating explicitly what you're allowed to do with the data. The
only legal alternative is to restrict your corpora to illustrative
examples, i.e., to use at most a fraction (e.g., <=15% per document as a
rule of thumb) of the original text. But even this practice does not
guarantee full legal security unless it is confirmed by some kind of
court ruling.

Best,
Christian Chiarcos

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora