[Corpora-List] copyright issues
Christian Chiarcos
christian.chiarcos at web.de
Fri Feb 27 12:21:00 UTC 2009
Dear Andreas Kornai,
> getting hit with all kinds of notices, being held liable for vast
> amounts of damages, and in the end getting tarred, feathered, and ran
> out of town. I would be interested in hearing about any such cases.
As for newspaper corpora, such things do actually happen, but they usually
seem to be settled quietly once the corpus distributors receive an official
written warning. The problem is that, since the outsourcing boom of the
1990s, the rights to use published newspaper articles have often been
handed over to specialized content redistribution companies. These
companies are quite sensitive about copyright, as it is fundamental to
their business model. Examples of German content redistribution companies
are http://www.pressemonitor.de and http://www.vgwort.de.
I know of at least one such case, where a corpus was built and made
accessible without written permission. I think the builders even got oral
confirmation that they could use the data when they started their work,
but later the people responsible at the publishing house could not recall
it, perhaps because responsibilities had changed in an internal
reorganization. So, years later, they were confronted with a huge demand
for compensation (and publication restrictions as well, I think).
On another occasion (different people, same publisher), the publisher was
contacted in advance. They explicitly allowed the creation of a corpus,
but only for the duration of the project. So this corpus (which actually
already exists) may be neither redistributed nor even stored beyond the
end of the project. However, as the corpus will receive only partial
annotations, this is not so problematic: only the annotated parts will be
made available in the end, and in total these cover less than 15% of the
original text. According to (our interpretation of) German copyright law,
this is comparable to illustrative examples such as those quoted in
scientific papers and thus legally unproblematic (if the analogy holds).
> sees scholars sued for publishing
> their corpus, the risk seems to be bearable.
The problem is that we never know the economic models of the future. So
if, one day, someone in management gets the impression (even a mistaken
one) that this data has become economically relevant, their lawyers
will certainly find you. This actually happened to the people mentioned
above.
The problem is even worse because it is not entirely clear what counts as
a derived work (annotations? statistical models trained on them?), and
to what degree the copyright owner of the original text also holds a
copyright on the derived work. If the copyright status of the corpus data
is problematic, then derived works may be problematic as well.
For this reason at least, it is safer to ask the publisher for a written
agreement stating explicitly what you are allowed to do with the data.
The only legal alternative is to restrict your corpora to illustrative
examples, i.e., to use at most a fraction (e.g., <=15% per document as a
rule of thumb) of the original text.
But even this practice does not guarantee full legal certainty unless it
is confirmed by some kind of court ruling.
Best,
Christian Chiarcos
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora