[Corpora-List] POS-tagger maintenance and improvement

Thu Feb 26 17:45:20 UTC 2009

On Thu, Feb 26, 2009 at 08:29:53AM +0100, Francis Tyers wrote:
> > assembled a large aligned Hungarian-English corpus (Hunglish, LDC2008T01)
> > which is chock-full of copyrighted text, by the simple expedient of merging
> > all text files in one and alphabetically sorting the sentences leaving a
> > portion out, which makes it more labor-intensive to restore the original
> > documents than keying them in.  This method (blessed by the UPenn lawyers)
> 
> Actually, I was thinking of doing something similar, but was led to
> believe that the text was still copyrighted... even if it was sorted and
> thus couldn't be distributed under a free licence -- for example the BSD
> or LGPL.
> 
> Do you have by any chance a written statement from the UPenn lawyers
> regarding this?

They haven't provided a separate written statement (nor have we asked for
one) but they did explain their reasoning. Let R be the copyright holder of
some work, B be a potential buyer, and M be the maker of the corpus.  The
prime mover behind copyright cases is economic harm. As long as M sells
copyrighted material, or even gives it away, M is taking away B's reason to
buy from the source that would pay royalties to R, so M is causing
economic harm. Here it is clear that no harm is done, since the users of
your corpus have not actually gained access to the copyrighted work and the
corpus can't be exploited for pirate editions. 
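
To make the merge-drop-sort method concrete, here is a minimal sketch in
Python. It is not the script actually used to build Hunglish; the function
name, the 10% drop rate, and the seed are assumptions for illustration only.
For an aligned corpus each "sentence" would be a Hungarian-English pair.

```python
import random

def scramble_corpus(documents, drop_fraction=0.1, seed=0):
    """Merge documents, leave out a random portion of sentences, sort the rest.

    documents: list of documents, each a list of sentence strings.
    drop_fraction and seed are hypothetical parameters, not values from
    the Hunglish release.
    """
    rng = random.Random(seed)
    # Merge: pool the sentences of all documents, discarding document boundaries.
    pooled = [sentence for doc in documents for sentence in doc]
    # Leave a portion out, so that no document can be reassembled in full.
    kept = [s for s in pooled if rng.random() >= drop_fraction]
    # Alphabetical sorting destroys the original sentence order.
    return sorted(kept)
```

Sentence-level statistics survive this treatment, which is all that training
a tagger or an alignment model needs, while the running text does not.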

> Actually, I just looked up the licence agreement for the Hunglish
> corpus:
> 
> "1.2. User shall not publish, retransmit, display, redistribute,
> reproduce or commercially exploit the Data in any form, except that User
> may include limited excerpts from the Data in articles, reports and
> other documents describing the results of User's linguistic education
> and research. "
> 
> So I guess the answer to my question is no.

This is the generic LDC policy, and again it doesn't enjoin you from the
main goal you'd want to use a corpus for, namely training and testing
computational linguistic models. Whether using the trained system in a
for-profit system would be infringing I'm not sure, IANAL. But the world is
full of systems that were optimized on LDC corpora, probably because these
works, from an economic standpoint, do not harm the copyright holders. From
a legal standpoint I'm not sure; this may even depend on the laws of the
country you are in, but in a large corpus the impact of any single work on
training is so minimal that "de minimis non curat lex" is probably applicable.

So the WSJ could possibly come after you if you used in a commercial system
a model trained only on the WSJ (I say possibly since you still have the
"transformative use" defense) but why would you ever want to do such a
thing?  A pure WSJ model already shows signs of strain on the NYT, and if
your goal is a system that works on journalistic prose you are far better
off training it on a broad mixture of newspaper sources. If, on the other
hand, your goal is to do something value added specifically for WSJ readers,
you should be getting the opinion of WSJ lawyers anyway. 

Andras Kornai, NAL

PS. In the hope of steering the conversation back to Adam's original point,
let me say here that even if one were inclined to dispute the statement
that the use of some copyrighted work is de minimis, surely corrections to
this work are de minimis!
