[Corpora-List] Copyright question again

Djamé Seddah djame.seddah at free.fr
Wed Jan 7 00:19:28 UTC 2015


Thanks Yannick, John and all,
Shuffling actually looks like a viable option (though obviously not for discourse processing people), but I’m not sure that my institution will be willing to take that risk (its legal department doesn’t even answer such questions anyway).

Best,
Djamé



> On 6 Jan 2015, at 17:53, Yannick Versley <yversley at gmail.com> wrote:
> 
> Djamé, everyone,
> 
> I think that shuffling a corpus and the not-that-different method of cutting up corpora
> into n-gram databases both boil down to a use of corpora that would fall under the general
> intuition of "fair use". Prototypically, this means that
> (i) the creation that you distribute is not suitable for any use that the original creator had, or would have had, in mind, and thus is not in competition with the original creator
> (ii) the parts used are fine enough that the exact source does not matter or is not identifiable to a casual observer
> (iii) the person compiling the dataset has no commercial intent.
> 
> In US law, (i), (ii) and (iii) together would constitute a "fair use" defense, i.e. a defense against claims from the original copyright holder that you are misusing their works. In German law, for example, (ii) would mean that the threshold of protectability ("Schöpfungshöhe") is not met for these parts of the work. (In the US, people have - unsuccessfully - tried to sue the publisher of the Mary Poppins film music for the use of the word "supercalifragilisticexpialidocious".)
> 
> So, basically, Google and others are at the frilly borders of exceptions to the copyright law.
> The story of sampling in rap music demonstrates how these frilly borders adjust themselves when commercial interests are in play:
> http://www.alternet.org/story/18830/how_copyright_law_changed_hip_hop
> Quote: "The copyright laws didn't really extend into sampling until the hip-hop artists started getting sued. As a matter of fact, copyright didn't start catching up with us until Fear of a Black Planet. That's when the copyrights and everything started becoming stricter because you had a lot of groups doing it and people were taking whole songs."
> 
> The case law of copyright has to adapt to the technical possibilities. In other words, as soon as people are prancing around in others' front yard where it was quiet before, the interpretation of the law may be changed to keep the undesirables out.
> 
> In our case, we've got Google and other search engines, with a vested interest in preserving the ability to store parts of a page and show them to users in certain contexts that are independent from the original use, and you've got the newspapers, who would rather not see people using their content in any imaginable way without paying them. And then there are outlets like the Huffington Post, who make a living from the fact that you can (manually) synthesize the contents of news into new content without falling into the category of "derivative work".
> 
> As a thought experiment, think what would happen if automatic text synthesis techniques (i.e., abstractive multi-document summarization, just taken much further) got to the level of the Huffington Post, and all the newspapers could get hold of were a copyright on sentences or n-grams: we might see scrambled web corpora disappear as fast as Twitter corpora did when the company changed its TOS, or become limited to non-commercial use.
> 
> But this is just speculation. Currently the legality of it all has only been tested through common sense and not in the courts, not all borders and not all legislation are accounted for, and the scrambling approach seems to be something that you can get your organization's legal department to agree on without much arguing.
> 
> Best wishes,
> Yannick
> 
> On Tue, Jan 6, 2015 at 4:56 PM, Djamé Seddah <djame.seddah at free.fr> wrote:
> Dear everyone,
> I’ve heard that shuffling a corpus, so that its original sentence order cannot be retrieved, is enough and counts as a transformation, thus alleviating the risk of potential copyright infringement.  
> Can anyone confirm this?
> 
> Best and happy new year,
> 
> Djamé 
> 
> 
>> On 6 Jan 2015, at 16:04, Mcenery, Tony <a.mcenery at lancaster.ac.uk> wrote:
>> 
>> Thanks to all who have contributed to this thread - I have really enjoyed it. Khalid made a passing reference to the UK position - this has recently become quite permissive for non-commercial text mining research, but we have been debating back and forth in Lancaster exactly what this means for corpus linguists. Due to the case-law nature of English Law we won't really know until some cases have been brought forward and we are able to see how the laws/regulations are to be interpreted, hence Khalid's comment about the situation being unclear, I assume. Anyway, for those of you interested in the new exceptions to copyright in the UK, you can read all about it here:
>> 
>> https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375951/Education_and_Teaching.pdf
>> 
>>  
>> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Mark Davies [Mark_Davies at byu.edu]
>> Sent: 06 January 2015 13:36
>> To: corpora at uib.no
>> Subject: Re: [Corpora-List] Copyright question again
>> 
>> Marc Brysbaert wrote:
>> 
>> >> For what it is worth, in my experience word frequency lists and N-gram lists are not a problem. 
>> 
>> I agree. I've distributed COCA/COHA word frequency (http://www.wordfrequency.info) and n-grams (http://www.ngrams.info) data for several years now, and I've never had any issues.
>> 
>> >> The big problem we are encountering is that currently there is no guidance about whether corpora can be shared. As a result, nearly all corpora assembled remain next to inaccessible, meaning that everyone has to collect their own corpus. This is a lot of needless work and also means that little cumulative work can be done.
>> 
>> I've also been distributing "full-text" data from the 450-million-word COCA and the 1.9-billion-word GloWbE (http://corpus.byu.edu/glowbe) for a while now, and again no problems to this point. There is a "twist", though, in terms of how the full-text data has been slightly altered to avoid copyright problems:
>> 
>> http://corpus.byu.edu/full-text/limitations.asp
>> 
>> Best,
>> 
>> Mark D.
>> 
>> ============================================
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> http://davies-linguistics.byu.edu/
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> 
> 
