<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Thanks Yannick, John and all,<div class="">Shuffling actually looks like a viable option (though obviously not for people working on discourse processing), but I’m not sure that my institution will be willing to take that risk (its legal department doesn’t even answer such questions anyway).<div class=""><br class=""></div><div class="">Best,</div><div class="">Djamé</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""><div class=""><div class=""><div><blockquote type="cite" class=""><div class="">On 6 Jan 2015, at 17:53, Yannick Versley <<a href="mailto:yversley@gmail.com" class="">yversley@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Djamé, everyone,<div class=""><br class=""></div><div class="">I think that shuffling a corpus and the not-that-different method of cutting up corpora into n-gram databases both boil down to a use of corpora that would fall under the general intuition of "fair use". Prototypically, this means:</div><div class="">(i) the creation that you distribute is not suitable for any use that the original creator had, or would have had, in mind, and is thus not in competition with the original creator;</div><div class="">(ii) the parts used are fine-grained enough that the exact source does not matter or is not identifiable to a casual observer;</div><div class="">(iii) the person compiling the dataset has no commercial intent.</div><div class=""><br class=""></div><div class="">In US law, (i), (ii) and (iii) together would constitute a "fair use" defense, i.e. a defense against claims from the original copyright holder that you are misusing their works. 
In German law, for example, (ii) would mean that the criterion for protectability (the threshold of originality, "Schöpfungshöhe") is not met for these parts of the work. (In the US, people have tried, unsuccessfully, to sue the publisher of the Mary Poppins film music over the use of the word "supercalifragilisticexpialidocious".)</div><div class=""><br class=""></div><div class="">So, basically, Google and others are at the frilly borders of the exceptions to copyright law.</div><div class="">The story of sampling in rap music demonstrates how these frilly borders adjust themselves when commercial interests are in play:</div><div class=""><a href="http://www.alternet.org/story/18830/how_copyright_law_changed_hip_hop" target="_blank" class="">http://www.alternet.org/story/18830/how_copyright_law_changed_hip_hop</a><br class=""></div><div class="">Quote: "<i class="">The copyright laws didn't really extend into sampling until the hip-hop artists started getting sued. As a matter of fact, copyright didn't start catching up with us until Fear of a Black Planet. That's when the copyrights and everything started becoming stricter because you had a lot of groups doing it and people were taking whole songs."</i></div><div class=""><br class=""></div><div class="">The case law of copyright has to adapt to what is technically possible. In other words, as soon as people start prancing around in others' front yards where it was quiet before, the interpretation of the law may be changed to keep the undesirables out.</div><div class=""><br class=""></div><div class="">In our case, we've got Google and other search engines, with a vested interest in preserving the ability to store parts of a page and show them to users in certain contexts that are independent of the original use, and you've got the newspapers, who would rather not see people using their content in any imaginable way without paying them. 
Then there are outlets like the Huffington Post, which make a living from the fact that you can (manually) synthesize news content into new content without it falling into the category of "derivative work".</div><div class=""><br class=""></div><div class="">As a thought experiment, think about what would happen if automatic text synthesis techniques (i.e., abstractive multi-document summarization, just taken much farther) reached the level of the Huffington Post, and all the newspapers could get hold of was a copyright on sentences or n-grams: we might see scrambled web corpora disappear as fast as Twitter corpora did when the company changed its TOS, or become limited to non-commercial use.</div><div class=""><br class=""></div><div class="">But this is just speculation. Currently, the legality of it all has been tested only by common sense and not in court, not all jurisdictions and not all legislation are accounted for, and the scrambling approach seems to be something that you can get your organization's legal department to agree to without much arguing.</div><div class=""><br class=""></div><div class="">Best wishes,</div><div class="">Yannick</div></div><div class="gmail_extra"><br class=""><div class="gmail_quote">On Tue, Jan 6, 2015 at 4:56 PM, Djamé Seddah <span dir="ltr" class=""><<a href="mailto:djame.seddah@free.fr" target="_blank" class="">djame.seddah@free.fr</a>></span> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="">Dear everyone,<div class="">I’ve heard that shuffling a corpus, so that its original sentence order cannot be retrieved, is enough to count as a transformation, thus reducing the risk of copyright infringement.  
</div><div class="">Can anyone confirm this?</div><div class=""><br class=""></div><div class="">Best and happy new year,</div><div class=""><br class=""></div><div class="">Djamé </div><div class=""><br class=""></div><div class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">Le 6 janv. 2015 à 16:04, Mcenery, Tony <<a href="mailto:a.mcenery@lancaster.ac.uk" target="_blank" class="">a.mcenery@lancaster.ac.uk</a>> a écrit :</div><br class=""><div class=""><div class=""><div class="h5"><div style="font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);direction:ltr;font-family:Tahoma;font-size:10pt" class=""><font face="Tahoma, Geneva, sans-serif" class="">Thanks to all who have contributed to this thread - I have really enjoyed it. Khalid made a passing reference to the UK position - this has recently become quite permissive for non-commercial text mining research, but we have been debating back and forth in Lancaster exactly what this means for corpus linguists. Due to the case-law nature of English Law we won't really know until some cases have been brought forward and we are able to see how the laws/regulations are to be interpreted, hence Khalid's comment about the situation being unclear, I assume. 
Anyway, for those of you interested in the new exceptions to copyright in the UK, you can read all about it here:</font><div style="font-family:Tahoma,Geneva,sans-serif;font-size:10pt" class=""><br class=""></div><div class=""><font face="Tahoma, Geneva, sans-serif" class=""><a href="https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375951/Education_and_Teaching.pdf" style="color:purple;text-decoration:underline" target="_blank" class="">https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375951/Education_and_Teaching.pdf</a></font><br class=""><div style="font-family:Tahoma,Geneva,sans-serif;font-size:10pt" class=""><br class=""><div style="font-family:Tahoma;font-size:13px" class=""> </div></div><div style="font-family:'Times New Roman';font-size:16px" class=""><hr class=""><div style="direction:ltr" class=""><font face="Tahoma" class=""><b class="">From:</b><span class=""> </span><a href="mailto:corpora-bounces@uib.no" style="color:purple;text-decoration:underline" target="_blank" class="">corpora-bounces@uib.no</a><span class=""> </span>[<a href="mailto:corpora-bounces@uib.no" style="color:purple;text-decoration:underline" target="_blank" class="">corpora-bounces@uib.no</a>] on behalf of Mark Davies [<a href="mailto:Mark_Davies@byu.edu" style="color:purple;text-decoration:underline" target="_blank" class="">Mark_Davies@byu.edu</a>]<br class=""><b class="">Sent:</b><span class=""> </span>06 January 2015 13:36<br class=""><b class="">To:</b><span class=""> </span><a href="mailto:corpora@uib.no" style="color:purple;text-decoration:underline" target="_blank" class="">corpora@uib.no</a><br class=""><b class="">Subject:</b><span class=""> </span>Re: [Corpora-List] Copyright question again<br class=""></font><br class=""></div><div class=""></div><div class=""><div style="margin-top:0px;margin-bottom:0px" class="">Marc Brysbaert wrote:<br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><br 
class=""></div><div style="margin-top:0px;margin-bottom:0px" class="">>> <span style="color:rgb(31,73,125);font-family:Calibri,sans-serif;font-size:11pt" class="">For what it is worth, in my experience word frequency lists and N-gram lists are not a problem. </span></div><div class=""><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)" class=""><br class=""></span></div><div class=""><span style="font-size:11pt;font-family:Calibri,sans-serif;color:rgb(31,73,125)" class="">I agree. I've distributed COCA/COHA word frequency (<a href="http://www.wordfrequency.info/" style="color:purple;text-decoration:underline" target="_blank" class="">http://www.wordfrequency.info</a>) and n-grams (<a href="http://www.ngrams.info/" style="color:purple;text-decoration:underline" target="_blank" class="">http://www.ngrams.info</a>) data for several years now, and I've never had any issues.</span></div><div style="margin-top:0px;margin-bottom:0px" class=""><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class="">>> <span style="color:rgb(31,73,125);font-family:Calibri,sans-serif;font-size:15px;background-color:rgb(255,255,255)" class="">The big problem we are encountering is that currently there is no guidance about whether corpora can be shared. As a result, nearly all corpora assembled remain next to inaccessible, meaning that everyone has to collect their own corpus. This is a lot of needless work and also means that little cumulative work can be done.</span><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class="">I've also been distributing "full-text" data from 450 million word COCA and the 1.9 billion word GloWbE (<a href="http://corpus.byu.edu/glowbe" style="color:purple;text-decoration:underline" target="_blank" class="">http://corpus.byu.edu/glowbe</a>) for a while now, and again no problems to this point. 
There is a "twist", though, in terms of how the full-text data has been slightly altered to avoid copyright problems:<br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><a href="http://corpus.byu.edu/full-text/limitations.asp" style="color:purple;text-decoration:underline" target="_blank" class="">http://corpus.byu.edu/full-text/limitations.asp</a><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class="">Best,<br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><br class=""></div><div style="margin-top:0px;margin-bottom:0px" class="">Mark D.<br class=""></div><div style="margin-top:0px;margin-bottom:0px" class=""><br class=""></div><div class=""><div style="font-family:Tahoma;font-size:13px" class=""><div style="font-family:Tahoma;font-size:13px" class=""><div style="margin-top:0px;margin-bottom:0px" class="">============================================<br class="">Mark Davies<br class="">Professor of Linguistics / Brigham Young University<br class=""><a href="http://davies-linguistics.byu.edu/" style="color:purple;text-decoration:underline" target="_blank" class="">http://davies-linguistics.byu.edu/</a></div><div style="margin-top:0px;margin-bottom:0px" class="">** Corpus design and use // Linguistic databases **<br class="">** Historical linguistics // Language variation **<br class="">** English, Spanish, and Portuguese **<br class="">============================================<br class=""></div></div></div></div></div></div></div></div></div></div><span class=""><span 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);float:none;display:inline!important" class="">_______________________________________________</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" class=""><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);float:none;display:inline!important" class="">UNSUBSCRIBE from this page:<span class=""> </span></span><a href="http://mailman.uib.no/options/corpora" style="color:purple;text-decoration:underline;font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" target="_blank" class="">http://mailman.uib.no/options/corpora</a><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" class=""><span 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);float:none;display:inline!important" class="">Corpora mailing list</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" class=""><a href="mailto:Corpora@uib.no" style="color:purple;text-decoration:underline;font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" target="_blank" class="">Corpora@uib.no</a><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" class=""><a href="http://mailman.uib.no/listinfo/corpora" style="color:purple;text-decoration:underline;font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" target="_blank" class="">http://mailman.uib.no/listinfo/corpora</a><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255)" 
class=""></span></div></blockquote></div><br class=""></div></div><br class="">_______________________________________________<br class="">

UNSUBSCRIBE from this page: <a href="http://mailman.uib.no/options/corpora" target="_blank" class="">http://mailman.uib.no/options/corpora</a><br class="">

Corpora mailing list<br class="">

<a href="mailto:Corpora@uib.no" class="">Corpora@uib.no</a><br class="">

<a href="http://mailman.uib.no/listinfo/corpora" target="_blank" class="">http://mailman.uib.no/listinfo/corpora</a><br class="">

<br class=""></blockquote></div><br class=""></div>

</div></blockquote></div><br class=""></div></div></div></div></body></html>