[Corpora-List] Copyright question again

John D Burger john at mitre.org
Tue Jan 6 16:36:13 UTC 2015


Shuffling is exactly the approach we took with several corpora we shared through the Linguistic Data Consortium:

  https://catalog.ldc.upenn.edu/LDC2012T22
  https://catalog.ldc.upenn.edu/LDC2013T02
  https://catalog.ldc.upenn.edu/LDC2012T23

Both MITRE's and the LDC's lawyers signed off on this, for what it's worth.
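
As an illustration only (this is not the procedure used for the LDC releases above, whose details are not given here), sentence-level shuffling could be sketched in Python roughly as follows. The sketch assumes the corpus has already been split into one sentence per list element; the function name shuffle_sentences is hypothetical.

```python
import random


def shuffle_sentences(sentences, rng=None):
    """Return a new list with the sentences in random order.

    Assumption: `sentences` is a sequence of already-segmented sentences.
    By default the permutation is drawn from the operating system's entropy
    source (random.SystemRandom), so there is no seed that could be reused
    to replay the permutation and restore the original order.
    """
    rng = rng or random.SystemRandom()
    shuffled = list(sentences)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)        # in-place Fisher-Yates shuffle of the copy
    return shuffled


if __name__ == "__main__":
    corpus = [
        "This is the first sentence.",
        "Here is a second one.",
        "And a third follows.",
        "The fourth closes the text.",
    ]
    for sentence in shuffle_sentences(corpus):
        print(sentence)
```

Whether a shuffle of this kind is sufficient in a given jurisdiction is a legal question that the code does not answer.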

- John Burger
  MITRE

On Jan 6, 2015, at 10:56, Djamé Seddah <djame.seddah at free.fr> wrote:

> Dear everyone,
> I’ve heard that shuffling a corpus, so that its original sentence order cannot be retrieved, is enough and counts as a transformation, thus alleviating the risk of potential copyright infringement.  
> Can anyone confirm this?
> 
> Best and happy new year,
> 
> Djamé 
> 
> 
>> On 6 Jan 2015, at 16:04, Mcenery, Tony <a.mcenery at lancaster.ac.uk> wrote:
>> 
>> Thanks to all who have contributed to this thread - I have really enjoyed it. Khalid made a passing reference to the UK position - this has recently become quite permissive for non-commercial text mining research, but we have been debating back and forth in Lancaster exactly what this means for corpus linguists. Due to the case-law nature of English Law we won't really know until some cases have been brought forward and we are able to see how the laws/regulations are to be interpreted, hence Khalid's comment about the situation being unclear, I assume. Anyway, for those of you interested in the new exceptions to copyright in the UK, you can read all about it here:
>> 
>> https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375951/Education_and_Teaching.pdf
>> 
>>  
>> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Mark Davies [Mark_Davies at byu.edu]
>> Sent: 06 January 2015 13:36
>> To: corpora at uib.no
>> Subject: Re: [Corpora-List] Copyright question again
>> 
>> Marc Brysbaert wrote:
>> 
>> >> For what it is worth, in my experience word frequency lists and N-gram lists are not a problem. 
>> 
>> I agree. I've distributed COCA/COHA word frequency (http://www.wordfrequency.info) and n-grams (http://www.ngrams.info) data for several years now, and I've never had any issues.
>> 
>> >> The big problem we are encountering is that currently there is no guidance about whether corpora can be shared. As a result, nearly all corpora assembled remain next to inaccessible, meaning that everyone has to collect their own corpus. This is a lot of needless work and also means that little cumulative work can be done.
>> 
>> I've also been distributing "full-text" data from the 450-million-word COCA and the 1.9-billion-word GloWbE (http://corpus.byu.edu/glowbe) for a while now, and again there have been no problems to this point. There is a "twist", though, in terms of how the full-text data has been slightly altered to avoid copyright problems:
>> 
>> http://corpus.byu.edu/full-text/limitations.asp
>> 
>> Best,
>> 
>> Mark D.
>> 
>> ============================================
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> http://davies-linguistics.byu.edu/
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
> 

