[Corpora-List] Copyright question again
John D Burger
john at mitre.org
Tue Jan 6 16:36:13 UTC 2015
Shuffling is exactly the approach we took with several corpora we shared through the Linguistic Data Consortium:
https://catalog.ldc.upenn.edu/LDC2012T22
https://catalog.ldc.upenn.edu/LDC2013T02
https://catalog.ldc.upenn.edu/LDC2012T23
Both MITRE's and the LDC's lawyers signed off on this, for what it's worth.
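For readers unfamiliar with the idea, here is a minimal sketch of sentence-level shuffling. It is an illustration only, not the procedure used for the LDC releases above, and it says nothing about whether shuffling is legally sufficient. The helper name `shuffle_sentences` and the sample sentences are hypothetical, and the sketch assumes the text has already been split into sentences.

```python
import random


def shuffle_sentences(sentences):
    """Return a copy of `sentences` in a random order.

    Hypothetical helper for illustration; not taken from any released corpus.
    """
    shuffled = list(sentences)  # copy so the caller's list is left untouched
    # SystemRandom draws from the OS entropy source, so there is no seed
    # that could later be reused to reproduce (and invert) the permutation.
    random.SystemRandom().shuffle(shuffled)
    return shuffled


# Made-up sample data standing in for a sentence-split document.
document = [
    "First sentence of the source text.",
    "Second sentence of the source text.",
    "Third sentence of the source text.",
    "Fourth sentence of the source text.",
]
released = shuffle_sentences(document)
```

The design choice worth noting is the unseeded random source: with a fixed, known seed the permutation could be recomputed and undone, which would defeat the purpose of making the original order unrecoverable. In practice one would presumably shuffle across the whole corpus rather than within a single document.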
- John Burger
MITRE
On Jan 6, 2015, at 10:56, Djamé Seddah <djame.seddah at free.fr> wrote:
> Dear everyone,
> I’ve heard that shuffling a corpus, so that its original sentence order cannot be retrieved, is enough and counts as a transformation, thus alleviating the risk of potential copyright infringement.
> Can anyone confirm this?
>
> Best and happy new year,
>
> Djamé
>
>
>> On 6 Jan 2015, at 16:04, Mcenery, Tony <a.mcenery at lancaster.ac.uk> wrote:
>>
>> Thanks to all who have contributed to this thread - I have really enjoyed it. Khalid made a passing reference to the UK position, which has recently become quite permissive for non-commercial text-mining research, but we have been debating back and forth at Lancaster exactly what this means for corpus linguists. Because English law relies on case law, we won't really know until some cases have been brought and we can see how the laws and regulations are interpreted; hence, I assume, Khalid's comment about the situation being unclear. For those of you interested in the new exceptions to copyright in the UK, you can read about them here:
>>
>> https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375951/Education_and_Teaching.pdf
>>
>>
>> From: corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Mark Davies [Mark_Davies at byu.edu]
>> Sent: 06 January 2015 13:36
>> To: corpora at uib.no
>> Subject: Re: [Corpora-List] Copyright question again
>>
>> Marc Brysbaert wrote:
>>
>> >> For what it is worth, in my experience word frequency lists and N-gram lists are not a problem.
>>
>> I agree. I've distributed COCA/COHA word frequency (http://www.wordfrequency.info) and n-gram (http://www.ngrams.info) data for several years now, and I've never had any issues.
>>
>> >> The big problem we are encountering is that currently there is no guidance about whether corpora can be shared. As a result, nearly all corpora assembled remain next to inaccessible, meaning that everyone has to collect their own corpus. This is a lot of needless work and also means that little cumulative work can be done.
>>
>> I've also been distributing "full-text" data from the 450-million-word COCA and the 1.9-billion-word GloWbE (http://corpus.byu.edu/glowbe) for a while now, again with no problems so far. There is a "twist", though, in how the full-text data has been slightly altered to avoid copyright problems:
>>
>> http://corpus.byu.edu/full-text/limitations.asp
>>
>> Best,
>>
>> Mark D.
>>
>> ============================================
>> Mark Davies
>> Professor of Linguistics / Brigham Young University
>> http://davies-linguistics.byu.edu/
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>