[Corpora-List] Copyright question again

Yannick Versley yversley at gmail.com
Tue Jan 6 16:53:38 UTC 2015


Djamé, everyone,

I think that shuffling a corpus and the not-that-different method of
cutting corpora up into n-gram databases both boil down to a use of corpora
that falls under the general intuition of "fair use". Prototypically, this
means that
(i) what you distribute is not suitable for any use that the original
creator had, or would have had, in mind, and thus does not compete with the
original creator,
(ii) the parts used are fine-grained enough that the exact source does not
matter or is not identifiable to a casual observer, and
(iii) the person compiling the dataset has no commercial intent.

In US law, (i), (ii) and (iii) together would support a "fair use"
defense, i.e. a defense against claims from the original copyright holder
that you are misusing their works. In German law, for example, (ii) would
mean that these parts of the work do not meet the threshold of originality
("Schöpfungshöhe") required for protection. (In the US, people have tried,
unsuccessfully, to sue the publisher of the Mary Poppins film music over the
use of the word "supercalifragilisticexpialidocious".)

So, basically, Google and others are at the frilly borders of exceptions to
the copyright law.
The story of sampling in rap music demonstrates how these frilly borders
adjust themselves when commercial interests are in play:
http://www.alternet.org/story/18830/how_copyright_law_changed_hip_hop
Quote: "*The copyright laws didn't really extend into sampling until the
hip-hop artists started getting sued. As a matter of fact, copyright didn't
start catching up with us until Fear of a Black Planet. That's when the
copyrights and everything started becoming stricter because you had a lot
of groups doing it and people were taking whole songs."*

The case law of copyright has to adapt to the technical possibilities. In
other words, as soon as people are prancing around in others' front yard
where it was quiet before, the interpretation of the law may be changed to
keep the undesirables out.

In our case, we've got Google and other search engines, with a vested
interest in preserving the ability to store parts of a page and show them
to users in contexts that are independent of the original use, and we've
got the newspapers, who would rather not see people using their content in
any imaginable way without paying them. And then there are outlets like the
Huffington Post, which make a living from the fact that you can (manually)
synthesize news content into new content without it falling into the
category of "derivative work".

As a thought experiment, think what would happen if automatic text
synthesis techniques (i.e., abstractive multi-document summarization, just
taken much farther) got to the level of the Huffington Post, so that all the
newspapers could get hold of would be a copyright on sentences or n-grams.
We might then see scrambled web corpora disappear as fast as Twitter corpora
did when the company changed its TOS, or become limited to non-commercial
use.

But this is just speculation. Currently the legality of it all has only been
tested by common sense and not in court, not all borders and not all
jurisdictions are accounted for, and the scrambling approach seems to be
something that you can get your organization's legal department to agree to
without much arguing.
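
For concreteness, here is a minimal sketch of the two transformations
discussed above: shuffling sentences so that the original order cannot be
read off the output, and reducing a corpus to n-gram counts. The function
names, the toy corpus and the whitespace tokenization are assumptions made
for illustration only; this is not the procedure used by any of the corpora
mentioned in this thread.

```python
import random
from collections import Counter


def shuffle_sentences(sentences, seed=None):
    """Return a new list with the sentences in random order."""
    rng = random.Random(seed)
    shuffled = list(sentences)  # copy, so the input list is left untouched
    rng.shuffle(shuffled)
    return shuffled


def count_ngrams(sentences, n=3):
    """Count word n-grams within each sentence (whitespace tokenization)."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy corpus, one sentence per entry.
corpus = ["the cat sat", "the cat slept", "a dog barked"]
scrambled = shuffle_sentences(corpus, seed=0)
bigrams = count_ngrams(corpus, n=2)
```

Note that n-gram counts taken within sentences are the same before and after
shuffling, which is one way to see why the two methods are "not that
different": both keep local word sequences and discard document structure.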

Best wishes,
Yannick

On Tue, Jan 6, 2015 at 4:56 PM, Djamé Seddah <djame.seddah at free.fr> wrote:

> Dear everyone,
> I’ve heard that shuffling a corpus, so that its original sentence order
> cannot be retrieved, is enough and counts as a transformation, thus
> alleviating the risk of potential copyright infringement.
> Can anyone confirm this?
>
> Best and happy new year,
>
> Djamé
>
>
> On 6 Jan 2015, at 16:04, Mcenery, Tony <a.mcenery at lancaster.ac.uk>
> wrote:
>
> Thanks to all who have contributed to this thread - I have really enjoyed
> it. Khalid made a passing reference to the UK position - this has recently
> become quite permissive for non-commercial text mining research, but we
> have been debating back and forth in Lancaster exactly what this means for
> corpus linguists. Due to the case-law nature of English Law we won't really
> know until some cases have been brought forward and we are able to see how
> the laws/regulations are to be interpreted, hence Khalid's comment about
> the situation being unclear, I assume. Anyway, for those of you interested
> in the new exceptions to copyright in the UK, you can read all about it
> here:
>
>
> https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375951/Education_and_Teaching.pdf
>
>
> ------------------------------
> *From:* corpora-bounces at uib.no [corpora-bounces at uib.no] on behalf of Mark
> Davies [Mark_Davies at byu.edu]
> *Sent:* 06 January 2015 13:36
> *To:* corpora at uib.no
> *Subject:* Re: [Corpora-List] Copyright question again
>
> Marc Brysbaert wrote:
>
> >> For what it is worth, in my experience word frequency lists and N-gram
> >> lists are not a problem.
>
> I agree. I've distributed COCA/COHA word frequency (
> http://www.wordfrequency.info) and n-grams (http://www.ngrams.info) data
> for several years now, and I've never had any issues.
>
> >> The big problem we are encountering is that currently there is no
> >> guidance about whether corpora can be shared. As a result, nearly all
> >> corpora assembled remain next to inaccessible, meaning that everyone has
> >> to collect their own corpus. This is a lot of needless work and also
> >> means that little cumulative work can be done.
>
> I've also been distributing "full-text" data from the 450 million word COCA
> and the 1.9 billion word GloWbE (http://corpus.byu.edu/glowbe) for a
> while now, and again no problems to this point. There is a "twist", though,
> in terms of how the full-text data has been slightly altered to
> avoid copyright problems:
>
> http://corpus.byu.edu/full-text/limitations.asp
>
> Best,
>
> Mark D.
>
> ============================================
> Mark Davies
> Professor of Linguistics / Brigham Young University
> http://davies-linguistics.byu.edu/
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> ============================================
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora