[Corpora-List] Legal aspects of compiling corpora

Mark Sanderson m.sanderson at sheffield.ac.uk
Fri Jun 13 20:14:45 UTC 2003


There are a number of examples of initiatives like the OTA. The problem is
that researchers want to move from a few gigabytes of text to a few
terabytes. No single organisation has that amount of data as far as I know;
the only way to work with that much is to do things like crawl the Web,
where you are pulling down texts created by countless people and
organisations. I don't see how license agreements might be obtained.

We have a copy of quite a large crawl of Web data, and the one thing that
lets me sleep at night is the thought that Google has far more money
than me or probably my University, they have a collection many times bigger
than mine, and they haven't (yet) been sued as far as I know. They
are making money from holding the data and I'm not (directly), so I think
it's the kind of thing that is OK to do.

The search engine owners certainly ignored such concerns when engines
started to be built and went ahead and built their collections. It is to
all our benefit that they ignored such questions or worries.

At 15:47 13/06/03 -0400, William Mann wrote:
>Without making the problem more difficult, I want to point out that very
>similar problems arise in discourse linguistics, where the objects of study
>are connected texts, often necessarily whole texts.
>
>If a researcher wants to make claims about a whole text, for example about
>how coherence arises,  it is often necessary to exhibit the whole text so
>that such claims are examinable.  And just as for Corpus Linguistics, the
>texts cannot be made examinable like sentences in a grammar paper, because
>bulk prohibits such large citations.
>
>There has been a lot of implicit reliance on   "fair use,"  accompanied by
>circulation on the internet.  It would be hard for discourse linguistics to
>achieve open discussion of results and evidence without something like this.
>==================
>
>There is another locus of examination which might turn out to be very
>relevant.  I know about it, but not the details.  The Oxford Text Archive
>promotes the protection and circulation of extensive works.   They put a lot
>of effort into these issues, including copyright legalities,  not
>diminishing the rights of a contributor of a piece, and not creating
>unjustified claims of rights for the Archive itself.
>
>The result is a multipage License agreement that potential submitters agree
>to.
>
>They are at http://ota.ahds.ac.uk/ .
>
>I agree with Doug Cooper that we ought to take a stance.  But who is "we"?
>
>Perhaps one of the new departments of corpus science could take leadership
>on this.  It would give it an air of professionalism.
>
>Bill Mann
>
>----- Original Message -----
>From: "Doug Cooper" <doug at th.net>
>To: <corpora at hd.uib.no>
>Sent: Friday, June 13, 2003 2:22 PM
>Subject: Re: [Corpora-List] Legal aspects of compiling corpora
>
>
>| At 14:40 13/6/03 +0100, Mark Sanderson wrote:
>| >  I think the honest answer is that it is a question with no clear answer.
>|
>| Not so clear.  The original query was whether a 100-
>| character citation of a text would be a copyright violation.
>| Is there a copyright law anywhere that does not grant
>| "fair use" rights to this sort of minimal citation in all but
>| pathological cases (eg. extremely short texts like song
>| lyrics, or perhaps many consecutive citations of a
>| single text)?
>|
>|   In any case, this question comes up periodically, and the
>| response is almost invariably something along the lines of
>| 'well, you'll probably get away with it.'
>|
>|   I am rather surprised that the corpus-using community has
>| not come out with a position statement -- not everybody has
>| to sign on to it, of course --  that articulates the point of view
>| that:
>|
>|    a) distributing minimal citations of copyrighted texts, and
>|    b) allowing public, indirect access to privately held collections
>|        of copyrighted texts for statistical purposes
>| are:
>|    a) a necessary part of corpus linguistics research, and
>|    b) believed by CL practitioners to be inherently protected
>|     as fair use, particularly in non-profit research contexts.
>|
>| and perhaps also gives a few examples of what might _not_
>| be considered professional conduct; eg. making full texts
>| available or easily reconstructed.
>|
>|   It seems to me that such a statement would be useful in:
>|
>|    a) helping to clarify that CL applications promote the
>|       'Progress of Science;' ie. are a genuine research use;
>|    b) helping individual researchers show that they are
>|       acting in good faith, in accordance with others in the
>|       profession.
>|
>|   Obviously, a bunch of us getting together and saying that
>| black is white won't make it so.  But to the extent that there
>| _is_ a possible gray area in the balance between copyright
>| and fair use, I think it is important to start to establish our side's
>| position as well.
>|
>|   Doug Cooper
>|


