[Corpora-List] Legal aspects of compiling corpora

Fri Jun 13 19:47:20 UTC 2003

Without making the problem more difficult, I want to point out that very
similar problems arise in discourse linguistics, where the objects of study
are connected texts, often necessarily whole texts.

If a researcher wants to make claims about a whole text, for example about
how coherence arises, it is often necessary to exhibit the whole text so
that such claims are examinable.  And just as in Corpus Linguistics, the
texts cannot be made examinable like sentences in a grammar paper, because
their sheer bulk rules out such large citations.

There has been a lot of implicit reliance on "fair use," accompanied by
circulation on the internet.  It would be hard for discourse linguistics to
achieve open discussion of results and evidence without something like this.
==================

There is another place where these issues have been examined, which might
turn out to be very relevant.  I know of it, but not the details.  The
Oxford Text Archive promotes the protection and circulation of extensive
works.  They put a lot of effort into these issues, including copyright
legalities, not diminishing the rights of a contributor of a piece, and not
creating unjustified claims of rights for the Archive itself.

The result is a multipage License agreement that potential submitters agree
to.

They are at http://ota.ahds.ac.uk/ .

I agree with Doug Cooper that we ought to take a stance.  But who is "we"?

Perhaps one of the new departments of corpus science could take leadership
on this.  It would give it an air of professionalism.

Bill Mann

----- Original Message -----
From: "Doug Cooper" <doug at th.net>
To: <corpora at hd.uib.no>
Sent: Friday, June 13, 2003 2:22 PM
Subject: Re: [Corpora-List] Legal aspects of compiling corpora

| At 14:40 13/6/03 +0100, Mark Sanderson wrote:
| >  I think the honest answer is that it is a question with no clear
| >  answer.
|
| Not so clear.  The original query was whether a 100-
| character citation of a text would be a copyright violation.
| Is there a copyright law anywhere that does not grant
| "fair use" rights to this sort of minimal citation in all but
| pathological cases (e.g. extremely short texts like song
| lyrics, or perhaps many consecutive citations of a
| single text)?
|
|   In any case, this question comes up periodically, and the
| response is almost invariably something along the lines of
| 'well, you'll probably get away with it.'
|
|   I am rather surprised that the corpus-using community has
| not come out with a position statement -- not everybody has
| to sign on to it, of course -- that articulates the point of view
| that:
|
|    a) distributing minimal citations of copyrighted texts, and
|    b) allowing public, indirect access to privately held collections
|        of copyrighted texts for statistical purposes
| are:
|    a) a necessary part of corpus linguistics research, and
|    b) believed by CL practitioners to be inherently protected
|     as fair use, particularly in non-profit research contexts.
|
| and perhaps also gives a few examples of what might _not_
| be considered professional conduct; e.g. making full texts
| available or easily reconstructed.
|
|   It seems to me that such a statement would be useful in:
|
|    a) helping to clarify that CL applications promote the
|       'Progress of Science;' i.e. are a genuine research use;
|    b) helping individual researchers show that they are
|       acting in good faith, in accordance with others in the
|       profession.
|
|   Obviously, a bunch of us getting together and saying that
| black is white won't make it so.  But to the extent that there
| _is_ a possible gray area in the balance between copyright
| and fair use, I think it is important to start to establish our side's
| position as well.
|
|   Doug Cooper
|