[Corpora-List] Legal aspects of compiling corpora

Thu Jun 19 08:49:03 UTC 2003

Dear Corporeans:

For the record, here is my attempt at drafting a basic statement
of professional practice in regard to using text corpora.  It
describes a well-understood (and I hope easily defended)
subset of corpus applications, in service of setting up -- but
not asserting the conclusion of -- the following syllogism:

 - some research use of copyrighted texts is protected by law;
 - here are some ways we use copyrighted texts for research;
 - perhaps our research uses are also protected.

  Please take this in the spirit offered -- as an example of what
a reasonably framed position might look like, and not as an
ideological stance ;-).  I hope it may encourage the articulation
of alternative statements, and/or discussion of the propriety of
taking a position at all.

  Also, given its length, for which I apologize, _please_ don't
automatically include the whole thing in responses.

  The texts of the relevant parts of the US copyright law and
the Berne Convention, and a few of my own comments, follow.

-----------------------------------------------
     A Statement on Research Use of Generic Text Corpora

   This statement on professional practice is intended to help
   researchers act in good faith observance of established
   research practice when assembling or using generic text
   corpora that may include copyrighted materials.

   The statement does not claim that such usage is necessarily
   entitled to a "fair use" or "fair practice" exemption; only that
   the methodology it describes is bona fide research practice.
                   ____________________

A variety of scientific and educational disciplines rely on
studying, or extracting samples from, large bodies of text --
corpora.  We will refer to these as 'generic text corpora'
in order to distinguish them from more specific collections
of particular authors or factual genres (e.g. legal decisions).

  Generic corpora almost invariably include copyrighted texts.
However, because of the "fair use" or "fair practice" rights
granted by typical copyright laws (the US law and the Berne
Convention, respectively), the inclusion of copyrighted
texts in generic corpora does not necessarily entail any
copyright violation.

  The exact line between copyright protection and fair
use/practice rights is intentionally vague.  Neither a claim
of copyright, nor of fair use/practice exception, automatically
trumps the other.

  But while we cannot fix an explicit definition of what research
applications will always qualify as fair use/practice, we can
clearly state that certain kinds of use are bona fide research
practices.

  By definition, generic text corpora are not of interest as
either literary or factual works.  Rather, they are inspected
for one of two basic reasons:

 - to investigate text properties through statistical analysis;
 - to extract and cite small examples, typically < 100
   contiguous characters, that elucidate word or phrase
   syntax, semantics, or other lexical features.

  In the first case, text is not necessarily returned at all; rather,
we return overviews of various text properties.  If the
underlying text is revealed, it is only in a purely factual
manner; e.g. in lists of word or phrase frequency counts.
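
  As a rough illustration of this first kind of use, here is a
minimal Python sketch that reduces a text to word frequency
counts.  The function name word_frequencies and the simple
tokenizing pattern are assumptions made for the example, not
part of any particular corpus toolkit.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent word tokens with their counts.

    Hypothetical helper: tokens are runs of ASCII letters and
    apostrophes, compared case-insensitively.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)


if __name__ == "__main__":
    sample = "The cat and the dog and the bird"
    # Only (word, count) pairs come back, never running text.
    print(word_frequencies(sample, top_n=2))
```

  The output is a list of (word, count) pairs, so the original
word order cannot be recovered from it.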

  In the second, the researcher or student is only interested
in some factual aspect - typically syntax or semantics - of
this particular arrangement of words; e.g. in the citation:

 'single man in possession of a good =>fortune<= must be in want of a wife'

it is a human's ability to understand the semantics of
"fortune" that is of interest, rather than the literary or social
commentary of the context.
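
  A citation of this kind could be produced along the following
lines.  This is only a sketch in Python: the function name kwic,
the constant MAX_CITATION_CHARS, and the choice to centre the
context window on the keyword are all assumptions made for the
example.  The 100-character limit and the =>word<= marking are
taken from the statement above.

```python
import re

# The statement's "typically < 100 contiguous characters".
MAX_CITATION_CHARS = 100


def kwic(text, keyword, max_chars=MAX_CITATION_CHARS):
    """Return keyword-in-context citations for keyword.

    Each citation is drawn from fewer than max_chars contiguous
    characters of text (assuming the keyword itself is shorter
    than max_chars), with the keyword marked as =>keyword<=.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    citations = []
    for match in pattern.finditer(text):
        # Context characters allowed around the keyword.
        budget = max(max_chars - 1 - len(match.group()), 0)
        left = budget // 2
        start = max(match.start() - left, 0)
        end = min(match.end() + (budget - left), len(text))
        citations.append(
            text[start:match.start()]
            + "=>" + match.group() + "<="
            + text[match.end():end]
        )
    return citations


if __name__ == "__main__":
    sentence = (
        "It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune must be in want of a wife."
    )
    for citation in kwic(sentence, "fortune"):
        print(citation)
```

  Because every citation is capped well below the length of the
source, repeated queries return fragments rather than a readable
copy of the work.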

  In either case, the contents of a generic text corpus are
not, and cannot be, read as ordinary texts in the course of
research use.  Moreover, making the contents of a generic
text corpus available in a manner that _might_ let its contents
be reconstructed as literary or factual works is not a typical
research application for generic text corpora.

                       END OF STATEMENT
----------------------------------------------------------
Berne Convention Article 10

(1) It shall be permissible to make quotations from a work which
has already been lawfully made available to the public, provided
that their making is compatible with fair practice, and their
extent does not exceed that justified by the purpose, including
quotations from newspaper articles and periodicals in the form of
press summaries.

(2) It shall be a matter for legislation in the countries of the
Union, and for special agreements existing or to be concluded
between them, to permit the utilization, to the extent justified
by the purpose, of literary or artistic works by way of illustration
in publications, broadcasts or sound or visual recordings for
teaching, provided such utilization is compatible with fair practice.

(3) Where use is made of works in accordance with the preceding
paragraphs of this Article, mention shall be made of the
source, and of the name of the author, if it appears thereon.
===========
US Copyright Law Section 107 Limitations on Exclusive Rights: Fair use

  Notwithstanding the provisions of sections 106 and 106A, the
fair use of a copyrighted work, including such use by reproduction
in copies or phonorecords or by any other means specified by that
section, for purposes such as criticism, comment, news reporting,
teaching (including multiple copies for classroom use), scholarship,
or research, is not an infringement of copyright.

  In determining whether the use made of a work in any particular
case is a fair use the factors to be considered shall include -
 (1) the purpose and character of the use, including whether such use
  is of a commercial nature or is for nonprofit educational purposes;
 (2) the nature of the copyrighted work;
 (3) the amount and substantiality of the portion used in relation
  to the copyrighted work as a whole; and
 (4) the effect of the use upon the potential market for or
  value of the copyrighted work.
---------------------------------------------------------
Comment from Doug Cooper:

Now, beating around the bush aside, it seems to me that any
common-sense reading of the relevant sections of the US law or
the Berne Convention makes it clear that providing on-line
access to generic text corpora is protected.

  While the US law is more explicit, a survey of EU laws notes
that in general, 'fair practice' means copying for personal,
scientific, educational, or other private use, etc.  [Eisenschitz, T.
and P. Turner. 1997. _Rights and Responsibilities in the Digital
Age: Problems with Stronger Copyright in an Information Society_.
Journal of Information Science, 23(3):209-223.]
NB - I couldn't find the article on-line; however, it appears to
be the canonical citation.

  The key point under US law is that _all_ four factors listed
in Section 107 must be taken into account.  Moreover, it appears
to be consistently the case that the _possibility_ of copyright
violation is also only one factor, and may be outweighed by
legitimate fair use applications.

  IMHO, the scorecard for an on-line generic text corpus
used as described above would be:

1. 'purpose and character' - non-commercial research use
   that in general requires transforming (and cannot supersede)
   the original work.
2. 'nature of copyrighted work' - mixed.
3. 'amount and substantiality' - minuscule.
4. 'effect of use on market' - nil; it cannot supersede the
   work, and the financial rewards offered by the 'inclusion
   in generic text corpora' market are presumably zilch.

  As far as I can tell -- and I have gone to the CNI-CopyRight
and BookPeople mailing lists seeking alternative views to no
avail -- simply putting a copyrighted work into a black box
isn't the issue.  Rather, it's the use to which we put that black
box; i.e. the bits of copyrighted text that can be downloaded.

  In closing, I am concerned by suggestions that it is necessary
or even advisable to obtain permissions, and possibly pay
compensation, before using texts in the generic manner
described above.  While this may be a consistent position
for corpus developers who are also publishers, it may
unnecessarily discourage researchers in other environments.

  Best,
  Doug Cooper