[Corpora-List] Legal aspects of compiling corpora

Tue Jun 17 15:54:50 UTC 2003

When I was compiling the 100 million word Corpus del Español (www.corpusdelespanol.org), I
consulted two professors from the US who are experts on copyright law, as applied to the
Internet.  I explained to them that in my corpus, at least, end users wouldn't have access
to entire paragraphs of text, much less an entire text itself.  Both were in agreement
that it would be quite unlikely that there would be any copyright problems.

What has me intrigued with search engines like Google, however, is their "cached web page"
functionality, in which they are in essence reproducing an entire web page -- and all of
the web pages of a given site (assuming no use of robots.txt).  It seems that this is much
more than the limited context that I (and others) make available in our corpora, and yet
there has been no legal challenge.

On the other hand, both of the professors whom I consulted mentioned that it's still a very
murky issue with little or no clearly defined legal precedent -- at least in the US.

Mark Davies

=================================================
Mark Davies
Assoc. Prof., Spanish Linguistics
Illinois State University
http://mdavies.for.ilstu.edu/

** Corpus design and use // Web-database scripting **
** Historical and dialectal Spanish and Portuguese syntax **
=================================================