[Corpora-List] Legal aspects of compiling corpora

Tue Jun 17 14:51:51 UTC 2003

Google does have pages outlining how to have content removed from its 
collections

         http://www.google.com/remove.html

which towards the bottom mentions removal of images because of Digital 
Millennium Copyright Act problems.

In searching around I also found this web page which seems to imply that 
people do want images removed

         http://www.chillingeffects.org/dmca512/notice.cgi?NoticeID=565

Now I don't think this happens with text simply because old bits of ASCII 
aren't perceived to have as much value as images tend to have.

I'm sure one of the reasons why people like TREC and others can negotiate 
copyright release deals to build corpora or test collections is that the 
owners don't perceive their data as having great value, and so they are 
willing to live with the risk of having the material copied illegally once 
they have released it.

You'll notice there are very few image test collections with interesting 
content because IR people have struggled to find image owners willing to 
let their images go.

So my feeling is that yes, collecting text may be illegal, but it is in 
general of so little value (compared to other media) that people are 
unlikely to sue you.

At 08:54 17/06/2003 -0700, Mark Davies wrote:
>When I was compiling the 100 million word Corpus del Español
>(www.corpusdelespanol.org), I consulted two professors from the US who are
>experts on copyright law, as applied to the Internet.  I explained to them
>that in my corpus, at least, end users wouldn't have access to entire
>paragraphs of text, much less an entire text itself.  Both were in agreement
>that it would be quite unlikely that there would be any copyright problems.
>
>What has me intrigued with search engines like Google, however, is their
>"cached web page" functionality, in which they are in essence reproducing an
>entire web page -- and all of the web pages of a given site (assuming no use
>of robots.txt).  It seems that this is much more than the limited context
>that I (and others) make available in our corpora, and yet there has been no
>legal challenge.
>
>On the other hand, both of the professors who I consulted mentioned that
>it's still a very murky issue with little or no clearly defined legal
>precedent -- at least in the US.
>
>Mark Davies
>
>=================================================
>Mark Davies
>Assoc. Prof., Spanish Linguistics
>Illinois State University
>http://mdavies.for.ilstu.edu/
>
>** Corpus design and use // Web-database scripting **
>** Historical and dialectal Spanish and Portuguese syntax **
>=================================================

_________________________________________________________________________
Mark Sanderson, Room 303                   Tel: +44 (0) 114 22 22648
Department of Information Studies          Fax: +44 (0) 114 27 80300
University of Sheffield, Regent Court,     mailto:m.sanderson at shef.ac.uk
211 Portobello St., Sheffield, S1 4DP, UK  http://dis.shef.ac.uk/mark/
_________________________________________________________________________
Good judgement comes from experience, experience comes from bad judgement