<html>

<body>

Google does have pages outlining how to have content removed from its

collections:<br><br>

<x-tab>        </x-tab><a href="http://www.google.com/remove.html" eudora="autourl">http://www.google.com/remove.html</a><br><br>

which, towards the bottom, mentions removal of images because of Digital

Millennium Copyright Act problems.<br><br>

In searching around, I also found this web page, which seems to imply that

people do want images removed:<br><br>

<x-tab>        </x-tab><a href="http://www.chillingeffects.org/dmca512/notice.cgi?NoticeID=565" eudora="autourl">http://www.chillingeffects.org/dmca512/notice.cgi?NoticeID=565</a><br><br>

Now, I don't think this happens with text, simply because old bits of ASCII

aren't perceived to have as much value as images tend to have.<br><br>

I'm sure one of the reasons why people like TREC and others can negotiate

copyright release deals to build corpora or test collections is that the

owners don't perceive their data as having great value, and so they are willing

to live with the risk of having the material copied illegally once they

have released it.<br><br>

You'll notice there are very few image test collections with interesting

content because IR people have struggled to find image owners willing to

let their images go.<br><br>

So my feeling is that yes, collecting text may be illegal, but it is in

general of so little value (compared to other media) that people are

unlikely to sue you.<br><br>

<br><br>

At 08:54 17/06/2003 -0700, Mark Davies wrote:<br>

<blockquote type=cite class=cite cite>When I was compiling the 100

million word Corpus del Español

(<a href="http://www.corpusdelespanol.org/" eudora="autourl">www.corpusdelespanol.org</a>),

I<br>

consulted two professors from the US who are experts on copyright law, as

applied to the<br>

Internet.  I explained to them that in my corpus, at least, end

users wouldn't have access<br>

to entire paragraphs of text, much less an entire text itself.  Both

were in agreement<br>

that it would be quite unlikely that there would be any copyright

problems.<br><br>

What has me intrigued with search engines like Google, however, is their

"cached web page"<br>

functionality, in which they are in essence reproducing an entire web page

-- and all of<br>

the web pages of a given site (assuming no use of robots.txt).  It

seems that this is much<br>

more than the limited context that I (and others) make available in our

corpora, and yet<br>

there has been no legal challenge.<br><br>

On the other hand, both of the professors whom I consulted mentioned that

it's still a very<br>

murky issue with little or no clearly defined legal precedent -- at least

in the US.<br><br>

Mark Davies<br><br>

=================================================<br>

Mark Davies<br>

Assoc. Prof., Spanish Linguistics<br>

Illinois State University<br>

<a href="http://mdavies.for.ilstu.edu/" eudora="autourl">http://mdavies.for.ilstu.edu/</a><br><br>

** Corpus design and use // Web-database scripting **<br>

** Historical and dialectal Spanish and Portuguese syntax **<br>

=================================================</blockquote>

<x-sigsep><p></x-sigsep>

<font face="Courier, Courier">_________________________________________________________________________<br>

Mark Sanderson, Room

303                  

Tel: +44 (0) 114 22 22648<br>

Department of Information

Studies          Fax: +44

(0) 114 27 80300<br>

University of Sheffield, Regent Court,    

<a href="mailto:m.sanderson@shef.ac.uk" eudora="autourl">mailto:m.sanderson@shef.ac.uk</a><br>

211 Portobello St., Sheffield, S1 4DP, UK 

<a href="http://dis.shef.ac.uk/mark/" eudora="autourl">http://dis.shef.ac.uk/mark/</a><br>

_________________________________________________________________________<br>

Good judgement comes from experience, experience comes from bad

judgement<br>

</font></body>

</html>