[Corpora-List] Legal aspects of compiling corpora
Mark Sanderson
m.sanderson at sheffield.ac.uk
Tue Jun 17 14:51:51 UTC 2003
Google does have pages outlining how to have content removed from its
collections
http://www.google.com/remove.html
which towards the bottom mentions removal of images because of Digital
Millennium Copyright Act problems.
In searching around I also found this web page which seems to imply that
people do want images removed
http://www.chillingeffects.org/dmca512/notice.cgi?NoticeID=565
Now I don't think this happens with text simply because old bits of ASCII
aren't perceived to have as much value as images tend to have.
I'm sure one of the reasons why people like TREC and others can negotiate
copyright release deals to build corpora or test collections is that the
owners don't perceive their data as having great value, and so they are
willing to live with the risk of having the material copied illegally once
they have released it.
You'll notice there are very few image test collections with interesting
content because IR people have struggled to find image owners willing to
let their images go.
So my feeling is that yes, collecting text may be illegal, but it is in
general of so little value (compared to other media) that people are
unlikely to sue you.
At 08:54 17/06/2003 -0700, Mark Davies wrote:
>When I was compiling the 100 million word Corpus del Español
>(www.corpusdelespanol.org), I
>consulted two professors from the US who are experts on copyright law, as
>applied to the
>Internet. I explained to them that in my corpus, at least, end users
>wouldn't have access
>to entire paragraphs of text, much less an entire text itself. Both were
>in agreement
>that it would be quite unlikely that there would be any copyright problems.
>
>What has me intrigued with search engines like Google, however, is their
>"cached web page"
>functionality, in which they are in essence reproducing an entire web page
>-- and all of
>the web pages of a given site (assuming no use of robots.txt). It seems
>that this is much
>more than the limited context that I (and others) make available in our
>corpora, and yet
>there has been no legal challenge.
>
>On the other hand, both of the professors who I consulted mentioned that
>it's still a very
>murky issue with little or no clearly defined legal precedent -- at least
>in the US.
>
>Mark Davies
>
>=================================================
>Mark Davies
>Assoc. Prof., Spanish Linguistics
>Illinois State University
>http://mdavies.for.ilstu.edu/
>
>** Corpus design and use // Web-database scripting **
>** Historical and dialectal Spanish and Portuguese syntax **
>=================================================
_________________________________________________________________________
Mark Sanderson, Room 303 Tel: +44 (0) 114 22 22648
Department of Information Studies Fax: +44 (0) 114 27 80300
University of Sheffield, Regent Court, mailto:m.sanderson at shef.ac.uk
211 Portobello St., Sheffield, S1 4DP, UK http://dis.shef.ac.uk/mark/
_________________________________________________________________________
Good judgement comes from experience, experience comes from bad judgement