[Corpora-List] Google Books, copyrights, and corpora

Wed Jun 14 15:54:07 UTC 2006

I'd be interested in hearing how Google is going to stop people from 
recreating texts.  My gut feeling is that Google is in the wrong on this 
one.

An anecdote: My old professor of Religious Studies, Martin Abegg, used 
precisely such a concordance to piece together the corpus of Dead Sea 
Scrolls for his Ph.D. dissertation.  A private paper concordance had been 
produced by the team in charge of publishing the scrolls; a few copies of 
that concordance were lent to various institutions.  The one that he used 
was freely available in the stacks of the library at Hebrew Union College. 
I remember him telling us that he used the concordance to piece together 
the texts because he needed just one text, an unpublished one, 
for his dissertation.  After he had assembled the entire corpus of texts 
known at that time, he was strongly encouraged by various people to publish 
all of them, which he eventually did.  He was sued, if memory serves 
correctly, in both an Israeli court and an American one, but I cannot recall 
the outcome of either case.  (Eventually, things worked out for him, as he 
ended up compiling the index volume to the official publication series some 
years later.  A young undergrad, I was paid to check the English 
transliteration of names for the volume.)  Anyway, good luck--and be 
careful.
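
To make the "chaining" worry concrete, here is a minimal sketch of how 
overlapping KWIC-style snippets could be stitched back into running text. 
It assumes each snippet is a short window of words and that neighbouring 
snippets share at least two words; the function names, window sizes, and 
sample sentence are all hypothetical, and this is neither how the Dead Sea 
Scrolls concordance was used nor how Google's snippet view behaves.

```python
def kwic_windows(words, width=5, step=2):
    """Cut a word list into overlapping windows, like lines of a toy concordance."""
    return [words[i:i + width] for i in range(0, len(words) - width + 1, step)]


def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def chain(snippets, min_overlap=2):
    """Greedily merge snippets that overlap the growing text on either side."""
    remaining = [list(s) for s in snippets]
    text = remaining.pop(0)
    merged = True
    while remaining and merged:
        merged = False
        for s in remaining:
            right = overlap(text, s)   # snippet continues the text
            left = overlap(s, text)    # snippet precedes the text
            if right >= min_overlap:
                text = text + s[right:]
            elif left >= min_overlap:
                text = s + text[left:]
            else:
                continue
            remaining.remove(s)
            merged = True
            break
    return text


# Hypothetical sample sentence; snippets are fed in reverse order on purpose.
words = "the quick brown fox jumps over the lazy dog near the old mill".split()
snippets = kwic_windows(words)
rebuilt = chain(snippets[::-1])
print(" ".join(rebuilt))
```

A search interface that wanted to resist this would have to limit how much 
neighbouring snippets overlap, which is the design question Mark raises 
below.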

Nathan Bauman
General English Program,
Sookmyung Women's University
Seoul, South Korea

----- Original Message ----- 
From: "Mark Davies" <Mark_Davies at byu.edu>
To: <corpora at hd.uib.no>
Sent: Thursday, June 15, 2006 12:18 AM
Subject: [Corpora-List] Google Books, copyrights, and corpora

> Most of us are familiar with the Google Books initiative -- the project 
> that will digitize tens of millions of books from several leading 
> libraries (http://books.google.com/intl/en/googlebooks/about.html). Google 
> scans these books and then makes them searchable for end users via the 
> Web.
>
> For copyrighted works, the end users see only a "snippet" view -- similar 
> to what we linguists would call an entry in a KWIC display. This is the 
> line of text containing the word or phrase searched for, and maybe one 
> line of text before and one after.
>
> Google claims that although the entire text is indexed on the server, 
> the end user sees only very limited context, and there is therefore no 
> violation of US Fair Use Law. See 
> http://books.google.com/googlebooks/newsviews/legal.html for their legal 
> claims and http://fairuse.stanford.edu/ for US Fair Use Law.
>
> In 2005 Google was sued by the Association of American Publishers, which 
> claimed that the "snippet defense" is not adequate in this case (see 
> http://publishers.org/press/releases.cfm?PressReleaseArticleID=292). The 
> case is still in litigation.
>
> ---
>
> What are the implications of this for corpus creation and use? If Google 
> wins, does it mean that we can include *ANY* texts in a corpus, as long as 
> the end user only has access to short KWIC entries (especially if the 
> search interface prevents them from "chaining" these together to re-create 
> larger strings of text)? I guess I'm interested in this question right 
> now, as I'm considering the legal implications of using a particular text 
> collection (300+ million words) as part of a historical corpus of English.
>
> In the past, we've discussed copyright and we've discussed Google and 
> we've discussed Google copyright issues (see several CORPORA posts in June 
> 2003 relating to cached web pages). But this discussion was before Google 
> announced the Google Books initiative, and before they announced the 
> "snippet defense", which seems to have clear application to what we're 
> doing (or could do) with corpora.
>
> Any comments?
>
> =================================================
> Mark Davies
> Assoc. Prof., Linguistics
> Brigham Young University
> (phone) 801-422-9168 / (fax) 801-422-0906
> http://davies-linguistics.byu.edu
>
> ** Corpus design and use // Linguistic databases **
> ** Historical linguistics // Language variation **
> ** English, Spanish, and Portuguese **
> =================================================