[Corpora-List] Re: Google Books, copyrights, and corpora

Fri Jun 16 16:54:28 UTC 2006

> What are the implications of this for corpus creation and use?
> If Google wins, does it mean that we can include *ANY* texts in a corpus,
> as long as the end user only has access to short KWIC entries
> (especially if the search interface prevents them from "chaining"
> these together to re-create larger strings of text)?

We've created a parallel corpus of English-Hungarian bitexts and 
published it on the web after shuffling the texts:

"Some raw materials used for the Hunglish corpus are under copyright 
(literature, film subtitles, magazines). We prevented the illegal use of 
copyrighted material by shuffling the texts at sentence level. This form 
is still useful for research purposes, while it does not infringe upon 
the rightholders' interests. If you are a copyright holder, and you 
consider the shuffled files infringing, please send email and we will 
remove the material in question from the corpus.

The Hunglish corpus is open for use (with the above restrictions) under 
a Creative Commons Attribution licence."
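
To illustrate what sentence-level shuffling means for a bitext, here is a 
minimal sketch in Python. It is not the actual Hunglish tooling: the function 
name shuffle_bitext and the toy sentence pairs are invented for this example, 
and it assumes the corpus is already available as a list of aligned 
(English, Hungarian) sentence pairs.

```python
import random


def shuffle_bitext(pairs, seed=None):
    """Return the aligned sentence pairs in random order.

    Each (source, target) pair is kept together, so the sentence
    alignment survives, but the original order of the sentences
    (and with it the running text) is lost.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return shuffled


# Toy data, invented for illustration only.
pairs = [
    ("Good morning.", "Jó reggelt."),
    ("Thank you.", "Köszönöm."),
    ("Goodbye.", "Viszontlátásra."),
    ("Yes.", "Igen."),
]

shuffled = shuffle_bitext(pairs)
for english, hungarian in shuffled:
    print(english, "\t", hungarian)
```

In a setup like this one would presumably pool the sentence pairs of many 
documents before shuffling, so that no single work can be pieced back 
together from its own sentences. If a fixed seed is used, it should stay 
private, since the seed and the list length are enough to recompute and 
invert the permutation.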

peter