[Corpora-List] Google Books, copyrights, and corpora

Thu Jun 15 09:27:19 UTC 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

(This message originally only found its way to Chris Brew as my e-mail
software played a trick on me. Here it comes again.)

Chris Brew wrote:

>> The technical question is "Is it possible to reconstruct the full
>> text from snippets of concordance?". The answer to this depends on
>> how snippets are selected. The answer will be "yes" if, for every
>> token in the full text, there is some query that would return that
>> token, along with enough context to allow the snippets to be sewn
>> back together. You would be about as certain that the text was
>> right as you are when you solve a cryptogram. While this is less
>> than complete mathematical certainty, it would probably convince a
>> judge. The answer might be "no" if there are enough tokens that
>> Google can guarantee will never appear in a snippet.

It depends on many things. Reading a whole corpus off from snippets will,
in my opinion, never be completely impossible, but there are things that
can make life difficult for "corpus robbers".
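
To make the "sewing back together" concrete, here is a minimal sketch of
how overlapping snippets could be stitched. It is only an illustration:
the function names, the whitespace tokenisation and the minimum overlap
of three tokens are my own assumptions, not anything Google or a real
concordancer does. It greedily joins the two pieces with the longest
suffix/prefix overlap until nothing overlaps any more.

```python
def overlap(left, right, min_overlap):
    """Return the length of the longest suffix of `left` that is also a
    prefix of `right`, or 0 if it is shorter than `min_overlap` tokens."""
    max_k = min(len(left), len(right))
    for k in range(max_k, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def stitch(snippets, min_overlap=3):
    """Greedily merge snippets (strings) that share at least `min_overlap`
    tokens at their edges. Returns the remaining pieces as strings.

    Assumes min_overlap >= 1. Snippets that lie fully inside another
    snippet are not specially handled in this sketch.
    """
    pieces = [s.split() for s in snippets]
    merged = True
    while merged and len(pieces) > 1:
        merged = False
        best = (0, None, None)  # (overlap length, left index, right index)
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k:
            joined = pieces[i] + pieces[j][k:]
            pieces = [p for n, p in enumerate(pieces) if n not in (i, j)]
            pieces.append(joined)
            merged = True
    return [" ".join(p) for p in pieces]


snippets = [
    "the quick brown fox jumps",
    "brown fox jumps over the lazy",
    "over the lazy dog sleeps",
]
result = stitch(snippets)
```

If some stretch of text never shows up in any snippet, the pieces on
either side of it simply stay separate, which is the "no" case Chris
describes.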

If users have to log in with a user name and password, and accounts are
handed out manually, then you can keep track of what each user is doing.

(Web applications have the concept of sessions, which allows this, but
sessions are easy to cheat on when you don't have to log in. Anyway,
sessions let you collect tracking information for as long as a user
keeps working in one.)

So if you have

- - manually inspected user accounts
  (no automatically created dummy users.)
- - session management in the web application
  (to keep track.)
- - some decent software that detects "corpus robbers" (e.g. by
  their special behaviour, like submitting many queries in
  a short period of time) and gives them a nice hint
  plus some minutes of blocking time

... you will have a good chance that the corpus as a whole stays in the
hands of the copyright owners.
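
A minimal sketch of the third item, assuming that "special behaviour"
simply means too many queries from one logged-in user within a sliding
time window. The class name, the thresholds and the idea of passing the
current time in as a plain number of seconds are all my own assumptions
for illustration; a real web application would hook this into its
session handling.

```python
from collections import defaultdict, deque


class QueryThrottle:
    """Block users who submit more than `max_queries` queries within
    `window_seconds`, for `block_seconds` (all values are made up)."""

    def __init__(self, max_queries=20, window_seconds=60.0, block_seconds=300.0):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._history = defaultdict(deque)  # user -> timestamps of recent queries
        self._blocked_until = {}            # user -> time at which the block ends

    def allow(self, user, now):
        """Return True if `user` may run a query at time `now` (seconds)."""
        if now < self._blocked_until.get(user, 0.0):
            return False  # still serving blocking time
        history = self._history[user]
        # Forget queries that have fallen out of the sliding window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        history.append(now)
        if len(history) > self.max_queries:
            # Too many queries in a short period: start a block.
            self._blocked_until[user] = now + self.block_seconds
            history.clear()
            return False
        return True


throttle = QueryThrottle(max_queries=3, window_seconds=10.0, block_seconds=60.0)
answers = [throttle.allow("robber", t) for t in (0.0, 1.0, 2.0, 3.0, 30.0, 63.0)]
```

The "nice hint" would be whatever message the application shows when
allow() returns False. Keying on the user name rather than on a session
cookie is the point of requiring manually inspected accounts.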

BUT: Will this convince the judges in your country? IANAL.

Even though we don't have a Burger King restaurant in the town where I
live, I'd take Mark Line's statement seriously. In Germany, we have
quite a few lawyers who specialise in legal blackmail. Usually people
get into trouble for having song lyrics, copyrighted poems, or other
literature-like material online.[1] So why not linguists?

And these people do get rich by suing "poor" people. Form letters are
cheap, and the "poor" guys would rather pay the "fee" than pay their own
lawyer.

So for Garage Corpus Computing Inc, this can mean an early end.

Some 5¢ from me & myself.

Best,

  Niels Ott

(CL student, Tübingen Univ)

[1]: If there were a German web site hosting a public archive of the
Corpora list, I could just as well get the site owner into trouble by
writing here: Chancellor M***** is a BEEEEP. Or something.

- --
Me & Myself: http://www.drni.de/niels/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)

iD8DBQFEkSf3bosnVosUgx0RApY0AKCffXx1bRe2wlTRboW1Udhx+YPlEgCeMos2
2HdfJsKQdkKmMTPclyo9fOg=
=x4bF
-----END PGP SIGNATURE-----