[Corpora-List] license question

Serge Sharoff S.Sharoff at leeds.ac.uk
Fri Aug 18 16:40:11 UTC 2006


> For example, BBC News online is freely available, but I couldn't crawl
> their site, compile a corpus and then redistribute under a license of my
> choosing!
This is why I advocate distributing an Internet-derived corpus as a list of URLs.  The arguments made in earlier cases against "deep linking" concerned competing services or mistaken identity, and neither applies to corpus distribution.  If the compilation procedure remains constant, the corpus recompiled on the target computer will be almost the same as the original.  More information on the procedure and tools is available from:
http://corpus.leeds.ac.uk/internet.html
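
As a rough illustration only (this is not the code of the tools at the URL above), recompiling a corpus from a distributed URL list might look like the following Python sketch.  The names html_to_text, fetch_url and recompile_corpus are hypothetical, and the text extraction is deliberately naive: it keeps visible text and skips script and style content.  Any real procedure would have to use exactly the same cleaning steps on both sides for the recompiled corpus to match the original.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Naive HTML-to-text conversion (an assumption, not the Leeds procedure)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def fetch_url(url, timeout=30):
    """Download one page and decode it using the charset the server declares."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def recompile_corpus(urls, fetch=fetch_url):
    """Rebuild {url: text} from a URL list; URLs that fail are returned separately."""
    corpus, missing = {}, []
    for url in urls:
        try:
            corpus[url] = html_to_text(fetch(url))
        except Exception:
            # Page removed, server down, etc.: record it rather than abort.
            missing.append(url)
    return corpus, missing
```

The fetch argument is there so that the download step can be swapped out, for instance to read pages from a local cache instead of the live web.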

Of course, corpora recompiled from URL lists will drift away from the original version, because webpages get updated or removed.  For the past year, Marco Baroni and I have been measuring the rate of this drift.  Our initial impression is that it is tolerable for many tasks.  However, drift can be a greater problem for an aligned corpus.  If I add a link to a new project on my home page, its content changes only slightly for the purposes of a monolingual corpus, but the same change could render the page useless in an aligned parallel corpus.
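
To make the notion of drift concrete, here is a minimal Python sketch of one way it could be quantified.  This is an assumption for illustration, not the measure used in the work mentioned above: it reports the fraction of URLs that have disappeared and, for the surviving pages, the Jaccard overlap between the word types of the original and the recompiled text.  The function names and the choice of Jaccard overlap are hypothetical.

```python
def jaccard(text_a, text_b):
    """Jaccard overlap of lowercased word types; 1.0 means identical vocabularies."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def measure_drift(original, recompiled):
    """Compare two {url: text} corpora built from the same URL list."""
    removed = [url for url in original if url not in recompiled]
    similarities = {
        url: jaccard(original[url], recompiled[url])
        for url in original
        if url in recompiled
    }
    changed = [url for url, sim in similarities.items() if sim < 1.0]
    mean_similarity = (
        sum(similarities.values()) / len(similarities) if similarities else 0.0
    )
    return {
        "removed_fraction": len(removed) / len(original) if original else 0.0,
        "changed": changed,
        "mean_similarity": mean_similarity,
    }
```

For a monolingual corpus a high mean similarity may be good enough.  For an aligned parallel corpus any page listed under "changed" would probably need to be realigned or dropped, however small the change.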

> Whilst you may argue that a license like LGPL would ensure that the
> corpus remained Free (that is, redistribution must stay under LGPL and
> any modifications, if distributed, must also be released under LGPL) it
> doesn't prevent people from either charging for the corpus or prevent
> its inclusion within a commercial product. This may not be acceptable to
> the copyright holder who originally intended their materials to be used
> within (not-for-profit?) research only.
However, couldn't this problem be solved with an appropriate license derived from http://creativecommons.org/ ?

My 2p,
S


