[Corpora-List] Legal aspects of compiling corpora

Tue Jun 17 14:38:09 UTC 2003

I have been following this very interesting discussion on legal aspects of
corpus compilation, and I agree with some of the points made, specifically
that
- for plain research purposes, I believe a simple reference to the author and
publication details should be enough (and legal). In some countries, I
believe, copyright law specifies that copyright restrictions (royalties,
etc.) do not apply to text used in the production of educational material
(e.g. inclusion of a poem in a language teaching textbook).

- the problem arises when the corpus is to be used for purposes other than
research, and especially for profit-making purposes (NLP products,
dictionaries, etc.).
What we have done for the compilation of the Hellenic National Corpus
(http://corpus.ilsp.gr/ - it is all in Greek!), which is accessible over the
Internet, is to sign agreements with every single source that gave us their
texts (for free), stating that we will not reproduce or sell their data as
such, and that their data will be part of a corpus which will be available
for research only. This means that we are not free to distribute raw data,
but only to provide access to the data via our tool.

We also sign agreements with all users of the corpus, which bind them to
research use. We have randomly selected chapters / divisions of the
original texts, in order to limit the possibility of full reproduction by
the users. Furthermore, the users get fragments of text (sentences or, at
the most, paragraphs) and never the whole text.
This is a terribly time- and effort-consuming job, and it also limits the
range of sources: texts available over the Internet but with no obvious
author / publisher / copyright holder are out of the question for us,
because we cannot sign the agreement!
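
The random selection of chapters / divisions and the fragment-only access
described above could be sketched roughly as follows. This is only an
illustration, not the actual code of our tool: the function names
(sample_divisions, fragments), the fraction parameter and the naive
sentence splitting on ". ! ?" are all assumptions made for the example.

```python
import random
import re


def sample_divisions(divisions, fraction, seed=None):
    """Randomly keep a fraction of a text's chapters / divisions.

    Hypothetical helper: 'fraction' (between 0 and 1) is an assumed
    parameter, not something specified in the agreements.
    """
    if not divisions:
        return []
    rng = random.Random(seed)
    # Keep at least one division, and never more than exist.
    k = min(len(divisions), max(1, round(len(divisions) * fraction)))
    chosen = sorted(rng.sample(range(len(divisions)), k))
    # Preserve the original order of the selected divisions.
    return [divisions[i] for i in chosen]


def fragments(division, unit="sentence"):
    """Split one division into the units a user may see.

    Paragraphs are assumed to be separated by blank lines; sentences are
    split naively after '.', '!' or '?', which is only an approximation.
    """
    if unit == "paragraph":
        parts = division.split("\n\n")
    else:
        parts = re.split(r"(?<=[.!?])\s+", division)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    book = ["A one. A two.", "B one.", "C one! C two?", "D one."]
    for division in sample_divisions(book, fraction=0.5, seed=0):
        print(fragments(division))
```

Only the sampled divisions are ever stored, and a query returns items from
fragments() rather than a whole division, so the full text cannot be
reassembled from the corpus alone.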

Maria