[Corpora-List] Legal aspects of corpora compiling

Adam Kilgarriff adam.kilgarriff at itri.brighton.ac.uk
Tue Oct 1 08:19:39 UTC 2002


On 25 Sept Rafal L. Górski asked:
 >
 > Does anybody know about research on legal aspects of corpora compiling
 > (copyright restrictions).

A short answer:

to be unequivocally, completely, totally in the clear you need to get
copyright clearance from all copyright holders (publishers and/or
authors, all speakers for spoken material).  Some will give it to
you, others won't, and it is a lot of work to gather.   (I attended a
rather nice talk on BNC copyright issues titled "Ladies love lupins".
Sometimes, the only way to get the copyright clearance sought was to
take the lady concerned a bunch of flowers.)

HOWEVER

the law is in its infancy and there is very little which is obviously
right or wrong/legal or illegal.  If you have an enemy with rich
enough lawyers, you will always be found in the wrong (cf Napster -
when you're up against the music business it's apparently illegal even
to tell someone where they might find something) so it's
pointless viewing the law as a set of rules.  Rather, you have to
avoid doing things which someone who is rich and inclined to sue is
going to view as provocative. 

Considerations:

1) PUBLISHING

the issue is heavier if you are going to publish or pass on the data
than if you are not.  If it's only for in-house use, then one simple
issue is "who will ever know", and it is not clear that, eg,
downloading a report onto your PC's desktop is any different to
downloading it into a corpus. Copyright law is in general about the
case where someone makes money from selling intellectual property: if
you are going to sell a corpus, the issues need taking very seriously,
as people will be upset by you making money out of selling their text
(unless you give them a share).

2) EXTRACT SIZE

the issue is heavier, the larger the extracts you take.  There is a
traditional exemption from copyright for short extracts, so eg you can
take brief quotes, eg in a review or academic book, without asking
permission.  There are different opinions about how much you can
quote.  You couldn't quote a short poem in full on the grounds that
it has few words, so the definition of 'short' has to do with 'as a
proportion of the whole' as well as absolute length.  As a general
principle, keep extracts short. (In one project,
we used "3000 words or one third of the document, whichever is the
shorter") 
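
For anyone wanting to apply that rule mechanically, here is a rough
sketch in Python.  The function names, the whitespace tokenisation and
the choice to take the extract from the start of the document are all
my assumptions for illustration, not what the project actually did.

```python
def extract_limit(total_words, cap=3000, divisor=3):
    """Maximum number of words to take from a document.

    Implements "cap words or one third of the document, whichever is
    the shorter". Integer division rounds the one-third share down.
    """
    return min(cap, total_words // divisor)


def take_extract(text, cap=3000, divisor=3):
    """Return the leading extract of `text` allowed by the rule.

    Assumption: words are whitespace-separated tokens, and the extract
    is taken from the beginning of the document.
    """
    words = text.split()
    limit = extract_limit(len(words), cap, divisor)
    return " ".join(words[:limit])
```

So a 300-word document yields at most 100 words, and anything over
9000 words is capped at 3000.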

3) BE COOPERATIVE

avoid including anything where there is an explicit reason not to.
In the context of the web, the 'no robots' convention allows
authors to say they don't want their page to be viewed by
robots.  One should also read this as "keep off" from the point of
view of corpus compilation.  Some literary authors are notoriously
litigious.
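
A minimal sketch of honouring that convention when gathering web pages,
assuming it is expressed as robots.txt-style exclusion rules.  It uses
Python's standard urllib.robotparser; the agent name "corpus-builder",
the function name and the sample rules are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_for_corpus(robots_txt, url, agent="corpus-builder"):
    """Return True if the given robots.txt text permits `agent` to fetch `url`.

    `robots_txt` is the text of the site's robots.txt file, already
    downloaded; the agent name is a hypothetical one for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules: every robot is asked to stay out of /private/.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for page in ("http://example.org/public/a.html",
                 "http://example.org/private/b.html"):
        print(page, allowed_for_corpus(SAMPLE_RULES, page))
```

Pages for which this returns False would simply be left out of the
corpus.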



COURSE:  Michael Rundell and I are teaching a short course on

    "Corpus Design and Use"
     =====================

which will cover legal issues and also

= size, balance, "representativeness"
= text formats, data capture
= text type information
= corpus query programs
= methods and measures for comparing corpora

from linguistic and lexicographic perspectives, in Brighton, England,
Mon 2nd to Thurs 5th December 2002.  Bookings open now!

http://www.itri.brighton.ac.uk/lexicom


Adam

-- 
NEW!! MSc and Short Courses in Lexical Computing and Lexicography
Info at

http://www.itri.brighton.ac.uk/lexicom

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff                                
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road                        
Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


