[Corpora-List] Legal aspects of compiling corpora

Tue Jun 17 15:14:22 UTC 2003

Torzec Nicolas ATER LSI wrote:

> Dear Linguists and Lawyers,
> I have got the same "problem" with a large (tagged) monitor corpus of
> texts from french written on-line forums :
> - these messages are publically available in the sense that everybody
> can read and reuse them

The key term here is "publicly" available.

> - each newsgroup server stores and uses its own copies of them
> - search engines use and exploit cached copies of them
> - ...
>
> So,
> - It is an illegal procedure to store these messages - in an anonymous
> way - in a database ?

Why should it be illegal if none of the participants are identified?  I have
also downloaded and stored hundreds of chat messages from Bulletin Boards and
"notified" the owners of the bulletin boards.  Fortunately, one had deleted
all its messages when it changed its format.  I do not delete politicians'
names.  In the US, you can write and say things about people in public office
and they cannot sue you unless you deliberately accuse someone of stealing or
doing something improper without any proof.  If you defame them knowing that
what you are saying is false, they can certainly sue you for slander and
libel.

> - It is an illegal procedure to exploit this corpus for research
> purposes ? (i.e. to realise linguistic studies and to develop NLP
> processing using corpus-based machine learning methods)

This falls under fair use, at least in the US.

> - It is an illegal procedure to illustrate scientific articles with
> examples from this corpus ?

You need a lawyer to clarify this.

>
> Do I need to ask permission for each author to store and use its
> messages ? What if I mention the source and the author ? What about the
> copyrights?

If you identify the chat list/Bulletin Board and use the participants' real
names, you ought to ask permission to do so.  Copyrights are usually held by
the owners of the chat list or bulletin board.

>
> Moreover,
> - What if I want to make my corpus publically available for researchers
> ?
> - What if NLP processing developed from this corpus are to be integrated
> in commercial products ?

This is where things become problematic.  I am all in favor of "open
architecture" and sharing knowledge, but when people decide to charge for
their products, we have all kinds of problems.  (The "greed" or profit
factor.)  I would prefer to create my own "specialized corpus" and share my
findings with others.  Unfortunately, you cannot "generalize" findings based
on specialized corpora.

>
> Thank you in advances for your help...
> References, pointers and suggestions are welcome, especially for the
> legal aspects for France...

Sorry, I know nothing about French copyright laws.

>
> Nicholas Torzec
>
> --
> Nicolas Torzec
> PhD Student in NLP processing
> --
>
> delucca at nilc.icmc.usp.br wrote:
> >
> > Dear Linguists and Lawyers,
> >
> > I am troubled with Legal aspects of corpora compiling. I am in
> > doubt if is an illegal procedure storage webpages (or part of them)
> > in a database (see at http://www.dictionarium.com/project.htm),
> > not available to public, and display its contents as short collocations
> > less than 100 characters by time by search method.
> >
> > On the other hand, the Internet search engines uses cached (temporary ?)
> > copies of the sites and display a short of the web pages.
> >
> > My procedure is wrong? Which the Legal difference? I need ask permission
> > for each website to storage its pages? If I mention the source and the author
> > I will be protecting the copyrights?
> >
> >
> > I look forward to hearing from you.
> >
> > Yours Sincerely,
> >
> > J. L. De Lucca
> >
>
> --
> Nicolas TORZEC
>
> ENSSAT / Université de Rennes 1
> 6, rue de Kerampont
> 22300 Lannion
>
> Mel : nicolas.torzec at enssat.fr
> Tel : 02.96.46.27.30
> Fax : 02.96.37.01.99
> Web : http://www.enssat.fr
> --