<html>

<body>

On Sunday 15/06/2003 at 10:20, Mark Sanderson wrote:<br>

<blockquote type=cite class=cite cite>I think the problem though is that

as great as ELRA and LDC are, I don't see how these agencies are going to

help researchers who want access to very large corpora: Google has

something like 30 terabytes of text. One can access it through their API,

but some researchers may wish to have better quality access to such

amounts of data. The only way I think this can be done is to have

researchers build their own Web collection, which does take on potential

legal problems, but I don't see an alternative, if one wishes to access

very large corpora.</blockquote><br><br>

Then we will face another problem, that of comparing approaches and techniques:

if each of us uses a different corpus (without any possibility of sharing it

with others because of the legal aspects), then no comparison will be

possible.<br><br>

I am sure ELRA and LDC would be happy to launch a terabyte corpus project,

gather the pieces that researchers get from various providers, and

negotiate with those providers. In some cases owners are very cooperative (for

instance the European office of publications, many newspapers, the United

Nations, UNESCO, etc.). Some of these own gigabytes of data, and maybe

with 100-200 providers we can reach a terabyte (and increase this over

time).<br><br>

If the community feels this is what should be done, I can bring this up with

the ELRA and LDC boards and see how we can get funding for it.<br>

All the best<br>

Khalid Choukri<br><br>

<br>

<blockquote type=cite class=cite cite>I had some involvement in pulling

together the 10Gb corpus for TREC a few years ago and one of the problems

that seemed to be happening was that organisers were struggling to find

owners of text who could be negotiated with to provide us with

enough text to build a multi-gigabyte corpus. When TREC wanted to go for

bigger data sets, my impression is that, because of this problem, they

had to move to the Web.<br><br>

At 09:37 15/06/03 +0200, Khalid CHOUKRI wrote:<br>

<blockquote type=cite class=cite cite>Dear Colleagues<br><br>

I am not sure the way suggested by Adam is the right thing to do.<br>

ELRA and LDC have been trying to negotiate such rights in order to

provide the users of corpora with good and legally-cleared resources, and

we are happy to help request and obtain the authorizations to use the data

in a sounder and cleaner legal context.<br><br>

Of course we can all do whatever we feel is fair, and if no one sues us

that is fine. But just imagine that, after five years of work, someone comes

across your publication in which you refer to the data and manages to

prevent you from such reference/publication, or even asks you to delete all

such data...<br><br>

Best regards<br>

Khalid CHOUKRI<br>

European Language Resources Association<br><br>

<br><br>

On Friday 13/06/2003 at 15:36, Adam Kilgarriff wrote:<br>

<blockquote type=cite class=cite cite>On the one hand, if your enemies

are rich enough you'll lose.<br><br>

On the other you're probably less worth suing than Google and they are

still going strong (anyone out there from Google? Your contribution

most welcome), and it doesn't sound like you are doing anything with any

salient legal difference. (Getting authors' agreements takes huge

amounts of resources and isn't feasible; listing references doesn't

help.)<br><br>

People do get unhappy about their pictures and audio being grabbed from

the web for use in other people's databases, and I have heard of cases

of web developers having to rein in their ambitions because objections

have been made. As yet, mercifully, that hasn't happened with text -

people don't seem alarmed at the idea that the text they publish on the

web gets re-used. Let's all pray it stays that way (though sooner or

later we're bound to get chancers trying it on - can't help fearing the

web is in its honeymoon phase, and the racketeers will mess it all up

before too long).<br><br>

In the meantime - take courage! Do it!<br><br>

<br>

Adam<br><br>

<br>

=======================<br>

Adam Kilgarriff<br>

Lexicography MasterClass Ltd:  

<a href="http://www.lexmasterclass.com/" eudora="autourl">http://www.lexmasterclass.com</a>

<br>

adam@lexmasterclass.com <br>

+44 (0)1273 705773<br>

     --and--<br>

ITRI, University of Brighton<br>

Lewes Road, Brighton BN2 0BL, UK<br>

<a href="http://www.itri.brighton.ac.uk/~Adam.Kilgarriff" eudora="autourl">http://www.itri.brighton.ac.uk/~Adam.Kilgarriff</a> <br>

adam@itri.brighton.ac.uk<br>

+44 (0)1273 642919<br>

==============================<br>

World is crazier and more of it than we think,<br>

Incorrigibly plural   <br>

                         ---'Snow', Louis MacNeice<br>

==============================<br>

 <br><br>

> -----Original Message-----<br>

> From: owner-corpora@lists.uib.no [<a href="mailto:owner-corpora@lists.uib.no" eudora="autourl">mailto:owner-corpora@lists.uib.no</a>] On<br>

> Behalf Of delucca@nilc.icmc.usp.br<br>

> Sent: 13 June 2003 13:49<br>

> To: corpora@hd.uib.no<br>

> Subject: [Corpora-List] Legal aspects of compiling corpora<br>

> <br>

> <br>

> Dear Linguists and Lawyers,<br>

> <br>

> I am troubled by the legal aspects of compiling corpora. I am in<br>

> doubt whether it is illegal to store web pages (or parts of them)<br>

> in a database (see http://www.dictionarium.com/project.htm)<br>

> that is not available to the public, and to display their contents as short<br>

> collocations of fewer than 100 characters at a time via a search method.<br>

> <br>

> On the other hand, Internet search engines use cached (temporary?)<br>

> copies of sites and display a short excerpt of the web pages.<br>

> <br>

> Is my procedure wrong? What is the legal difference? Do I need to ask permission<br>

> from each website to store its pages? If I mention the source and the<br>

> author, will I be protecting the copyrights?<br>

> <br>

> <br>

> I look forward to hearing from you.<br>

> <br>

> <br>

> Yours Sincerely,<br>

> <br>

> <br>

> J. L. De Lucca<br>

> <br>

> -------------------------------------------------<br>

> This mail sent through IMP: <a href="http://horde.org/imp/" eudora="autourl">http://horde.org/imp/</a></blockquote><br>

*************************************************************<br>

Khalid CHOUKRI  <a href="mailto:choukri@elda.fr" eudora="autourl">mailto:choukri@elda.fr</a><br>

ELRA CEO<br>

Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30<br>

Postal Mail: 55 Rue Brillat-Savarin, 75013 Paris France<br>

Home page:  <a href="http://www.elda.fr/" eudora="autourl">http://www.elda.fr/</a> or <a href="http://www.elra.info/" eudora="autourl">http://www.elra.info/</a><br>

LREC News: <a href="http://www.lrec-conf.org/" eudora="autourl"><font color="#FF00FF">http://www.lrec-conf.org/</font></a><br>

*************************************************************** </blockquote></blockquote></body>

</html>