<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#6600CC">

    Dear Damir, Colleagues<br>

    <br>

    If I may put in my two cents (or my two euros!) in this discussion:<br>

    1) I assume that we have to interpret the previous message from a US

    perspective. Since you are in the US, fair use is a very common

    doctrine there; unfortunately, you cannot use such arguments in a large

    number of countries, in particular in Europe. The UK has recently moved

    in this direction, but the situation is not clear yet.<br>

    2) In fair use, the first of the four factors is <b>(1) the

      purpose and character of the use, including whether such use is of

      a commercial nature or is for nonprofit educational purposes</b>,

    so for research and educational purposes (including sharing data for

    research) I second Orion's arguments.<br>

    <br>

    3) The common misinterpretation is to ask "can you trace the sources of my

    'list of words' or whatever 'derived outcomes' back to the copyrighted

    data?" My legal advisors often insist that in "copyright

    infringement" the key word is "copy", so as soon as you copy data

    (which is what we do when we crawl, harvest, etc.) you are already

    infringing copyright law.<br>

    <br>

    But the other key word that my lawyers utter even more often is "risk":

    what is the risk (i.e. what would copyright

    owners gain by suing you)? I guess you are the only one who can assess

    this; in Europe everyone is trying to sue Google, Microsoft,

    etc.!<br>

    <br>

    All the best<br>

    Khalid<br>

    <br>

    <br>

     <br>

    <div class="moz-cite-prefix">On 2015-01-06 07:15, Orion Montoya

      wrote:<br>

    </div>

    <blockquote

cite="mid:CAE7YMx8t5o0UR8H6rgyHJwTFYVvc_PUumQXY-UiH2bBypiDUgA@mail.gmail.com"

      type="cite">

      <div dir="ltr">Word lists and frequency profiles would seem to be

        safely in the realm of fair use: <a moz-do-not-send="true"

          href="http://en.wikipedia.org/wiki/Fair_use" target="_blank">http://en.wikipedia.org/wiki/Fair_use</a>

        . The Google Books Ngrams data, distributed up to 5-grams by

        Google, are one example of people distributing rather high-N

        ngrams. Of course Google fought with the Authors Guild over

        Google Books in general, but I don't recall this distribution of

        ngram data being part of their fight, and in the end the Authors

        Guild didn't win the obscurity they were pleading for. For

        another example, <a moz-do-not-send="true"

          href="http://commoncrawl.org/" target="_blank">http://commoncrawl.org/</a>

        distributes a massive crawl of the web for researchers (or

        anybody) which is far more wholesale copying+redistribution than

        you're proposing, but they follow the normal rules that

        webcrawlers follow and are doing just fine (and are a very

        useful resource!). 

        <div><br>

        </div>

        <div>So I would personally have zero legal worry about what

          you're proposing. I would have no qualms about either academic

          research or commercial applications (or commercial

          distribution) of that derived data. Adam is (as usual) right

          that you shouldn't even ask anybody for permission.

          <div><br>

          </div>

          <div>The thing about fair use that can make university lawyers

            uncomfortable is that it's an "affirmative defense" -- you

            can argue it in court if someone sues you, but there's no

            guarantee that you can use it to stay out of court in the

            first place, which can be expensive.</div>

          <div><br>

          </div>

          <div>But the other thing about the fair use defense is that,

            in order for you to use it, somebody needs to be able to

            claim that you're infringing their copyright in the first

            place. If you're just distributing frequency lists, there's

            no trace of a copyrighted work to be found; even at the

            5-gram level, it's very hard to find any actionable

            infringement: the fourth factor to be considered in

            evaluating fair use is "the

              effect of the use upon the potential market for or value

              of the copyrighted work", and in your case that effect

              should be exactly nil.</div>

          <div><br>

          </div>

        </div>

        <div>You could save yourself a bit of busywork, and maybe offer

          your university's lawyers some psychological insulation from

          legal risk, by using existing corpora resources like Common

          Crawl.</div>

        <div><br>

        </div>

        <div>Part C of your question --- "are there jurisdictions where

          this might be illegal" --- is the fuzziest to answer; the

          Berne Convention allows signatory countries to define fair use

          for themselves, so there might be jurisdictions where this

          could be risky, but they're probably places for which it's

          challenging to get a visa anyway. I am not a lawyer, just a

          copyright geek and a subscriber of "5 Useful Articles" by

          Parker Higgins &amp; Sarah Jeong, <a moz-do-not-send="true"

            href="http://tinyletter.com/5ua">http://tinyletter.com/5ua</a>

          , an amusing and edifying weekly email about the inherent

          comedy of US IP law in the 21st century.</div>

        <div><br>

        </div>

        <div>Cheers,</div>

        <div><br>

        </div>

        <div>Orion</div>

        <div class="gmail_extra"><br>

          <div class="gmail_quote">On Mon, Jan 5, 2015 at 9:30 PM, Adam

            Kilgarriff <span dir="ltr">&lt;<a moz-do-not-send="true"

                href="mailto:adam.kilgarriff@sketchengine.co.uk"

                target="_blank">adam.kilgarriff@sketchengine.co.uk</a>&gt;</span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px

0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

              <div dir="ltr">Dear Damir,

                <div><br>

                </div>

                <div>a few thoughts:

                  <div><br>

                  </div>

                  <div>In an innocent world view, the law says what is

                    allowed and what is not.  The more I see of how the

                    legal profession works, the clearer it is that it's

                    all political, in the sense that the judgements that

                    build the case law (at least in UK) are made based

                    on how well the lawyers played the game, how much

                    money was involved, who had a sniff of how much

                    money they might make.</div>

                  <div><br>

                  </div>

                  <div>It's not about what is legal (which is always, in

                    this area, underspecified), it is about risk

                    management.</div>

                  <div><br>

                  </div>

                  <div>If no-one sees a money-making opportunity, there

                    is very little legal risk since no-one will take you

                    to court.  </div>

                  <div>If you're a big organisation, you can always be

                    taken to court and sued for large sums.  This has

                    had horrible consequences for the JRC group at

                    Ispra: they are part of the EU, a very large

                    organisation, and have had their work restricted by

                    ambulance-chasing lawyers with a glint in their eyes

                    for winning plump settlements. </div>

                  <div><br>

                  </div>

                  <div>What you might be willing to do personally -

                    given that you are probably not, as an individual,

                    worth suing, and your motivation for doing

                    interesting work is high - is very different to what

                    a (probably) rich organisation like your university

                    might be willing to do.  If you want to do

                    something, don't ask! (Especially not the university

                    lawyers.  You'll probably never get an answer - even

                    more frustrating than a simple 'no'.)</div>

                </div>

                <div><br>

                </div>

                <div>Sorry if that is not very helpful</div>

                <div><br>

                </div>

                <div>Adam</div>

                <div><br>

                </div>

              </div>

              <div class="gmail_extra"><br>

                <div class="gmail_quote">

                  <div>

                    <div>On 6 January 2015 at 04:00, Damir Cavar <span

                        dir="ltr">&lt;<a moz-do-not-send="true"

                          href="mailto:dcavar@me.com" target="_blank">dcavar@me.com</a>&gt;</span>

                      wrote:<br>

                    </div>

                  </div>

                  <blockquote class="gmail_quote" style="margin:0px 0px

                    0px

0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

                    <div>

                      <div>Hi everybody,<br>

                        <br>

                        I know, this question has been addressed a lot,

                        but, just to get an<br>

                        update on this issue and your expert opinion:<br>

                        <br>

                        If I am accessing the internet from the US, as I

                        am right now, and I<br>

                        decide to generate N-gram-based language models

                        by exploiting the web as<br>

                        a corpus and publish the word-lists and

                        frequency profiles openly on my<br>

                        homepage, sell them even, change or manipulate

                        them, and reuse them in<br>

                        various ways, would this be<br>

                        <br>

                        a. ok as fair-use for research only, excluding

                        commercial use<br>

                        b. legal in general, independent of my research

                        interests<br>

                        c. legal only in some countries (so, my models

                        would be illegal in some<br>

                        others)<br>

                        <br>

                        What is the current status of the web as a

                        corpus and extracted language<br>

                        models from the legal perspective in the US and

                        globally?<br>

                        <br>

                        If I do the same now with open-access journals

                        and extract frequency<br>

                        profiles of tokens for a certain research

                        domain, would it be the same?<br>

                        If I use Google Books? Or even some news

                        website?<br>

                        <br>

                        Is the extraction of a language model, maybe a

                        domain-specific frequency<br>

                        profile, a copyright infringement per se? The

                        text cannot be<br>

                        reconstructed, the content is not visible, and

                        neither is the author's style,<br>

                        particularly if the corpus is large.<br>

                        <br>

                        Thanks!<br>

                        <br>

                        Damir<br>

                        <br>

                        <br>

                        <br>

                        --<br>

                        Damir Cavar<br>

                        Department of Linguistics<br>

                        Indiana University<br>

                        <br>

                        <br>

                        <br>

                      </div>

                    </div>

                    _______________________________________________<br>

                    UNSUBSCRIBE from this page: <a

                      moz-do-not-send="true"

                      href="http://mailman.uib.no/options/corpora"

                      target="_blank">http://mailman.uib.no/options/corpora</a><br>

                    Corpora mailing list<br>

                    <a moz-do-not-send="true"

                      href="mailto:Corpora@uib.no" target="_blank">Corpora@uib.no</a><br>

                    <a moz-do-not-send="true"

                      href="http://mailman.uib.no/listinfo/corpora"

                      target="_blank">http://mailman.uib.no/listinfo/corpora</a><br>

                    <br>

                  </blockquote>

                </div>

                <span><font color="#888888"><br>

                    <br clear="all">

                    <div><br>

                    </div>

                    -- <br>

                    <div>

                      <div dir="ltr">=============================================<br>

                        <a moz-do-not-send="true"

                          href="http://www.kilgarriff.co.uk/"

                          target="_blank">Adam Kilgarriff</a>          

                               <a moz-do-not-send="true"

                          href="mailto:adam@sketchengine.co.uk"

                          target="_blank">adam@sketchengine.co.uk</a>   

                                                                <br>

                        Director                                    <a

                          moz-do-not-send="true"

                          href="http://www.sketchengine.co.uk/"

                          target="_blank">Lexical Computing Ltd</a>    

                                   <br>

                        Visiting Research Fellow                 <a

                          moz-do-not-send="true"

                          href="http://leeds.ac.uk/" target="_blank">University

                          of Leeds</a>     

                        <div><i><font color="#006600">Corpora for all</font></i> with <a

                            moz-do-not-send="true"

                            href="http://www.sketchengine.co.uk/"

                            target="_blank">the Sketch Engine</a>   and

                               <a moz-do-not-send="true"

                            href="http://skell.sketchengine.co.uk/"

                            target="_blank">SKELL</a>       <i>         

                                 </i></div>

                        <div>=============================================</div>

                      </div>

                    </div>

                  </font></span></div>

              <br>


            </blockquote>

          </div>

          <br>

        </div>

      </div>

      <br>


    </blockquote>

    <br>

    <div class="moz-signature">-- <br>

      <p> ************************************************* <br>

        <b> Khalid CHOUKRI </b> <br>

        ELRA General secretary &amp; ELDA CEO <br>

        email: <a class="moz-txt-link-abbreviated" href="mailto:choukri@elda.org">choukri@elda.org</a> ; Web: <a class="moz-txt-link-abbreviated" href="http://www.elra.info">www.elra.info</a> <a class="moz-txt-link-abbreviated" href="http://www.elda.org">www.elda.org</a> <br>

        Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30 <br>

        *************************************************** <br>

        ** <b> Info on LREC: <a class="moz-txt-link-abbreviated" href="http://www.lrec-conf.org">www.lrec-conf.org</a> <b><br>

            **************************************************** </b></b></p>

    </div>

  </body>

</html>