<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Are we not slightly reinventing the wheel?<br>

    <br>

    The nature of corpora has been discussed for years; EAGLES was about
    defining it. In 2005, John Sinclair enlarged upon the 1996
    definition when he wrote:<br>

    <br>

    <blockquote type="cite">A corpus is a collection of pieces of

      language text in electronic format, selected according to external

      criteria to represent, as far as possible, a language or language

      variety as a source of data for linguistic research.</blockquote>

    Sinclair, J. McH. 2005. ‘Corpus and Text: Basic Principles’. In
    Wynne, M. (ed.). 2005. Developing Linguistic Corpora: A Guide to
    Good Practice. Oxford: AHDS, pp. 1-16.<br>

    <br>

    It is also on the web!<br>

    <br>

    Surely anyone involved in corpora has read the seminal works and
    does not need reminding that corpora are machine-readable, may be
    samples or whole works, etc. What has changed is the rise of internet
    corpora, but here too Kilgarriff and others have commented on the
    situation in a way that both NLP and corpus linguistics users can
    feel at home with.<br>

    <br>

    Best<br>

    <br>

    Geoffrey<br>

    <br>


    <br>

    <br>

    <div class="moz-cite-prefix">On 03/10/2012 18:02, Graham White
      wrote:<br>
    </div>

    <blockquote cite="mid:506C61B2.9040701@eecs.qmul.ac.uk" type="cite">I

      quite agree about machine-readability: the reason that we use the

      Latin word corpus is that the Romans already had corpora, such as

      this one: <a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Corpus_Juris_Civilis">http://en.wikipedia.org/wiki/Corpus_Juris_Civilis</a>

      <br>

      (which is just as good a corpus as anything machine-readable).

      <br>

      <br>

      A corpus should possibly, also, be public and collected for some

      purpose: the books on my bookshelf aren't a corpus, for example,

      but if someone wanted to investigate them as an example of what a

      computer scientist read, then they would be. But it's a hard

      criterion to formulate.

      <br>

      <br>

      Graham

      <br>

      <br>

      On 03/10/12 16:12, Krishnamurthy, Ramesh wrote:

      <br>

      <blockquote type="cite">Hi Yuri

        <br>

        <br>

        <br>

        <br>

        I agree broadly with Adam.

        <br>

        <br>

        <br>

        <br>

        I would add a couple of points for clarification:

        <br>

        <br>

        (i) Some corpus *techniques* (eg word frequency lists,
        collocation) may be applied to any piece of text, eg to a
        single chapter in a novel by Dickens.
        <br>
        <br>
        (ii) The contents of a corpus determine the scope and nature of
        the statements one can make, and the degree of confidence with
        which we can make them: eg a single chapter or even a single
        novel would only allow us to make limited
        statements/suggestions, with a lower degree of confidence; a
        complete collection of his novels would allow us to make more
        general statements about Dickens' novelistic style, with
        greater confidence, and we could for example compare the novels
        and discover developments in his novelistic style from the
        first novel to the last, etc.

        <br>

        <br>

        <br>

        <br>

        Kevin's comment about machine-readable reflects the age we live

        in, and the technology now available to many.

        <br>

        <br>

        I'm not sure about his distinction between 'document collection'

        and corpus, or what kind of annotation he means.

        <br>

        <br>

        For me, a corpus can be unannotated or annotated (eg with
        metadata about each text in the corpus, or POS-tags, semantic
        tags, pragmatic tags, discourse tags, etc).

        <br>

        <br>

        <br>

        <br>

        best

        <br>

        <br>

        Ramesh

        <br>

        <br>

-----------------------------------------------------------------------------------

        <br>

        <br>

        Date: Tue, 2 Oct 2012 19:21:21 +0700

        <br>

        From: "Yuri Tambovtsev" <a class="moz-txt-link-rfc2396E" href="mailto:yutamb@mail.ru">&lt;yutamb@mail.ru&gt;</a>
        <br>
        Subject: [Corpora-List] What is corpora and what is not?
        <br>
        To: <a class="moz-txt-link-rfc2396E" href="mailto:corpora@uib.no">&lt;corpora@uib.no&gt;</a>

        <br>

        <br>

        Dear corpora members, I do not understand what a corpus is and
        what it is not. Is the set of texts of the books by Charles
        Dickens a Dickens corpus? What about the books of Ernest
        Hemingway and other writers? Looking forward to hearing your
        opinion at <a class="moz-txt-link-abbreviated" href="mailto:yutamb@mail.ru">yutamb@mail.ru</a>. Yours sincerely, Yuri Tambovtsev,
        Novosibirsk, Russia

        <br>

        <br>

------------------------------------------------------------------------------------

        <br>

        <br>

        Date: Tue, 2 Oct 2012 15:11:11 +0100

        <br>

        From: Adam Kilgarriff <a class="moz-txt-link-rfc2396E" href="mailto:adam@lexmasterclass.com">&lt;adam@lexmasterclass.com&gt;</a>
        <br>
        Subject: Re: [Corpora-List] What is corpora and what is not?
        <br>
        To: Yuri Tambovtsev <a class="moz-txt-link-rfc2396E" href="mailto:yutamb@mail.ru">&lt;yutamb@mail.ru&gt;</a>

        <br>

        Cc: <a class="moz-txt-link-abbreviated" href="mailto:corpora@uib.no">corpora@uib.no</a>

        <br>

        <br>

        Yuri,

        <br>

        <br>

        A corpus is a collection of texts/speech. We call it a corpus
        when we view it as an object of linguistic or literary
        research. The answers to your questions are yes and yes.

        <br>

        <br>

        Adam

        <br>

        <br>

        ========================================

        <br>

        Adam Kilgarriff <a class="moz-txt-link-rfc2396E" href="http://www.kilgarriff.co.uk/">&lt;http://www.kilgarriff.co.uk/&gt;</a>
        <br>
        <a class="moz-txt-link-abbreviated" href="mailto:adam@lexmasterclass.com">adam@lexmasterclass.com</a>
        <br>
        Director Lexical Computing Ltd <a class="moz-txt-link-rfc2396E" href="http://www.sketchengine.co.uk/">&lt;http://www.sketchengine.co.uk/&gt;</a>
        <br>
        <br>
        Visiting Research Fellow University of Leeds <a class="moz-txt-link-rfc2396E" href="http://leeds.ac.uk">&lt;http://leeds.ac.uk&gt;</a>
        <br>
        <br>
        *Corpora for all* with the Sketch Engine
        <a class="moz-txt-link-rfc2396E" href="http://www.sketchengine.co.uk">&lt;http://www.sketchengine.co.uk&gt;</a>
        <br>
        <br>
        *DANTE: a lexical database for English <a class="moz-txt-link-rfc2396E" href="http://www.webdante.com">&lt;http://www.webdante.com&gt;</a>

        <br>

        <br>

----------------------------------------------------------------------------

        <br>

        <br>

        Date: Tue, 2 Oct 2012 08:59:21 -0600

        <br>

        From: "Kevin B. Cohen" <a class="moz-txt-link-rfc2396E" href="mailto:kevin.cohen@gmail.com">&lt;kevin.cohen@gmail.com&gt;</a>
        <br>
        Subject: Re: [Corpora-List] What is corpora and what is not?
        <br>
        To: Yuri Tambovtsev <a class="moz-txt-link-rfc2396E" href="mailto:yutamb@mail.ru">&lt;yutamb@mail.ru&gt;</a>

        <br>

        Cc: <a class="moz-txt-link-abbreviated" href="mailto:corpora@uib.no">corpora@uib.no</a>

        <br>

        <br>

        Hi, Yuri,

        <br>

        <br>

        Different people have differing definitions of what constitutes
        a corpus. Here are a couple of them:

        <br>

        <br>

        Meyer:
        <br>
        <br>
        "a collection of texts or parts of texts upon which some general
        linguistic analysis can be conducted"
        <br>
        "a body of text made available in computer-readable form for
        purposes of linguistic analysis"

        <br>

        <br>

        McEnery and Wilson:

        <br>

        <br>


        (i) (loosely) any body of text

        <br>

        (ii) (most commonly) a body of machine-readable text

        <br>

        (iii) (more strictly) a finite collection of machine-readable
        text, sampled to be maximally representative of a language or
        variety

        <br>

        <br>

        You'll notice that a common element of the definitions is the
        notion of machine-readability.
        <br>
        <br>
        Some people distinguish between a "document collection" and a
        corpus. In this case, the difference is that a corpus has some
        sort of annotations, while a document collection is a set of
        unannotated documents. Sorry I don't have a citation for this.

        <br>

        <br>

        Kev

        <br>

        <br>

        --

        <br>

        Kevin Bretonnel Cohen, PhD

        <br>

        Biomedical Text Mining Group Lead, Computational Bioscience

        Program,

        <br>

        U. Colorado School of Medicine

        <br>

        303-916-2417 (cell) 303-377-9194 (home)

        <br>

        <a class="moz-txt-link-freetext" href="http://compbio.ucdenver.edu/Hunter_lab/Cohen">http://compbio.ucdenver.edu/Hunter_lab/Cohen</a>

        <br>

        <br>

        <br>

        _______________________________________________

        <br>

        UNSUBSCRIBE from this page:

        <a class="moz-txt-link-freetext" href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a>

        <br>

        Corpora mailing list

        <br>

        <a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>

        <br>

        <a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

        <br>

        <br>

      </blockquote>


    </blockquote>

    <br>

    <div class="moz-signature">-- <br>


      <style type="text/css">

<!--

.Style1 {font-family: Arial, Helvetica, sans-serif}

.Style5 {font-size: 12px}

.Style6 {font-size: 14px}

-->

  </style>

      <p><span class="Style1"><span class="Style6"><strong><br>

              Professor Geoffrey WILLIAMS. MSc, PhD

            </strong><i><br>

              Director of Department for Document Management, Directeur

              du

              Département d'Ingénierie du document<br>

              LiCoRN - HCTI.

            </i></span><br>

------------------------------------------------------------------------<br>

          <span class="Style5">

            <a class="moz-txt-link-abbreviated" href="mailto:geoffrey.williams@univ-ubs.fr">geoffrey.williams@univ-ubs.fr</a>

            <br>

            tél. +33 (0)2 97 87 29 20 - fax. +33 (0)2 97 87 29 31

            <br>

            Faculté de Lettres Langues Sciences Humaines

            <br>

            et Sociales (LSHS)

            <br>

            4 rue Jean Zay <br>

            BP92113, 56321 LORIENT CEDEX<br>

            UNIVERSITÉ DE BRETAGNE-SUD

            <br>

            <a class="moz-txt-link-abbreviated" href="http://www.univ-ubs.fr">www.univ-ubs.fr</a>

            / <a class="moz-txt-link-abbreviated" href="http://www.licorn.com">www.licorn.com</a><br>

          </span></span></p>

      <hr style="width: 100%; height: 2px;">

      <p>New Book: European Identity: What the media say. Paul Bayley

        and Geoffrey Williams (eds). Oxford: OUP<br>

        <a href="http://ukcatalogue.oup.com/product/9780199602308.do">http://ukcatalogue.oup.com/product/9780199602308.do</a><br>

      </p>


    </div>

  </body>

</html>