<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <tt>I think Robert puts it pretty well. My reaction was simply to

      look up 'corpus' in the online Oxford dictionary, where 'corpus'

      has two senses: the main sense is "<span class="definition">a

        collection of written texts, especially the entire works of a

        particular author or a body of writing on a particular subject

        (e.g., the Darwinian corpus)" and a <b>subsense</b>, "<span

          class="definition">a collection of written or spoken material

          in machine-readable form, assembled for the purpose of

          linguistic research". I think these pretty well subsume and

          obviate all the points made in this discussion.<br>

          <br>

              Ken<br>

          <br>

        </span></span></tt>

    <div class="moz-cite-prefix">On 10/6/2012 12:08 PM,

      <a class="moz-txt-link-abbreviated" href="mailto:amsler@cs.utexas.edu">amsler@cs.utexas.edu</a> wrote:<br>

    </div>

    <blockquote

      cite="mid:20121006110825.ibpw38lv4scg8kck@webmail.utexas.edu"

      type="cite">The simplest summary I came away with is that a corpus

      is a set of

      <br>

      texts that has a proposed purpose of study. At least one person

      must

      <br>

      have an intention for the collection to serve a purpose. The

      <br>

      unanswered question is whether a corpus has to even be texts, or

      can

      <br>

      it be a corpus of other types of data, such as a corpus of lexical

      <br>

      items, a corpus of musical recordings, or a corpus of video clips.

      <br>

      <br>

      This definition of a corpus means that it may not be recognized as

      a

      <br>

      corpus by anyone else other than its collector/creator. It may

      appear

      <br>

      to be a random set of pages, a happenstance collection of books, etc.

      <br>

      unless you figure out what they share in common. And note that

      <br>

      'randomness' is a purpose. Some of the most important corpora are

      <br>

      those whose purpose is to be a random (or 'representative')

      <br>

      sample of something. The Brown Corpus tried to be representative

      by

      <br>

      being random. I suppose randomness requires that every instance of the

      set

      <br>

      collected from have an equal chance of being included--and

      <br>

      representativeness requires that enough items be collected to reflect

      the

      <br>

      properties of the set collected from. Ah... but what "properties",

      eh.

      <br>

      <br>

      This is why a corpus needs an explanation of its properties, its

      <br>

      reason for being a corpus, to guarantee its recognition as a

      corpus

      <br>

      and its utility to others.

      <br>

      <br>

      The discussion as to whether something deserves to be called a

      corpus is picky.

      <br>

      As they say, we want a big tent that invites in as many as possible.

      <br>

      <br>

      We should be discussing what constitutes "best practices" and not

      trying to deny membership in the set of corpora to collections

      that don't meet all the criteria. I'd be happier to learn of the

      levels of qualifications that a corpus should have. Good

      documentation. Availability. Size. "Representativeness" (of

      what?). Annotations. Indexes of elements (spellings, phrases,

      named entities, disambiguation of senses).

      <br>

      <br>

      How to make a corpus that adheres to "best practices" would be

      more useful than deciding on whether someone's purposeful

      collection of text qualifies to be called a corpus by everyone.

      <br>

      <br>


      Corpora mailing list

      <br>

      <a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>

      <br>

      <a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>

      <br>

      <br>

    </blockquote>

    <br>

    <pre class="moz-signature" cols="72">-- 

Ken Litkowski                     TEL.: 301-482-0237

CL Research                       EMAIL: <a class="moz-txt-link-abbreviated" href="mailto:ken@clres.com">ken@clres.com</a>

9208 Gue Road                     Home Page: <a class="moz-txt-link-freetext" href="http://www.clres.com">http://www.clres.com</a>

Damascus, MD 20872-1025 USA       Blog: <a class="moz-txt-link-freetext" href="http://www.clres.com/blog">http://www.clres.com/blog</a>

</pre>

  </body>

</html>