<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#663300">

    <font face="Cambria"><br>

      <br>

      Those of you interested in general concepts may want to read the

    </font>article by Ole Norling-Christensen, "Habeas corpus",<br>

    published in the ELRA newsletter in ... 1996 <br>

    <br>

    <a class="moz-txt-link-freetext" href="http://www.elra.info/Newsletters-from-1996.html#1996">http://www.elra.info/Newsletters-from-1996.html#1996</a><br>

    Best regards<br>

    Khalid<br>

    <br>

    <br>

    <br>

    Laurence Anthony wrote, On 08/10/2012 16:46:

    <blockquote

cite="mid:CAL6Fgv2tK3Cz+0TVr0VS+Nv_tTBmQzqAV3SgMW4daz=gxFYH4g@mail.gmail.com"

      type="cite">

      <pre wrap="">On Mon, Oct 8, 2012 at 9:32 PM, Krishnamurthy, Ramesh

<a class="moz-txt-link-rfc2396E" href="mailto:r.krishnamurthy@aston.ac.uk">&lt;r.krishnamurthy@aston.ac.uk&gt;</a> wrote:

</pre>

      <blockquote type="cite">

        <blockquote type="cite">

          <pre wrap="">It seems to me that many corpus studies attempt to describe *language

usage in some target domain* based on the analysis of a corpus.

</pre>

        </blockquote>

        <pre wrap="">

Language description may have been the focus in earlier corpus linguistics.

The field has developed since then, and many corpus studies use language

description as part of the means to making statements about wider social

issues, e.g. forensics, pedagogy, politics, etc.?

</pre>

      </blockquote>

      <pre wrap="">

Agreed. But as you write yourself, "many corpus studies use language

description" to make statements about these wider issues. In this

case, your "language description" is my "describe language usage" and

your "wider issues" is my "target domain". So, your example simply

supports my earlier statement rather than contradicts it.

</pre>

      <blockquote type="cite">

        <blockquote type="cite">

          <pre wrap="">I assume that the implication here is that the corpus is in some way

representative of the target domain (for a particular feature). If it

isn't and the corpus is simply "a (digitized) collection of texts", it

means that none of these authors can *assume* that their results are

generalizable in any way.

</pre>

        </blockquote>

        <pre wrap="">

As in all fields, the corpus/dataset we have collected is all we can actually analyse.

The representativeness or not of this dataset to some other notional dataset is part

of the claim being made by the researcher, and readers can evaluate the degree of

validity of the claim.

</pre>

      </blockquote>

      <pre wrap="">

Ahh, my point is more subtle. If a researcher intends to make claims

about the target domain based on a corpus and at the same time the

researcher makes no assumptions about the representativeness of the

corpus itself, i.e., it's just "a collection of texts", then the claim

itself is unfounded. Only when the researcher assumes (rightly or

wrongly) that the corpus is representative, can the claim be made. Of

course, readers can then assess that claim based on a comparison with

other data (corpora) and the data (corpus) itself, but this is a

separate issue. My point here is looking at the claim from the

researcher's perspective.

</pre>

      <blockquote type="cite">

        <pre wrap="">Surely no researcher can *assume* anything? The generalizability or not of their

statements/results is again a matter for reader judgment?

</pre>

      </blockquote>

      <pre wrap="">

Yes they can! Researchers can assume A, and then derive B from A. We

do it all the time in science. Do a concordance search of any math

corpus and you'll find assumptions everywhere. Here's a concrete

example,

"Assuming the Brown corpus is representative of general English, we

find that the most frequent word used in general English is 'the'."

Of course, the assumption may be wrong, and that's what the reader can

judge ("Is the Brown corpus really representative of general

English?"). But, the audience can also judge the correctness of the

statement *in the case that the assumption is true*. Here, the

judgement would be, "Is the most frequent word used in general

English, based on frequencies in the Brown Corpus, the word 'the'?"

They are two different levels of question. The first leads to better

assumptions and thus advancement in our understanding of the concept

(e.g., 'general English'). The second leads to more accurate results

based on assumptions (e.g., ways to count word frequencies).

</pre>

      <blockquote type="cite">

        <pre wrap="">#5 Laurence wrote:

</pre>

        <blockquote type="cite">

          <pre wrap="">In our field, the corpus is the starting point. By comparing the

results of previous corpus studies, we build *better* corpora (for a

particular language feature), and ultimately better models (of that

language feature).

</pre>

        </blockquote>

        <pre wrap="">

I'm not sure what you mean by 'language feature'. The corpus is collected

on external criteria, the 'language features' emerge from the analysis?

</pre>

      </blockquote>

      <pre wrap="">

Here, I mean things like "connectives", "past tense", etc. "Past

tense" does not emerge from the analysis. If we want to investigate

"past tense usage in research paper methods sections", we could

collect methods sections in applied linguistics research papers. But

someone could build a better corpus by collecting methods sections in

multiple disciplines. I hope that clarifies this point.

</pre>

      <blockquote type="cite">

        <pre wrap="">Its (corpus) appropriacy or not - for a subsequently specified purpose - is

an evaluation we make in response to its use as a research dataset and

the 'possibly related' observations made from its analysis. The reader judges

the correlation between the data and the findings, and the extent of the

predictive (probabilistic) power of the statements to other texts/datasets.

</pre>

      </blockquote>

      <pre wrap="">

Agreed. But, see my earlier point about the need for the researcher to

have a basis for making a claim in the first place. I would argue that

the author *must* start with the assumption (see the point above) that

the corpus is representative of the target domain (rightly or

wrongly), before any claims can be made.

So perhaps a good definition of corpus is the following:

"A corpus is a collection of (digitized) texts that is *assumed to be*

representative of a target domain."

To me this captures everything we have discussed in this thread and

also addresses the issues of representativeness mentioned above. The

"assumed to be" wording is critical because it is the foundation of

all linguistic inquiry but it also addresses the reality that the

assumption might be wrong.

Sorry to drag this discussion on even longer!

Laurence.


</pre>

    </blockquote>

    <br>

    <div class="moz-signature">-- <br>

      <b> Khalid Choukri </b>

      <br>

      ELRA General Secretary &amp; ELDA CEO

      <br>

      email: <a class="moz-txt-link-abbreviated" href="mailto:choukri@elda.org">choukri@elda.org</a>; <br>

      Web: <a class="moz-txt-link-abbreviated" href="http://www.elra.info">www.elra.info</a> <a class="moz-txt-link-abbreviated" href="http://www.elda.org">www.elda.org</a>

      <br>

      Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30

      <br>

      <br>

      <b> ***************************************************<br>

        ** Info on LREC 2012 : <a class="moz-txt-link-abbreviated" href="http://www.lrec-conf.org">www.lrec-conf.org</a> <br>

        ***************************************************<br>

      </b></div>

  </body>

</html>