<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#663300">
<font face="Cambria"><br>
<br>
Those of you interested in general concepts may want to read
</font>Ole Norling-Christensen's article, "Habeas corpus",<br>
published in the ELRA newsletter in ... 1996 <br>
<br>
<a class="moz-txt-link-freetext" href="http://www.elra.info/Newsletters-from-1996.html#1996">http://www.elra.info/Newsletters-from-1996.html#1996</a><br>
Best regards<br>
Khalid<br>
<br>
<br>
<br>
Laurence Anthony wrote, On 08/10/2012 16:46:
<blockquote
cite="mid:CAL6Fgv2tK3Cz+0TVr0VS+Nv_tTBmQzqAV3SgMW4daz=gxFYH4g@mail.gmail.com"
type="cite">
<pre wrap="">On Mon, Oct 8, 2012 at 9:32 PM, Krishnamurthy, Ramesh
<a class="moz-txt-link-rfc2396E" href="mailto:r.krishnamurthy@aston.ac.uk">&lt;r.krishnamurthy@aston.ac.uk&gt;</a> wrote:
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">It seems to me that many corpus studies attempt to describe *language
usage in some target domain* based on the analysis of a corpus.
</pre>
</blockquote>
<pre wrap="">
Language description may have been the focus in earlier corpus linguistics.
The field has developed since then, and many corpus studies use language
description as part of the means to making statements about wider social
issues, e.g., forensics, pedagogy, politics, etc.?
</pre>
</blockquote>
<pre wrap="">
Agreed. But as you write yourself, "many corpus studies use language
description" to make statements about these wider issues. In this
case, your "language description" is my "describe language usage" and
your "wider issues" is my "target domain". So, your example simply
supports my earlier statement rather than contradicts it.
</pre>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">I assume that the implication here is that the corpus is in some way
representative of the target domain (for a particular feature). If it
isn't and the corpus is simply "a (digitized) collection of texts", it
means that none of these authors can *assume* that their results are
generalizable in any way.
</pre>
</blockquote>
<pre wrap="">
As in all fields, the corpus/dataset we have collected is all we can actually analyse.
The representativeness or not of this dataset to some other notional dataset is part
of the claim being made by the researcher, and readers can evaluate the degree of
validity of the claim.
</pre>
</blockquote>
<pre wrap="">
Ahh, my point is more subtle. If a researcher intends to make claims
about the target domain based on a corpus and at the same time the
researcher makes no assumptions about the representativeness of the
corpus itself, i.e., it's just "a collection of texts", then the claim
itself is unfounded. Only when the researcher assumes (rightly or
wrongly) that the corpus is representative, can the claim be made. Of
course, readers can then assess that claim based on a comparison with
other data (corpora) and the data (corpus) itself, but this is a
separate issue. My point here is looking at the claim from the
researcher's perspective.
</pre>
<blockquote type="cite">
<pre wrap="">Surely no researcher can *assume* anything? The generalizability or not of their
statements/results is again a matter for reader judgment?
</pre>
</blockquote>
<pre wrap="">
Yes they can! Researchers can assume A, and then derive B from A. We
do it all the time in science. Do a concordance search of any math
corpus and you'll find assumptions everywhere. Here's a concrete
example,
"Assuming the Brown corpus is representative of general English, we
find that the most frequent word used in general English is 'the'."
Of course, the assumption may be wrong, and that's what the reader can
judge ("Is the Brown corpus really representative of general
English?"). But, the audience can also judge the correctness of the
statement *in the case that the assumption is true*. Here, the
judgement would be, "Is the most frequent word used in general
English, based on frequencies in the Brown corpus, the word 'the'?"
They are two different levels of question. The first leads to better
assumptions and thus advancement in our understanding of the concept
(e.g., 'general English'). The second leads to more accurate results
based on assumptions (e.g., ways to count word frequencies).
</pre>
<blockquote type="cite">
<pre wrap="">#5 Laurence wrote:
</pre>
<blockquote type="cite">
<pre wrap="">In our field, the corpus is the starting point. By comparing the
results of previous corpora studies, we build *better* corpora (for a
particular language feature), and ultimately better models (of that
language feature).
</pre>
</blockquote>
<pre wrap="">
I'm not sure what you mean by 'language feature'. The corpus is collected
on external criteria, the 'language features' emerge from the analysis?
</pre>
</blockquote>
<pre wrap="">
Here, I mean things like "connectives", "past tense", etc. "Past
tense" does not emerge from the analysis. If we want to investigate
"past tense usage in research paper methods sections", we could
collect methods sections in applied linguistics research papers. But,
someone could build a better corpus by collecting methods sections in
multiple disciplines. I hope that clarifies this point.
</pre>
<blockquote type="cite">
<pre wrap="">Its (corpus) appropriacy or not - for a subsequently specified purpose - is
an evaluation we make in response to its use as a research dataset and
the 'possibly related' observations made from its analysis. The reader judges
the correlation between the data and the findings, and the extent of the
predictive (probabilistic) power of the statements to other texts/datasets.
</pre>
</blockquote>
<pre wrap="">
Agreed. But, see my earlier point about the need for the researcher to
have a basis for making a claim in the first place. I would argue that
the author *must* start with the assumption (see the point above) that
the corpus is representative of the target domain (rightly or
wrongly), before any claims can be made.
So perhaps a good definition of corpus is the following:
"A corpus is a collection of (digitized) texts that is *assumed to be*
representative of a target domain."
To me this captures everything we have discussed in this thread and
also addresses the issues of representativeness mentioned above. The
"assumed to be" wording is critical because it is the foundation of
all linguistic inquiry but it also addresses the reality that the
assumption might be wrong.
Sorry to drag this discussion on even longer!
Laurence.
</pre>
</blockquote>
<br>
<div class="moz-signature">-- <br>
<b> Khalid Choukri </b>
<br>
ELRA General secretary &amp; ELDA CEO
<br>
email: <a class="moz-txt-link-abbreviated" href="mailto:choukri@elda.org">choukri@elda.org</a>; <br>
Web: <a class="moz-txt-link-abbreviated" href="http://www.elra.info">www.elra.info</a> <a class="moz-txt-link-abbreviated" href="http://www.elda.org">www.elda.org</a>
<br>
Tel. +33 1 43 13 33 33 - Fax. +33 1 43 13 33 30
<br>
<br>
<b> ***************************************************<br>
** Info on LREC 2012 : <a class="moz-txt-link-abbreviated" href="http://www.lrec-conf.org">www.lrec-conf.org</a> <br>
***************************************************<br>
</b></div>
</body>
</html>