[Corpora-List] What is corpora and what is not?

Mon Oct 8 14:46:05 UTC 2012

On Mon, Oct 8, 2012 at 9:32 PM, Krishnamurthy, Ramesh
<r.krishnamurthy at aston.ac.uk> wrote:

>>It seems to me that many corpus studies attempt to describe *language
>>usage in some target domain* based on the analysis of a corpus.
>
> Language description may have been the focus in earlier corpus linguistics.
> The field has developed since then, and many corpus studies use language
> description as part of the means to making statements about wider social
> issues, eg forensics, pedagogy, politics, etc?

Agreed. But as you write yourself, "many corpus studies use language
description" to make statements about these wider issues. In this
case, your "language description" is my "describe language usage" and
your "wider issues" is my "target domain". So, your example simply
supports my earlier statement rather than contradicts it.

>>I assume that the implication here is that the corpus is in some way
>>representative of the target domain (for a particular feature). If it
>>isn't and the corpus is simply "a (digitized) collection of texts", it
>>means that none of these authors can *assume* that their results are
>>generalizable in any way.
>
> As in all fields, the corpus/dataset we have collected is all we can actually analyse.
> The representativeness or not of this dataset to some other notional dataset is part
> of the claim being made by the researcher, and readers can evaluate the degree of
> validity of the claim.

Ahh, my point is more subtle. If a researcher intends to make claims
about the target domain based on a corpus and at the same time the
researcher makes no assumptions about the representativeness of the
corpus itself, i.e., it's just "a collection of texts", then the claim
itself is unfounded. Only when the researcher assumes (rightly or
wrongly) that the corpus is representative, can the claim be made. Of
course, readers can then assess that claim based on a comparison with
other data (corpora) and the data (corpus) itself, but this is a
separate issue. My point here is looking at the claim from the
researcher's perspective.

> Surely no researcher can *assume* anything? The generalizability or not of their
> statements/results is again a matter for reader judgment?

Yes they can! Researchers can assume A, and then derive B from A. We
do it all the time in science. Do a concordance search of any math
corpus and you'll find assumptions everywhere. Here's a concrete
example,

"Assuming the Brown corpus is representative of general English, we
find that the most frequent word used in general English is 'the'."

Of course, the assumption may be wrong, and that's what the reader can
judge ("Is the Brown corpus really representative of general
English?"). But, the audience can also judge the correctness of the
statement *in the case that the assumption is true*. Here, the
judgement would be, "Is the most frequent word used in general
English, based on frequencies in the Brown corpus, the word 'the'?"

They are two different levels of question. The first leads to better
assumptions and thus advancement in our understanding of the concept
(e.g., 'general English'). The second leads to more accurate results
based on assumptions (e.g., ways to count word frequencies).
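
To make the second level of question concrete, here is a minimal Python
sketch of one way to count word frequencies. The function name
most_frequent_word and the three-sentence toy_corpus are hypothetical
stand-ins chosen for illustration; this is not the Brown corpus, and the
tokenization rule (lowercase, letters and apostrophes only) is just one
assumption among many possible ones.

```python
import re
from collections import Counter


def most_frequent_word(texts):
    """Return (word, count) for the most frequent word across the texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase everything, then keep runs of
        # letters and apostrophes. A different rule may give other counts.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(1)[0]


# Hypothetical toy data standing in for a corpus (NOT the Brown corpus).
toy_corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A bird watched the dog.",
]

word, count = most_frequent_word(toy_corpus)
print(word, count)
```

Whether "The" and "the" are counted together, or how "don't" is split, is
exactly the kind of counting decision a reader can check independently of
whether the corpus is representative.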

> #5 Laurence wrote:
>
>>In our field, the corpus is the starting point. By comparing the
>>results of previous corpora studies, we build *better* corpora (for a
>>particular language feature), and ultimately better models (of that
>>language feature).
>
>
>
> I'm not sure what you mean by 'language feature'. The corpus is collected
> on external criteria, the 'language features' emerge from the analysis?

Here, I mean things like "connectives", "past tense", etc. "Past
tense" does not emerge from the analysis. If we want to investigate
"past tense usage in research paper methods sections", we could
collect methods sections in applied linguistics research papers. But,
someone could build a better corpus by collecting methods sections in
multiple disciplines. I hope that clarifies this point.

>Its (corpus) appropriacy or not - for a subsequently specified purpose - is
> an evaluation we make in response to its use as a research dataset and
> the 'possibly related' observations made from its analysis. The reader judges
> the correlation between the data and the findings, and the extent of the
> predictive (probabilistic) power of the statements to other texts/datasets.

Agreed. But, see my earlier point about the need for the researcher to
have a basis for making a claim in the first place. I would argue that
the author *must* start with the assumption (see the point above) that
the corpus is representative of the target domain (rightly or
wrongly), before any claims can be made.

So perhaps a good definition of corpus is the following:

"A corpus is a collection of (digitized) texts that is *assumed to be*
representative of a target domain."

To me this captures everything we have discussed in this thread and
also addresses the issues of representativeness mentioned above. The
"assumed to be" wording is critical because it is the foundation of
all linguistic inquiry but it also addresses the reality that the
assumption might be wrong.

Sorry to drag this discussion on even longer!

Laurence.
