[Corpora-List] What is corpora and what is not?

Laurence Anthony anthony0122 at gmail.com
Sat Oct 6 14:02:24 UTC 2012


Ramesh wrote:

> If you wish, I can simplify that even further, and say "a corpus is a digitized collection of texts",
> as one could argue that the collection is a corpus even before any analytical techniques are
> applied to it. It is only within corpus linguistics that quantitative techniques
> are usually applied before qualitative interpretations are made.

The "digitized" part of the above definition seems to imply that
certain hardware/software must be applied in the analysis, i.e.,
computers and concordancers. But, surely, we can apply the same
analytical techniques without the need for computers and software
(although the analysis would take a *lot* longer). If we remove the
"digitized" part of the definition, we are left with the following:

***Corpus = A collection of texts***

I'm not sure that I'm very happy with this definition, either!

It seems to me that many corpus studies attempt to describe *language
usage in some target domain* based on the analysis of a corpus. I
assume that the implication here is that the corpus is in some way
representative of the target domain (for a particular feature). If it
isn't and the corpus is simply "a (digitized) collection of texts", it
means that none of these authors can *assume* that their results are
generalizable in any way. Each author would then need to compare
their results with those of other studies on other "(digitized)
collections of texts" and measure the similarity of the findings. Only
when comparisons of multiple studies of multiple "(digitized)
collections of texts" are performed can we finally know anything
*generalizable* about the domain (for that target feature).
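
To make the comparison step concrete, here is a minimal sketch of what
"measuring the similarity of the findings" could look like for one
simple feature: the normalized frequency of a single word across two
collections. Everything in it is an assumption for illustration only:
the function names (`per_million`, `compare`), the crude regex
tokenizer, and the toy texts are hypothetical, and a real study would
likely use a proper tokenizer and a statistical test rather than a raw
difference in rates.

```python
import re


def per_million(texts, feature):
    """Occurrences of `feature` per million tokens across a list of text strings."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = [t for text in texts for t in re.findall(r"[a-z']+", text.lower())]
    if not tokens:
        return 0.0
    return 1_000_000 * tokens.count(feature.lower()) / len(tokens)


def compare(collections, feature):
    """Map each named collection to its normalized rate for `feature`."""
    return {name: per_million(texts, feature) for name, texts in collections.items()}


# Toy data, invented purely for illustration.
collections = {
    "collection_a": ["The corpus is small.", "The texts are short."],
    "collection_b": ["A corpus of texts.", "Texts about the corpus."],
}

rates = compare(collections, "the")
# Gap between the highest and lowest rate: a rough indicator of how far
# the collections disagree about this one feature.
spread = max(rates.values()) - min(rates.values())
print(rates, spread)
```

A large spread between collections would suggest that a finding from
any single collection should not be generalized to the domain without
further evidence.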

Are we doing this?

It seems to me that we are starting out with a corpus that is assumed
to be representative of some target domain (for a specific feature),
and then observing that feature. Then, we assume that the results of
the observation are generalizable in some way. Others might question
the representativeness of our corpus (and hence our results), and may
develop *better* corpora (i.e., *more representative* corpora) that
lead to *improved* findings, leading to *better* generalizations. To
me, this is the power of research: Research builds on previous
research.

In our field, the corpus is the starting point. By comparing the
results of previous corpus studies, we build *better* corpora (for a
particular language feature), and ultimately better models (of that
language feature). If each corpus is just "a (digitized) collection
of texts", then one corpus is not inherently better than another, and
so the description that is derived from that corpus is not inherently
better than any other (assuming both use the same analytical
techniques). So, we get no development in our understanding of how
that feature acts in the domain as a whole. All we get is a set of
*possibly related* observations about the target feature, none of
which have any predictive power about how the feature will work in a
new, as yet unseen, text.

This is a really interesting discussion. I look forward to everybody's
thoughts on this matter.

Laurence.
