[Corpora-List] What is corpora and what is not?

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Mon Oct 8 12:32:38 UTC 2012


I think we are beginning to go over previously covered ground... but this is a) a feature of discussion mode and b) sometimes useful in forcing us to rephrase previous statements?



#1 Laurence wrote:

>The "digitized" part of the above definition seems to imply that
>certain hardware/software must be applied in the analysis, i.e.,
>computers and concordancers. But, surely, we can apply the same
>analytical techniques without the need for computers and software
>(although the analysis would take a *lot* longer). If we remove the
>"digitized" part of the definition, we are left with the following:
>***Corpus = A collection of texts***
>I'm not sure that I'm very happy with this definition, either!

Ramesh had written:

>a) 'digitized' indicates that this definition only refers to the modern
>sense of corpus, as used in corpus linguistics.
>But all definitions are necessarily context-dependent.



If you want to define 'corpus' to include historical uses, that's fine. But in the context of 'corpus linguistics', I think computers are a sine qua non?



#2 Laurence wrote:

>It seems to me that many corpus studies attempt to describe *language
>usage in some target domain* based on the analysis of a corpus.



Language description may have been the focus in earlier corpus linguistics. The field has developed since then, and many corpus studies now use language description as part of the means of making statements about wider social issues, eg forensics, pedagogy, politics, etc?



#3 Laurence wrote:
>I assume that the implication here is that the corpus is in some way
>representative of the target domain (for a particular feature). If it
>isn't and the corpus is simply "a (digitized) collection of texts", it
>means that none of these authors can *assume* that their results are
>generalizable in any way.



As in all fields, the corpus/dataset we have collected is all we can actually analyse. Whether this dataset is representative of some other, notional dataset is part of the claim being made by the researcher, and readers can evaluate the degree of validity of that claim.



Surely no researcher can *assume* anything? The generalizability or not of their statements/results is again a matter for reader judgment?



#4 Laurence wrote:

>Others might question the representativeness of our corpus (and hence our results),
>and may develop *better* corpora (i.e. *more representative* corpora) that
>lead to *improved* findings, leading to *better* generalizations. To me, this
>is the power of research: Research builds on previous research.

I agree. That's why I wrote "corpus research is always enhanced by comparing the
selected dataset with other relevant text collections".
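
To make that kind of comparison concrete, here is a rough sketch in Python; the two tiny 'collections' below are invented, purely for illustration, and a real study would use proper tokenisation and a significance measure such as log-likelihood rather than a raw ratio.

from collections import Counter

def rel_freqs(texts):
    # Frequency per million words for each token in a list of texts.
    tokens = [w.lower() for t in texts for w in t.split()]
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c * 1_000_000 / total for w, c in counts.items()}

# Invented mini-collections; in practice these would be read from files.
study_texts = ["the jury found the defendant guilty", "the witness was unreliable"]
reference_texts = ["the cat sat on the mat", "the dog barked at the postman"]

study = rel_freqs(study_texts)
reference = rel_freqs(reference_texts)

# Words noticeably more frequent in the selected dataset than in the reference.
for word, f in sorted(study.items(), key=lambda x: -x[1]):
    ratio = f / reference.get(word, 1.0)   # crude ratio, illustration only
    if ratio > 2:
        print(f"{word}: {f:.0f} per million vs {reference.get(word, 0):.0f}")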



#5 Laurence wrote:

>In our field, the corpus is the starting point. By comparing the
>results of previous corpus studies, we build *better* corpora (for a
>particular language feature), and ultimately better models (of that
>language feature).



I'm not sure what you mean by 'language feature'. The corpus is collected on external criteria; the 'language features' emerge from the analysis?



#6 Laurence wrote:

>If each corpus is just "a (digitized) collection
>of texts", then one corpus is not inherently better than another, and
>so the description that is derived from that corpus is not inherently
>better than any other (assuming both use the same analytical
>techniques). So, we get no development in our understanding of how
>that feature acts in the domain as a whole. All we get is a set of
>*possibly related* observations about the target feature, none of
>which have any predictive power about how the feature will work in a
>new, as yet unseen, text.

The corpus itself is indeed 'just' a collection of - digitized, in the modern sense - texts. Its appropriacy or not - for a subsequently specified purpose - is an evaluation we make in response to its use as a research dataset and to the 'possibly related' observations made from its analysis. The reader judges the correlation between the data and the findings, and the extent of the predictive (probabilistic) power of the statements with respect to other texts/datasets.



#7 Trevor wrote:

>I would change it to "corpus = text collected for the purpose of quantitative
>and/or qualitative analysis".

If you recall, this is what I started with. It is the application of corpus techniques that makes the collection 'a corpus' - *within the field of corpus linguistics*. But other researchers may use the same dataset for other purposes; and even a linguist's ultimate purpose may lie beyond the linguistic analysis itself...?



#8 Amsler (Robert?) wrote:

>The simplest summary I came away with is that a corpus is a set of
>texts that has a proposed purpose of study. At least one person must
>have an intention for the collection to serve a purpose.



Agreed.



#9 Amsler wrote:

>The unanswered question is whether a corpus has to even be texts, or can
>it be a corpus of other types of data; such as corpus of lexical
>items, a corpus of musical recordings, or a corpus of video clips.

I think you are conflating a) medium/mode and b) unit of collection?



a) Audio (including musical) corpora and video corpora can indeed be called corpora. We just happen to be at a stage of technological development when hardware and software problems prevent most people from easily collecting and analysing such corpora. Even most audio corpora are currently transcribed before analysis. Music and video also require considerably more than linguistic expertise in their analysis.



> cf Radev wrote: "Can't there be corpora made up of images or songs?"



b) I did mention some of the problems in defining 'text'. However, I find it difficult to conceive of a 'corpus of lexical items'. A lexical item is not a text. 'Text' implies a unit of language larger than a lexical item, with some external integrity. That integrity needs to be specified by the researcher - inevitably we cause some damage in the process, eg separating an article from an issue of a newspaper loses its possible intertextual features / coherence within that issue? Ideally, I would want to consider language texts as part of the 'context of culture/situation' (Malinowski/Firth) or the 'environment of the text' (Halliday), and the internet is helping towards this, but the analytical technology (corpus software) is not yet sufficiently advanced?



>cf Alex wrote: "There will always be a compromise between what you really
>want and what you can reasonably collect."



#10 Amsler wrote:

>This definition of a corpus means that it may not be recognized as a
>corpus by anyone else other than its collector/creator. It may appear
>to be a random set of pages, a happenstance collection of books, etc.
>unless you figure out what they share in common.



I agree. It is up to the creator/analyst to tell us why they consider the collection to be a corpus, and for us to decide whether we accept their argument or not.



#11 Amsler wrote:

>And note that 'randomness' is a purpose. Some of the most important corpora are
>those whose purpose is to be a random (or 'representative')
>sample of something. The Brown Corpus tried to be representative by
>being random.



Randomness may indeed be a purpose, but I'm not sure that it is achievable. I don't think human beings can make random selections. I don't know enough about maths/computer science to know what 'random' means in computer terms. But when you are working with such small samples (corpora) with respect to a population (all the texts in a language), doesn't that make 'true randomness' impossible?
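
As far as I understand it, 'random' in computer terms usually means pseudorandom: a deterministic generator, given a seed, selects from whatever pool of texts we have actually managed to obtain - which is exactly the gap between sample and population I mean. A rough sketch in Python (the file names are invented):

import random

# Invented pool of available texts - not 'all the texts in a language',
# only what we have been able to collect.
available_texts = [f"text_{i:03d}.txt" for i in range(500)]

random.seed(42)                  # deterministic: same seed, same 'random' sample
sample = random.sample(available_texts, k=50)

print(sample[:5])                # a reproducible selection of 50 of the 500 texts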



I don't think Brown was random or representative. It selected texts that were available in a priori specified categories? I think representativeness is a relative and evaluative notion.



#12 Amsler wrote:

>This is why a corpus needs an explanation of its properties, its
>reason for it being a corpus, to guarantee its recognition as a corpus
>and its utility to others.



Agreed.



#13 Amsler wrote:

>How to make a corpus that adheres to "best practices" would be more
>useful than deciding on whether someone's purposeful collection of
>text qualified to be called a corpus by everyone.



Agreed. Geoffrey Williams referred earlier to the EU EAGLES project. NERC, TEI, LREC, etc have all contributed to such discussions.



#14 John Sowa:

Mostly agreed. But we are moving beyond 'texts/speech' to video corpora? I especially agree that the qualifiers attached to the corpus (eg its name, acronym, etc) indicate its relationship to other similar datasets.



#15 Ken wrote:

>assembled for the purpose of linguistic research.



Attempting to be inclusive, as Amsler suggested, I thought that the ultimate purpose could be extended beyond linguistics?



#16 Trevor wrote:

>Or possibly better transcripts of the unredacted 24-hour live feeds of reality shows like Big Brother.
>But there's still a selection process involved which skews the language used.
>Now the irony is that until such time as a large-scale corpus of truly informal, unrehearsed, unscripted utterances
>exists we won't be able to do any comparisons between the lexical choices and grammar
>constructions of normal language.

>cf Alex wrote: "There will always be a compromise between what you really
>want and what you can reasonably collect."



I agree that all corpora are skewed by the selection process (see my comments on 'random' and 'representative' above). However, you use many qualifiers ('truly informal, unrehearsed, unscripted') that would be extremely difficult to evaluate, culminating in 'normal'?



best

Ramesh







