[Corpora-List] What is corpora and what is not?

Fri Oct 5 10:39:23 UTC 2012

Many apologies if the formatting of my previous posting caused problems

in recognising quotes and my comments on them. It looked better on my

screen... and perhaps trying to address many postings in one response

was a bad idea. I'll try a different layout this time...

trevor.jenkins at suneidesis.com

Hi Trevor

> a) "Corpora compilers want Michelin starred restaurant cuisine for themselves..."

> b) "the cuisine of choice of the group being studying is a big Mac...".

I would dispute both assumptions.

a) (i) I presume you are talking about 'general language corpora' aiming to describe a 'language or language variety'?

I think any corpus compiler with this purpose in mind would actually want to collect as wide a range of texts

as possible. (ii) But they may also be trying to describe some notional 'standard language', of the kind that is taught in

schools and is found in the textbooks and texts that pupils are offered and encouraged to read, and recorded in dictionaries

and other language reference books. (iii) But I think the main problem is obtainability/ease of collection. (iv) To suggest

another metaphor, I think language texts can be considered as a pyramid, the apex being the most widely read/dispersed texts,

which are also the easiest to collect (e.g. the transcript of an Obama speech, a newspaper article, etc.), and tend to be less

spontaneous and only weakly (if at all) representative of more widespread informal conversational style. At the base of the

pyramid, we have millions of text producers with a limited audience, whose texts are mostly unrecorded. Indeed, the very

act of recording them may often distort their content. (v) The texts at the base - especially the spontaneous spoken ones - also

tend to be more idiosyncratic, and extremely context-dependent: full of background noise, networks of speakers, false starts,

self-corrections, hesitations, interruptions, topic shifts, etc. (vi) Also, ethically, there is a need to obtain permission from

all the speakers, and also to anonymise; both of these tasks are difficult, and the latter distorts the text. (vii) We would need

to collect an enormous number of these texts to overcome various types of skewing. (viii) So my answer to this problem would

be: to stop claiming 'representativeness of a language or language variety' and limit statements to 'the texts in this corpus',

which may be indicative, but not representative, of a language or language variety?

b) With cuisine as with language, we cannot claim with any certainty what people eat - or say. Your metaphor assumes

that we are only talking about non-home food, whereas many (most?) people eat at home? You yourself were planning

to emulate Mary Berry yesterday evening (hope the meal was enjoyable! :)... And Mac may be popular in some countries

among some groups, but 'the cuisine of choice'? And who said the group being studied is limited to 'Mac-eaters'? Every

language community is extremely diverse, the English-speaking community currently being the extreme case.

christopher.brew at gmail.com

Hi Chris

> "If anything, I undermined the case for a position that I am not at all interested in defending."

Sorry if I mistook your intention. You also said "In fact, I think that the very idea of "writing as a whole" is unhelpful,

and would much rather talk about specific cases of how and why people write." I am interested in both: the attempt

to describe the whole, and the unique features of individual cases.

trevor.jenkins at suneidesis.com

Hi again, Trevor

> "A individual's collection of liked (and also unliked) texts surely is a library not a corpus.

> Once that individual is famous... History might eventually upgrade the library to corpus."

This seems to contradict your 'Mac-eaters' metaphor? Why should we only be interested in the reading

(or book ownership; I'm quite happy to accommodate that distinction) habits of famous people?

> "storing them on the same device doesn't make them a corpus."

A corpus (in the modern sense) does depend on them all being available to be analysed by the same corpus

software. Again, the contents determine the statements: if we only look at items on someone's bookshelves,

we call it the 'corpus of books on X's bookshelves at a given moment'. We then do not need to consider other

media/formats, or whether X read them or not, books X may have read earlier/later, etc. I quite agree that if

we want to characterise X's reading habits, we would need to consider many other sources (media, internet,

etc.). This is why an accurate description of and label for the corpus being studied is so important.

> "They would only be considered a corpus because of a study looking at the linguistic variation in that genre over time."

I think that linguistic analysis need not be the only ultimate goal. For example, a social historian might be interested

in the corpus for other reasons.

> "The sample of *real* use is even smaller and its representation not just weak but flat out comatose."

High-genre use is also *real* use. But I agree that proportion is the problem.

> "In the end all our definitions are subordinate to the question "what is going to be done with the texts?"

> It the answer to that which moves them from library to corpus."

I think we agree: once the collection of texts is digitized, and analysed using corpus techniques, it is a corpus?

best

Ramesh
