[Corpora-List] What is corpora and what is not?

Sat Oct 6 11:38:48 UTC 2012

Hi Himanshu

"...it would be nice if someone could take one for the team and summarize the above discussion.

A generalized form of definition would be much appreciated. Mr. Krishnamurthy is doing that to some
extent but his replies are too specific and the skeleton of the definition is not very clear, at least not to me."

I was trying to use specific examples to illustrate my general points.

My final suggestion was: "once the collection of texts is digitized, and analysed using corpus techniques, it is a corpus".

If you wish, I can simplify that even further, and say "a corpus is a digitized collection of texts", as one could argue
that the collection is a corpus even before any analytical techniques are applied to it. It is only within corpus linguistics
that quantitative techniques are usually applied before qualitative interpretations are made.

a) 'digitized' indicates that this definition only refers to the modern sense of corpus, as used in corpus linguistics.
But all definitions are necessarily context-dependent. A physicist and a geographer might define the same
entity/feature in different terms. A schoolteacher and a college professor might similarly define the same thing differently.

b) 'collection' allows for small corpora and large ones; larger ones are better when we are using statistical measures,
and also allow us to make more reliable and generalisable statements. But small corpora, especially
in situations where data is difficult to collect, or in pilot studies, are a necessary and normal part of the field.

c) the definition of 'text' is also context-dependent, and can only be made within a piece of research,
according to the aim of that research. All texts need to have some external unitary integrity, but the researcher
needs to specify which feature he/she is selecting to work with, and why. Then we can discuss whether the unit/text
selected was appropriate. If I want to make general statements about 'The Guardian newspaper', then I may want to
consider each daily issue of the newspaper as a text (the unity coming from the fact that the contents were published
together at a specific point in time). If I want to make statements about an individual Guardian journalist, then each article
needs to be considered as a text, and that journalist's articles would form the basis for my study. Informal spoken data may
be even more difficult to divide into 'texts', but again, as long as the researcher specifies the decisions made, and the reasons
for these, and applies them consistently, we can discuss the appropriacy of the units within the research.
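
To make the Guardian example concrete, here is a minimal Python sketch of how the same collection can be divided into 'texts' in two different ways, depending on the research aim. The Article record, its field names, the two function names and the sample data are all invented for illustration; they are assumptions, not part of any real corpus toolkit or of the Guardian's data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """Hypothetical record for one newspaper article (illustrative only)."""
    issue_date: str  # date of the daily issue the article appeared in
    author: str
    body: str


def texts_by_issue(articles):
    """Unit of analysis = one daily issue: join all articles published together."""
    issues = defaultdict(list)
    for article in articles:
        issues[article.issue_date].append(article.body)
    return {date: "\n".join(bodies) for date, bodies in issues.items()}


def texts_by_author(articles, author):
    """Unit of analysis = one article: keep only the chosen journalist's articles."""
    return [article.body for article in articles if article.author == author]


# Invented sample data, purely to show the two segmentations.
articles = [
    Article("2012-10-05", "A. Writer", "Budget talks continue."),
    Article("2012-10-05", "B. Reporter", "Rain delays the match."),
    Article("2012-10-06", "A. Writer", "Budget talks collapse."),
]

issue_texts = texts_by_issue(articles)                # two texts, one per issue
author_texts = texts_by_author(articles, "A. Writer")  # two texts, one per article
```

The point of the sketch is only that the unit is an explicit, documented decision applied consistently, so that others can judge whether it suits the research question.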

Finally, in addition to the detailed analysis of the set of texts selected, corpus research is always enhanced by comparing the
selected dataset with other relevant text collections. In my earlier example, one can make additional statements about the
Guardian journalist by comparing his/her texts with the texts written by other Guardian journalists, with the Guardian texts
as a whole, with other newspapers, with other genres of data, and with large general corpora of the English language.
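
As a rough illustration of such a comparison, the following Python sketch computes the relative frequency of a word in a selected set of texts and in a reference collection. The function names, the crude regular-expression tokenizer and the sample sentences are assumptions made for this example; a real study would normally use a proper tokenizer and a statistical measure rather than raw proportions.

```python
import re
from collections import Counter


def relative_freqs(texts):
    """Map each word to its share of all tokens in the given texts."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = [tok for text in texts for tok in re.findall(r"[a-z']+", text.lower())]
    total = len(tokens)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(tokens).items()}


def compare(target_texts, reference_texts, word):
    """Return (relative frequency in target, relative frequency in reference)."""
    target = relative_freqs(target_texts).get(word, 0.0)
    reference = relative_freqs(reference_texts).get(word, 0.0)
    return target, reference


# Invented sample data: one journalist's texts versus a small reference set.
journalist_texts = ["Budget talks continue.", "Budget talks collapse."]
reference_texts = ["Rain delays the match.", "The budget is late."]

target_share, reference_share = compare(journalist_texts, reference_texts, "budget")
```

The same comparison can be repeated against each larger collection in turn (other journalists, the whole newspaper, other genres, a general corpus), provided the frequencies are normalized by collection size as above.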

I hope this helps.

Best

Ramesh
