[Corpora-List] What is corpora and what is not?
Krishnamurthy, Ramesh
r.krishnamurthy at aston.ac.uk
Sat Oct 6 11:38:48 UTC 2012
Hi Himanshu
"...it would be nice if someone could take one for the team and summarize the above discussion.
A generalized form of definition would be much appreciated. Mr. Krishnamurthy is doing that to some
extent but his replies are too specific and the skeleton of the definition is not very clear, at least not to me."
I was trying to use specific examples to illustrate my general points.
My final suggestion was: "once the collection of texts is digitized, and analysed using corpus techniques, it is a corpus".
If you wish, I can simplify that even further, and say "a corpus is a digitized collection of texts", as one could argue
that the collection is a corpus even before any analytical techniques are applied to it. It is only within corpus linguistics
that quantitative techniques are usually applied before qualitative interpretations are made.
a) 'digitized' indicates that this definition only refers to the modern sense of corpus, as used in corpus linguistics.
But all definitions are necessarily context-dependent. A physicist and a geographer might define the same
entity/feature in different terms. A schoolteacher and a college professor might similarly define the same thing differently.
b) 'collection' allows for small corpora and large ones; larger ones are better suited to statistical measures,
and allow us to make more reliable and generalisable statements. But small corpora, especially
in situations where data is difficult to collect, or in pilot studies, are a necessary and normal part of the field.
c) the definition of 'text' is also context-dependent, and can only be made within a piece of research,
according to the aim of that research. All texts need to have some external unitary integrity, but the researcher
needs to specify which feature he/she is selecting to work with, and why. Then we can discuss whether the unit/text
selected was appropriate. If I want to make general statements about 'The Guardian newspaper', then I may want to
consider each daily issue of the newspaper as a text (the unity coming from the fact that the contents were published
together at a specific point in time). If I want to make statements about an individual Guardian journalist, then each article
needs to be considered as a text, and that journalist's articles would form the basis for my study. Informal spoken data may
be even more difficult to divide into 'texts', but again, as long as the researcher specifies the decisions made, and the reasons
for these, and applies them consistently, we can discuss the appropriacy of the units within the research.
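As a rough illustration of point (c), the same underlying material can be cut into 'texts' in more than one way, depending on the research aim. The sketch below is only that: the Article record, the sample data and the two function names are hypothetical, invented for this example, and do not come from any particular corpus tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    """One newspaper article (hypothetical record layout)."""
    date: str    # publication date of the issue it appeared in
    author: str  # journalist's name
    body: str    # article text


# Invented sample data, for illustration only.
ARTICLES = [
    Article("2012-10-05", "Journalist A", "The budget was announced today."),
    Article("2012-10-05", "Journalist B", "The match ended in a draw."),
    Article("2012-10-06", "Journalist A", "Reactions to the budget were mixed."),
]


def texts_by_issue(articles):
    """Treat each daily issue as one text (aim: describe the newspaper)."""
    issues = defaultdict(list)
    for article in articles:
        issues[article.date].append(article.body)
    # Join all articles published on the same day into a single text.
    return {date: " ".join(bodies) for date, bodies in issues.items()}


def texts_by_author(articles, author):
    """Treat each article as one text (aim: describe one journalist)."""
    return [article.body for article in articles if article.author == author]


issue_texts = texts_by_issue(ARTICLES)
journalist_texts = texts_by_author(ARTICLES, "Journalist A")
```

Both results are built from the same three articles; only the unit chosen as the 'text' differs, and that choice is what the researcher needs to state and justify.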
Finally, in addition to the detailed analysis of the set of texts selected, corpus research is always enhanced by comparing the
selected dataset with other relevant text collections. In my earlier example, one can make additional statements about the
Guardian journalist by comparing his/her texts with the texts written by other Guardian journalists, with the Guardian texts
as a whole, with other newspapers, with other genres of data, and with large general corpora of the English language.
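One common way of making such a comparison concrete is a keyword (keyness) calculation: word frequencies in the selected texts are compared with those in a reference collection. The sketch below uses the log-likelihood statistic often used for this purpose in corpus linguistics. It is a minimal sketch under stated assumptions: the whitespace tokenizer is deliberately naive, and the function names are hypothetical rather than taken from any existing package.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()


def log_likelihood(a, b, c, d):
    """Log-likelihood for one word.

    a, b: frequency of the word in the target and reference corpus
    c, d: total number of tokens in the target and reference corpus
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    ll = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if a > 0:
        ll += a * math.log(a / expected_target)
    if b > 0:
        ll += b * math.log(b / expected_reference)
    return 2 * ll


def keyness(target_texts, reference_texts):
    """Rank the words of the target texts by log-likelihood, highest first."""
    target = Counter(tok for text in target_texts for tok in tokenize(text))
    reference = Counter(tok for text in reference_texts for tok in tokenize(text))
    c = sum(target.values())
    d = sum(reference.values())
    # The score shows how strongly a frequency differs, not in which direction.
    scores = {w: log_likelihood(target[w], reference[w], c, d) for w in target}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The target could be one journalist's articles and the reference the rest of the newspaper, another newspaper, or a general corpus; each choice of reference supports a different kind of statement.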
I hope this helps.
Best
Ramesh