[Corpora-List] What is corpora and what is not?

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Thu Oct 4 13:32:45 UTC 2012


#1 I still feel that my initial suggestion covers most of the issues that have been raised subsequently.

a) quantitative > computational TECHNIQUES. There are many examples of quantitative techniques being used long before computers, eg Cruden's Bible Concordance (1737)...

b) CONTENTS > STATEMENTS; the former constrains the scope and reliability of the latter.
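
To make point (a) concrete, here is a minimal sketch of a keyword-in-context concordance, the sort of listing that was compiled by hand long before computers. It is an illustration only: the function name `kwic`, the `width` parameter, the crude tokenizer and the sample sentence are all assumptions, and the output layout is the modern KWIC format, not the layout Cruden used.

```python
import re

def kwic(text, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side of each hit."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes only.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text, chosen only for illustration.
sample = "In the beginning was the Word, and the Word was with God."
for line in kwic(sample, "word"):
    print(line)
```

The technique is purely quantitative (find every occurrence, list its context); the computer only makes it faster, which is the sense of "quantitative > computational" above.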



However, to consider some of the points raised...



#2 DEFINITIONS OF CORPUS - [Geoffrey Williams]: "Are we not slightly reinventing the wheel? The nature of corpora has been discussed for years, EAGLES was about defining it. In 2005, John Sinclair enlarged upon the 1996 definition when he wrote: A corpus is a collection of pieces of language text in electronic format, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research." - I don't think this is a case of reinventing the wheel. All definitions are necessarily provisional, and the terms used and their referents need to be constantly re-examined. Definitions are useful as a springboard for discussion, and as indicators/clues. Sinclair's definition requires us to interpret several terms and concepts, eg do we take 'language or language variety' to include 'the language of Chapter 3 of a Dickens novel' or do we take a narrower view and see the definition as focussing on language description more broadly (eg English, Australian English, etc)? [Kevin B Cohen] Meyer's definitions: "a collection of texts or parts of texts upon which some general linguistic analysis can be conducted" hinges on interpretation of 'general'? "a body of text made available in computer-readable form for purposes of linguistic analysis" depends on whether we accept that sociologists, etc may use corpora for purposes other than linguistic analysis (ie even though linguistic analysis may be conducted, it is not the end-goal). McEnery & Wilson's definitions: "(i) (loosely) any body of text (ii) (most commonly) a body of machine-readable text (iii) (more strictly) a finite collection of machine-readable text, sampled to be maximally representative of a language or variety" - the strict definition adds other problems, eg 'finite collection' invalidates monitor corpora, etc. [Geoffrey Williams]: "Different disciplines may use the word corpus in different ways" - agreed. That's why I think this discussion is useful in highlighting what corpus linguists consider a corpus to be. [Mike Maxwell]: "BTW, I'm surprised no one here has resorted to the (English) corpora to decide the meaning of 'corpora.'" - technical terms are defined by the technical community? A general corpus would give us many non-technical versions. A corpus of corpus linguistics texts would produce a similar array of definitions to those being proposed or adduced in this discussion? Many of the contributors to the discussion would be authors of the texts in the corpus?



#3 CORPUS USERS - [Geoffrey Williams]: "Surely anyone involved in corpora has read the seminal works". I don't know whether Yuri has or not. But the fact that he has been (to my knowledge) a contributor to this list for some time, and still asks this question, suggests that the definition still needs clarification. And while I said that quantitative/corpus techniques could be applied to a chapter of a Dickens novel, I would not call that chapter a 'corpus'? There may also be many newcomers to the field who come to this list without substantial prior reading or experience?



#4 DATA MEDIUM/FORMAT - [Graham White]: "the reason that we use the Latin word corpus is that the Romans already had corpora, such as this one: http://en.wikipedia.org/wiki/Corpus_Juris_Civilis (which is just as good a corpus as anything machine-readable)." - texts have been written on clay tablets, palm-leaves, or papyrus, or printed on paper. The dominant medium today is electronic, and older texts can be digitized. The term 'corpus' has historical referents and modern referents. However, the term 'corpus linguistics' is a more recent coinage, and deals with the latter. Pre-electronic corpora can be digitized. I don't think any Roman linguists would have called themselves 'corpus linguists'?



#5 PUBLIC ACCESSIBILITY - [Ciarán Ó Duibhín]: "an important distinction...between corpora where the full text is included, and those where, normally for legal reasons, the full text is withheld. These latter typically consist of the text in encrypted form, together with a word-index, and some software to access the index, to retrieve and decrypt short segments of the text in response to queries". I'm not aware of the latter type, but accessibility or nature of access does not affect its description as a corpus? [Chris Brew]: "Since I have no idea what is on someone else's bookshelf, it doesn't serve my purposes, and I choose to punish it slightly by declining to call it a corpus." - Not innocent until proven guilty? But I agree that a description of the corpus would help to validate any analysis.

#6 PURPOSE - [Graham White]: "A corpus should possibly, also, be public and collected for some purpose: the books on my bookshelf aren't a corpus, for example, but if someone wanted to investigate them as an example of what a computer scientist read, then they would be." - I think you have answered your own doubts? Some corpora are premeditated and designed collections. But your book collection can be someone else's corpus: someone who wants to analyse them as indicative of Graham White's reading habits. To be representative of computer scientists, one would need to look at many more computer scientists' book collections? This may be what [Adam Kilgarriff] meant when he said "We call it a corpus when we view it as an object of linguistics or literary research". He might wish to extend his user group... or just say 'an object of research'? To reiterate: corpus contents affect the scope and reliability of the statements that can be made? [Jernej Vicic]: "Does it have to be an object of linguistics or literary research for a collection of text to be called a corpus? I would broaden the scope to any kind of research." - agreed! [Trevor Jenkins]: "a bunch of texts, that included some Dickens and Austen and Elliot (whether George or T S or the sisters) simply because the analyst likes them doesn't make up the result a corpus --- unless they are representative of some other usage, for example language variance in 19th century fiction over time" - To reiterate: corpus contents affect the scope and reliability of the statements that can be made? If the analyst's tastes in reading are the goal of research, that 'bunch of texts' becomes an acceptable corpus? [Rich Cooper]: "But the restriction to only linguistics purposes is no longer representative of usage...It's gotten a lot less simple than just linguistics" - agreed.



#7 REPRESENTATIVENESS - [Trevor Jenkins]: "I have serious reservations about corpora compilation under that regime because it can result in corpora containing only high genre texts such as Dickens novels, rather than "English as she is spoke"" - agreed. When trying to describe a language (English, Swedish, etc), 'corpus contents' becomes a highly problematic issue. The problem is that the 'population' is so vast, and the 'sample' is proportionally so small, that representativity is extremely weak. [Chris Brew]: "I do like the Brown Corpus, which is defined to be representative of 15 broad categories of writing, all first published in 1961 and all by native speakers of American English...I don't know how to give a precise operational definition of what the boundaries of the 15 broad categories, and I am not quite sure what "first published" or "native speaker of American English" would mean in practice... However, in this case, it really is the thought that counts. By articulating the principles that guided the creation of the corpus, Kucera and Francis opened the way to the creation of comparable corpora for other languages and other years." - I'm afraid you undermined your own case by referring to the categorial problems. An even bigger question is: to what extent are these 15 categories truly "representative" of writing as a whole? Basing corpora for other languages and other years on a problematic/deficient model is surely not a validation of this model? And comparability seems very weak if we compare 2012 with 1961, if only because of the internet? ...So we seem to come down to 'you like it'? [Trevor Jenkins]: "I don't believe that scripted and rehearsed productions represent spoken language very well." - agreed. [Piotr Pezik]: "I think the American Soap Operas Corpus, although a very valuable resource in its own right, represents written-to-be-spoken rather than spoken language" - agreed, on both counts!



#8 SUBCORPORA - [Rich Cooper]: "given a buncha texts from these sources, the issue is more about how to compartmentalize the samples by whatever criteria you want to study, partition them into equivalence classes by the criteria you annotate them to render" - I think I agree, but I think a lot of work is being done on identifying web genres? While I think that a corpus should be assembled on external (ie non-linguistic) criteria, would it be acceptable to apply linguistic criteria for dividing the texts into subcorpora, obviously after stating that this is how the subdivisions were created?
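
As a rough sketch of the partitioning described in #8, assuming each text already carries annotated metadata: texts that share a value for the chosen criterion fall into the same equivalence class, ie the same subcorpus. The function name `partition`, the field names (`id`, `genre`, `year`) and the sample records are hypothetical, invented only for illustration.

```python
from collections import defaultdict

def partition(texts, key):
    """Group text ids into equivalence classes: same key value, same subcorpus."""
    classes = defaultdict(list)
    for t in texts:
        classes[key(t)].append(t["id"])
    return dict(classes)

# Hypothetical records; the metadata fields are assumptions, not a standard.
texts = [
    {"id": "t1", "genre": "blog", "year": 2011},
    {"id": "t2", "genre": "news", "year": 2012},
    {"id": "t3", "genre": "blog", "year": 2012},
]

by_genre = partition(texts, lambda t: t["genre"])
by_year = partition(texts, lambda t: t["year"])
print(by_genre)
print(by_year)
```

The same collection yields different subcorpora depending on the criterion passed as `key`, which is why the criterion used for the subdivision would need to be stated alongside any analysis.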



Hope this helps.

Best

Ramesh



