[Corpora-List] What is corpora and what is not?

Trevor Jenkins trevor.jenkins at suneidesis.com
Thu Oct 4 18:47:01 UTC 2012


On 4 Oct 2012, at 14:32, "Krishnamurthy, Ramesh" <r.krishnamurthy at aston.ac.uk> wrote:

Some really weird formatting in your message makes it hard to follow the chain of attributions and commenters within the various sections. I'll try to address your comments on my earlier ones.

> [Trevor Jenkins]: "a bunch of texts, that included some Dickens and Austen and Elliot (whether George or T S or the sisters) simply because the analyst likes them doesn't make up the result a corpus --- unless they are representative of some other usage, for example language variance in 19th century fiction over time" - To reiterate: corpus contents affect the scope and reliability of the statements that can be made? If the analyst's tastes in reading are the goal of research, that 'bunch of texts' becomes an acceptable corpus?

An individual's collection of liked (and also unliked) texts surely is a library, not a corpus. Once that individual is famous, other people might subsequently be interested in knowing what books that person owned (and here I distinguish between owning and reading -- they are not the same thing even if marketeers try to convince us otherwise), whether physical copies aka books or electronic copies aka ebooks. History might eventually upgrade the library to a corpus.

Some readers may try to obfuscate what they own/read by putting texts on an e-reader. For example, I would not read E L James' recent trilogy in a print edition -- not because I'm embarrassed by the subject matter; rather, I refuse to be a secondary advertiser for such poor writing -- but I do have them, along with earlier more notorious examples (Jacobean, on through to Victorian, and eventually 20th century) drawn from the same genre, on my iPad and hence on my desktop machine; storing them on the same device doesn't make them a corpus. They would be considered a corpus only in the context of a study looking at the linguistic variation in that genre over time.

> #7 REPRESENTATIVENESS - [Trevor Jenkins]: "I have serious reservations about corpora compilation under that regime because it can result in corpora containing only high genre texts such as Dickens novels, rather than "English as she is spoke"" - agreed. When trying to describe a language (English, Swedish, etc), 'corpus contents' becomes a highly problematic issue. The problem is that the 'population' is so vast, and the 'sample' is proportionally so small, that representativity is extremely weak.

It is indeed. The sample of *real* use is even smaller and its representation not just weak but flat out comatose.

In the end all our definitions are subordinate to the question "what is going to be done with the texts?" It is the answer to that which moves them from library to corpus.

Regards, Trevor.

<>< Re: deemed!

