[Corpora-List] What is corpora and what is not?

Wed Oct 3 18:44:07 UTC 2012

Trevor - thanks for your kind words about Atkins & Rundell OGPL.

But in fact the quote you mention is from John Sinclair not us (it's the 
same quote Geoffrey Williams gave, earlier in this thread). Sue Atkins and 
I refer here to what Sinclair said, but then add 'this is not without its 
problems', specifically casting doubt on the proposition that a corpus can 
be 'representative'. On the whole I prefer Adam's simpler definition (in the 
first response to the original question).

Michael Rundell

----- Original Message ----- 
From: "Trevor Jenkins" <trevor.jenkins at suneidesis.com>
To: <corpora at uib.no>
Sent: Wednesday, October 03, 2012 7:21 PM
Subject: Re: [Corpora-List] What is corpora and what is not?

On 3 Oct 2012, at 18:56, Graham White <graham at eecs.qmul.ac.uk> wrote:

> So the Corpus Iuris Civilis is not a corpus? …

In the same way that Stonehenge is not technically a henge … despite its 
name being the origin of the word.

A corpus is usually compiled with some purpose in mind, so the example 
someone used earlier of the novels of Charles Dickens would constitute a 
corpus if one were analysing his fiction. A more formal definition of corpus 
that I use is quoted in Atkins and Rundell's "Oxford Guide to Practical 
Lexicography" (p54), viz "a corpus is a collection of pieces of language 
text in electronic form, selected according to external criteria to 
represent, as far as possible, a language or language variety as a source of 
data for linguistic research."  Indeed the original questioner might do well 
to read chapter 3 of that book in its entirety.

However, inclusion of the Dickens Journal Online material at the same time 
as the novels might stop the dataset being considered a formal corpus. Or if 
one had a bunch of texts that included some Dickens and Austen and Eliot 
(whether George or T S or the sisters) simply because the analyst likes them, 
that doesn't make the result a corpus --- unless they are representative of 
some other usage, for example language variance in 19th century fiction over 
time.

Worse would be a collection of texts in different languages just because the 
analyst likes to read them --- unless it is the same text in those different 
languages and the purpose is to analyse the translation process.

So to the original questioner, what is your purpose in wanting a corpus? 
What are your criteria for texts being included? What analysis are you 
likely to apply to those texts?

Regards, Trevor.

<>< Re: deemed!

_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora 
