[Corpora-List] What is corpora and what is not?
Michael Rundell
michael.rundell at lexmasterclass.com
Wed Oct 3 18:44:07 UTC 2012
Trevor - thanks for your kind words about Atkins & Rundell OGPL.
But in fact the quote you mention is from John Sinclair not us (it's the
same quote Geoffrey Williams gave, earlier in this thread). Sue Atkins and
I refer here to what Sinclair said, but then add 'this is not without its
problems', specifically casting doubt on the proposition that a corpus can
be 'representative'. On the whole I prefer Adam's simpler definition (in the
first response to the original question).
Michael Rundell
----- Original Message -----
From: "Trevor Jenkins" <trevor.jenkins at suneidesis.com>
To: <corpora at uib.no>
Sent: Wednesday, October 03, 2012 7:21 PM
Subject: Re: [Corpora-List] What is corpora and what is not?
On 3 Oct 2012, at 18:56, Graham White <graham at eecs.qmul.ac.uk> wrote:
> So the Corpus Iuris Civilis is not a corpus? …
In the same way that Stonehenge is not technically a henge … despite its
name being the origin of the word.
A corpus is usually compiled with some purpose in mind, so the example
someone used earlier of the novels of Charles Dickens would constitute a
corpus if one were analysing his fiction. A more formal definition of corpus
that I use is quoted in Atkins and Rundell's "Oxford Guide to Practical
Lexicography" (p54), viz "a corpus is a collection of pieces of language
text in electronic form, selected according to external criteria to
represent, as far as possible, a language or language variety as a source of
data for linguistic research." Indeed the original questioner might do well
to read chapter 3 of that book in its entirety.
However, inclusion of the Dickens Journal Online material at the same time
as the novels might stop the dataset being considered a formal corpus. Or if
one had a bunch of texts that included some Dickens and Austen and Eliot
(whether George or T S or the sisters) simply because the analyst likes them,
that doesn't make the result a corpus --- unless they are representative of
some other usage, for example language variance in 19th century fiction over
time.
Worse would be a collection of texts in different languages just because the
analyst likes to read them --- unless it is the same text in those different
languages and the purpose is to analyse the translation process.
So to the original questioner, what is your purpose in wanting a corpus?
What are your criteria for texts being included? What analysis are you
likely to apply to those texts?
Regards, Trevor.
<>< Re: deemed!
_______________________________________________
UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora