[Corpora-List] What is corpora and what is not?

Krishnamurthy, Ramesh r.krishnamurthy at aston.ac.uk
Wed Oct 3 15:12:10 UTC 2012


Hi Yuri

I agree broadly with Adam.

I would add a couple of points for clarification:

(i) Some corpus *techniques* (eg word frequency lists, collocation) may be applied to any piece of text, eg to a single chapter in a novel by Dickens.

(ii) The contents of a corpus determine the scope and nature of the statements one can make, and the degree of confidence with which we can make them: eg a single chapter or even a single novel would only allow us to make limited statements/suggestions, with a lower degree of confidence; a complete collection of his novels would allow us to make more general statements about Dickens' novelistic style, with greater confidence, and we could for example compare the novels and discover developments in his novelistic style from the first novel to the last, etc.
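
As a rough illustration of point (i), the two techniques mentioned there can be run on any short stretch of text. The sketch below is only an assumption about how one might do it: the function names (tokenize, frequency_list, collocates), the crude tokenizer and the window size of 2 are all illustrative choices, not a standard method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + window + 1]
            counts.update(left + right)
    return counts


# A single sentence stands in for "any piece of text".
sample = "It was the best of times, it was the worst of times."
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
print(collocates(tokens, "times").most_common(3))
```

The same functions would run unchanged on a chapter, a novel or the complete novels; as point (ii) says, what changes is how far the resulting counts can be generalised.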



Kevin's comment about machine-readable reflects the age we live in, and the technology now available to many.

I'm not sure about his distinction between 'document collection' and corpus, or what kind of annotation he means.

For me, a corpus can be unannotated or annotated (eg with metadata about each text in the corpus, or POS-tags, semantic tags, pragmatic tags, discourse tags, etc).
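
To make the unannotated/annotated contrast concrete, here is a minimal sketch of one possible in-memory representation. The class names (Token, Text), the is_annotated helper and the Penn Treebank-style tags are assumptions chosen for illustration; real corpora use many different formats.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    form: str
    pos: Optional[str] = None  # None means no POS annotation on this token


@dataclass
class Text:
    tokens: List[Token]
    # Text-level metadata, eg author or title; empty if none was recorded.
    metadata: Dict[str, str] = field(default_factory=dict)


def is_annotated(corpus: List[Text]) -> bool:
    """True if any text carries metadata or any token carries a POS tag."""
    return any(
        bool(text.metadata) or any(tok.pos for tok in text.tokens)
        for text in corpus
    )


# The same three words, once as bare text and once with annotation.
plain = [Text(tokens=[Token("Marley"), Token("was"), Token("dead")])]
tagged = [
    Text(
        tokens=[Token("Marley", "NNP"), Token("was", "VBD"), Token("dead", "JJ")],
        metadata={"author": "Charles Dickens", "title": "A Christmas Carol"},
    )
]

print(is_annotated(plain), is_annotated(tagged))
```

On this view both objects are corpora; the annotation is an optional layer on top of the texts.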



best

Ramesh

-----------------------------------------------------------------------------------

Date: Tue, 2 Oct 2012 19:21:21 +0700
From: "Yuri Tambovtsev" <yutamb at mail.ru>
Subject: [Corpora-List] What is corpora and what is not?
To: <corpora at uib.no>

Dear corpora members, I do not understand what a corpus is and what it is not. Is the set of texts of the books by Charles Dickens a Dickens corpus? What about the books of Ernest Hemingway and other writers? Looking forward to hearing your opinions at yutamb at mail.ru. Yours sincerely, Yuri Tambovtsev, Novosibirsk, Russia

------------------------------------------------------------------------------------

Date: Tue, 2 Oct 2012 15:11:11 +0100
From: Adam Kilgarriff <adam at lexmasterclass.com>
Subject: Re: [Corpora-List] What is corpora and what is not?
To: Yuri Tambovtsev <yutamb at mail.ru>
Cc: corpora at uib.no

Yuri,

a corpus is a collection of texts/speech. We call it a corpus when we view
it as an object of linguistics or literary research. The answers to your
questions are yes and yes.

Adam

========================================
Adam Kilgarriff <http://www.kilgarriff.co.uk/>
adam at lexmasterclass.com
Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>

*Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>

*DANTE: a lexical database for English* <http://www.webdante.com>

----------------------------------------------------------------------------

Date: Tue, 2 Oct 2012 08:59:21 -0600
From: "Kevin B. Cohen" <kevin.cohen at gmail.com>
Subject: Re: [Corpora-List] What is corpora and what is not?
To: Yuri Tambovtsev <yutamb at mail.ru>
Cc: corpora at uib.no

Hi, Yuri,

Different people have differing definitions of what constitutes a
corpus. Here are a couple of them:

Meyer:

"a collection of texts or parts of texts upon which some general
linguistic analysis can be conducted"
"a body of text made available in computer-readable form for purposes
of linguistic analysis"

McEnery and Wilson:

(i) (loosely) any body of text
(ii) (most commonly) a body of machine-readable text
(iii) (more strictly) a finite collection of machine-readable text,
sampled to be maximally representative of a language or variety

You'll notice that a common element of the definitions is the notion
of machine-readability.

Some people distinguish between a "document collection" and a corpus.
In this case, the difference is that a corpus has some sort of
annotations, while a document collection is a set of unannotated
documents. Sorry I don't have a citation for this.

Kev

--
Kevin Bretonnel Cohen, PhD
Biomedical Text Mining Group Lead, Computational Bioscience Program,
U. Colorado School of Medicine
303-916-2417 (cell) 303-377-9194 (home)
http://compbio.ucdenver.edu/Hunter_lab/Cohen




