[Corpora-List] What is corpora and what is not?

Graham White graham at eecs.qmul.ac.uk
Wed Oct 3 16:02:58 UTC 2012


I quite agree about machine-readability: the reason that we use the 
Latin word corpus is that the Romans already had corpora, such as this 
one: http://en.wikipedia.org/wiki/Corpus_Juris_Civilis
(which is just as good a corpus as anything machine-readable).

A corpus should possibly also be public and collected for some 
purpose: the books on my bookshelf aren't a corpus, for example, but if 
someone wanted to investigate them as an example of what a computer 
scientist reads, then they would be. But it's a hard criterion to formulate.

Graham

On 03/10/12 16:12, Krishnamurthy, Ramesh wrote:
> Hi Yuri
>
> I agree broadly with Adam.
>
> I would add a couple of points for clarification:
>
> (i) Some corpus *techniques* (eg word frequency lists, collocation) may be
> applied to any piece of text, eg to a single chapter in a novel by Dickens.
>
> (ii) The contents of a corpus determine the scope and nature of the
> statements one can make, and the degree of confidence with which we can make
> them: eg a single chapter or even a single novel would only allow us to make
> limited statements/suggestions, with a lower degree of confidence; a complete
> collection of his novels would allow us to make more general statements about
> Dickens' novelistic style, with greater confidence, and we could for example
> compare the novels and discover developments in his novelistic style from the
> first novel to the last, etc.
>
> Kevin's comment about machine-readable reflects the age we live in, and the
> technology now available to many.
> I'm not sure about his distinction between 'document collection' and corpus,
> or what kind of annotation he means.
> For me, a corpus can be unannotated or annotated (eg with metadata about each
> text in the corpus, or POS-tags, semantic tags, pragmatic tags, discourse
> tags, etc).
>
> best
>
> Ramesh
>
> -----------------------------------------------------------------------------------
>
> Date: Tue, 2 Oct 2012 19:21:21 +0700
> From: "Yuri Tambovtsev" <yutamb at mail.ru>
> Subject: [Corpora-List] What is corpora and what is not?
> To: <corpora at uib.no>
>
> Dear corpora members, I do not understand what a corpus is and what it is
> not. Is the set of texts of the books by Charles Dickens a Dickens corpus?
> What about the books of Ernest Hemingway and other writers? Looking forward
> to hearing your opinion at yutamb at mail.ru
> Yours sincerely, Yuri Tambovtsev, Novosibirsk, Russia
>
> ------------------------------------------------------------------------------------
>
> Date: Tue, 2 Oct 2012 15:11:11 +0100
> From: Adam Kilgarriff <adam at lexmasterclass.com>
> Subject: Re: [Corpora-List] What is corpora and what is not?
> To: Yuri Tambovtsev <yutamb at mail.ru>
> Cc: corpora at uib.no
>
> Yuri,
>
> a corpus is a collection of texts/speech. We call it a corpus when we view
> it as an object of linguistic or literary research. The answers to your
> questions are yes and yes.
>
> Adam
>
> ========================================
> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
> adam at lexmasterclass.com
> Director Lexical Computing Ltd <http://www.sketchengine.co.uk/>
>
> Visiting Research Fellow University of Leeds <http://leeds.ac.uk>
>
> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
>
> *DANTE: a lexical database for English* <http://www.webdante.com>
>
> ----------------------------------------------------------------------------
>
> Date: Tue, 2 Oct 2012 08:59:21 -0600
> From: "Kevin B. Cohen" <kevin.cohen at gmail.com>
> Subject: Re: [Corpora-List] What is corpora and what is not?
> To: Yuri Tambovtsev <yutamb at mail.ru>
> Cc: corpora at uib.no
>
> Hi, Yuri,
>
> Different people have differing definitions of what constitutes a
> corpus. Here are a couple of them:
>
> Meyer:
>
> "a collection of texts or parts of texts upon which some general
> linguistic analysis can be conducted"
> "a body of text made available in computer-readable form for purposes
> of linguistic analysis"
>
> McEnery and Wilson:
> (i) (loosely) any body of text
> (ii) (most commonly) a body of machine-readable text
> (iii) (more strictly) a finite collection of machine-readable text,
> sampled to be maximally representative of a language or variety
>
> You'll notice that a common element of the definitions is the notion
> of machine-readability.
>
> Some people distinguish between a "document collection" and a corpus.
> In this case, the difference is that a corpus has some sort of
> annotations, while a document collection is a set of unannotated
> documents. Sorry I don't have a citation for this.
>
> Kev
>
> --
> Kevin Bretonnel Cohen, PhD
> Biomedical Text Mining Group Lead, Computational Bioscience Program,
> U. Colorado School of Medicine
> 303-916-2417 (cell) 303-377-9194 (home)
> http://compbio.ucdenver.edu/Hunter_lab/Cohen
>
>
> _______________________________________________
> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
