[Corpora-List] What is corpora and what is not?

WILLIAMS Geoffrey williams at univ-ubs.fr
Wed Oct 3 16:48:46 UTC 2012


Are we not slightly reinventing the wheel?

The nature of corpora has been discussed for years; EAGLES was about 
defining it. In 2005, John Sinclair enlarged upon the 1996 definition 
when he wrote:

> A corpus is a collection of pieces of language text in electronic 
> format, selected according to external criteria to represent, as far 
> as possible, a language or language variety as a source of data for 
> linguistic research.
Sinclair, J. McH. 2005. ‘Corpus and Text: Basic Principles’. In Wynne, 
M. (ed.), Developing Linguistic Corpora: A Guide to Good Practice, 
pp. 1-16. Oxford: AHDS.

It is also on the web!

Surely anyone involved in corpora has read the seminal works and does 
not need reminding that corpora are machine-readable, may be samples or 
whole works, etc. What has changed is the rise of internet corpora, but 
here too Kilgarriff and others have commented on the situation in a way 
that both NLP and corpus linguistics users can feel at home with.

Best

Geoffrey


On 03/10/2012 18:02, Graham White wrote:
> I quite agree about machine-readability: the reason that we use the 
> Latin word corpus is that the Romans already had corpora, such as this 
> one: http://en.wikipedia.org/wiki/Corpus_Juris_Civilis
> (which is just as good a corpus as anything machine-readable).
>
> A corpus should possibly also be public and collected for some 
> purpose: the books on my bookshelf aren't a corpus, for example, but 
> if someone wanted to investigate them as an example of what a computer 
> scientist reads, then they would be. But it's a hard criterion to 
> formulate.
>
> Graham
>
> On 03/10/12 16:12, Krishnamurthy, Ramesh wrote:
>> Hi Yuri
>>
>>
>>
>> I agree broadly with Adam.
>>
>>
>>
>> I would add a couple of points for clarification:
>>
>> (i) Some corpus *techniques* (eg word frequency lists, collocation)
>> may be applied to any piece of text, eg to a single chapter in a
>> novel by Dickens.
>>
>> (ii) The contents of a corpus determine the scope and nature of the
>> statements one can make, and the degree of confidence with which we
>> can make them: eg a single chapter or even a single novel would only
>> allow us to make limited statements/suggestions, with a lower degree
>> of confidence; a complete collection of his novels would allow us to
>> make more general statements about Dickens' novelistic style, with
>> greater confidence, and we could for example compare the novels and
>> discover developments in his novelistic style from the first novel
>> to the last, etc.
>>
>>
>>
>> Kevin's comment about machine-readable reflects the age we live in,
>> and the technology now available to many.
>>
>> I'm not sure about his distinction between 'document collection' and
>> corpus, or what kind of annotation he means.
>>
>> For me, a corpus can be unannotated or annotated (eg with metadata
>> about each text in the corpus, or POS-tags, semantic tags, pragmatic
>> tags, discourse tags, etc).
>>
>>
>>
>> best
>>
>> Ramesh
>>
>> ----------------------------------------------------------------------------------- 
>>
>>
>> Date: Tue, 2 Oct 2012 19:21:21 +0700
>> From: "Yuri Tambovtsev" <yutamb at mail.ru>
>> Subject: [Corpora-List] What is corpora and what is not?
>> To: <corpora at uib.no>
>>
>> Dear corpora members, I do not understand what a corpus is and what 
>> a corpus is not. Is the set of texts of the books by Charles Dickens a 
>> Dickens corpus? What about the books of Ernest Hemingway and other 
>> writers? Looking forward to hearing your opinion at yutamb at mail.ru 
>> Yours sincerely, Yuri Tambovtsev, Novosibirsk, Russia
>>
>> ------------------------------------------------------------------------------------ 
>>
>>
>> Date: Tue, 2 Oct 2012 15:11:11 +0100
>> From: Adam Kilgarriff <adam at lexmasterclass.com>
>> Subject: Re: [Corpora-List] What is corpora and what is not?
>> To: Yuri Tambovtsev <yutamb at mail.ru>
>> Cc: corpora at uib.no
>>
>> Yuri,
>>
>> a corpus is a collection of texts/speech. We call it a corpus when we
>> view it as an object of linguistic or literary research. The answers
>> to your questions are yes and yes.
>>
>> Adam
>>
>> ========================================
>> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
>> adam at lexmasterclass.com
>> Director Lexical Computing
>> Ltd<http://www.sketchengine.co.uk/>
>>
>> Visiting Research Fellow University of
>> Leeds<http://leeds.ac.uk>
>>
>> *Corpora for all* with the Sketch Engine <http://www.sketchengine.co.uk>
>>
>> *DANTE: a lexical database for
>> English<http://www.webdante.com>
>>
>> ---------------------------------------------------------------------------- 
>>
>>
>> Date: Tue, 2 Oct 2012 08:59:21 -0600
>> From: "Kevin B. Cohen" <kevin.cohen at gmail.com>
>> Subject: Re: [Corpora-List] What is corpora and what is not?
>> To: Yuri Tambovtsev <yutamb at mail.ru>
>> Cc: corpora at uib.no
>>
>> Hi, Yuri,
>>
>> Different people have differing definitions of what constitutes a
>> corpus. Here are a couple of them:
>>
>> Meyer:
>>
>> "a collection of texts or parts of texts upon which some general
>> linguistic analysis can be conducted"
>> "a body of text made available in computer-readable form for purposes
>> of linguistic analysis"
>>
>> McEnery and Wilson:
>>
>> (i) (loosely) any body of text
>> (ii) (most commonly) a body of machine-readable text
>> (iii) (more strictly) a finite collection of machine-readable text,
>> sampled to be maximally representative of a language or variety
>>
>> You'll notice that a common element of the definitions is the notion
>> of machine-readability.
>>
>> Some people distinguish between a "document collection" and a corpus.
>> In this case, the difference is that a corpus has some sort of
>> annotations, while a document collection is a set of unannotated
>> documents. Sorry I don't have a citation for this.
>>
>> Kev
>>
>> -- 
>> Kevin Bretonnel Cohen, PhD
>> Biomedical Text Mining Group Lead, Computational Bioscience Program,
>> U. Colorado School of Medicine
>> 303-916-2417 (cell) 303-377-9194 (home)
>> http://compbio.ucdenver.edu/Hunter_lab/Cohen
>>
>>
>> _______________________________________________
>> UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>

-- 

Professor Geoffrey WILLIAMS, MSc, PhD
Director of Department for Document Management, Directeur du Département 
d'Ingénierie du document
LiCoRN - HCTI
------------------------------------------------------------------------
geoffrey.williams at univ-ubs.fr
tél. +33 (0)2 97 87 29 20 - fax. +33 (0)2 97 87 29 31
Faculté de Lettres Langues Sciences Humaines
et Sociales (LSHS)
4 rue Jean Zay
BP92113, 56321 LORIENT CEDEX
UNIVERSITÉ DE BRETAGNE-SUD
www.univ-ubs.fr / www.licorn.com

------------------------------------------------------------------------

New Book: European Identity: What the media say. Paul Bayley and 
Geoffrey Williams (eds). Oxford: OUP
http://ukcatalogue.oup.com/product/9780199602308.do



