<html>
<head>
<meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Are we not slightly reinventing the wheel?<br>
<br>
The nature of corpora has been discussed for years; EAGLES was about
defining it. In 2005, John Sinclair enlarged upon the 1996
definition when he wrote:<br>
<br>
<blockquote type="cite">A corpus is a collection of pieces of
language text in electronic format, selected according to external
criteria to represent, as far as possible, a language or language
variety as a source of data for linguistic research.</blockquote>
Sinclair, J. McH. 2005. ‘Corpus and Text: Basic Principles’. In
Wynne, M. (ed.). 2005. Developing Linguistic Corpora: A Guide to
Good Practice, pp. 1-16. Oxford: AHDS.<br>
<br>
It is also on the web!<br>
<br>
Surely anyone involved in corpora has read the seminal works and
does not need reminding that corpora are machine-readable and may
consist of samples or whole works, etc. What has changed is the rise
of internet corpora, but here too Kilgarriff and others have
commented on the situation in a way that both NLP and corpus
linguistics users can feel at home with.<br>
<br>
Best<br>
<br>
Geoffrey<br>
<br>
B<br>
<br>
<br>
<div class="moz-cite-prefix">On 03/10/2012 18:02, Graham White
wrote:<br>
</div>
<blockquote cite="mid:506C61B2.9040701@eecs.qmul.ac.uk" type="cite">I
quite agree about machine-readability: the reason that we use the
Latin word corpus is that the Romans already had corpora, such as
this one: <a class="moz-txt-link-freetext" href="http://en.wikipedia.org/wiki/Corpus_Juris_Civilis">http://en.wikipedia.org/wiki/Corpus_Juris_Civilis</a>
<br>
(which is just as good a corpus as anything machine-readable).
<br>
<br>
A corpus should possibly also be public and collected for some
purpose: the books on my bookshelf aren't a corpus, for example,
but if someone wanted to investigate them as an example of what a
computer scientist reads, then they would be. But it's a hard
criterion to formulate.
<br>
<br>
Graham
<br>
<br>
On 03/10/12 16:12, Krishnamurthy, Ramesh wrote:
<br>
<blockquote type="cite">Hi Yuri
<br>
<br>
<br>
<br>
I agree broadly with Adam.
<br>
<br>
<br>
<br>
I would add a couple of points for clarification:
<br>
<br>
(i) Some corpus *techniques* (eg word frequency lists,
collocation) may be applied to any piece of text,
eg to a single chapter in a novel by Dickens.
<br>
<br>
(ii) The contents of a corpus determine the scope and nature of
the statements one can make, and the degree
of confidence with which we can make them: eg a single chapter
or even a single novel would only allow us to make
limited statements/suggestions, with a lower degree of
confidence; a complete collection of his novels would allow
us to make more general statements about Dickens' novelistic
style, with greater confidence, and we could for example
compare the novels and discover developments in his novelistic
style from the first novel to the last, etc.
<br>
<br>
<br>
<br>
Kevin's comment about machine-readability reflects the age we live
in, and the technology now available to many.
<br>
<br>
I'm not sure about his distinction between 'document collection'
and corpus, or what kind of annotation he means.
<br>
<br>
For me, a corpus can be unannotated or annotated (eg with
metadata about each text in the corpus, or POS-tags,
semantic tags, pragmatic tags, discourse tags, etc).
<br>
<br>
<br>
<br>
best
<br>
<br>
Ramesh
<br>
<br>
-----------------------------------------------------------------------------------
<br>
<br>
Date: Tue, 2 Oct 2012 19:21:21 +0700
<br>
From: "Yuri Tambovtsev" <a class="moz-txt-link-rfc2396E" href="mailto:yutamb@mail.ru">&lt;yutamb@mail.ru&gt;</a>
<br>
Subject: [Corpora-List] What is corpora and what is not?
<br>
To: <a class="moz-txt-link-rfc2396E" href="mailto:corpora@uib.no">&lt;corpora@uib.no&gt;</a>
<br>
<br>
Dear corpora members, I do not understand what a corpus is and
what a corpus is not. Is the set of texts of the books by Charles
Dickens a Dickens corpus? What about the books of Ernest
Hemingway and other writers? I look forward to hearing your
opinion at <a class="moz-txt-link-abbreviated" href="mailto:yutamb@mail.ru">yutamb@mail.ru</a>. Yours sincerely, Yuri Tambovtsev,
Novosibirsk, Russia
<br>
<br>
------------------------------------------------------------------------------------
<br>
<br>
Date: Tue, 2 Oct 2012 15:11:11 +0100
<br>
From: Adam Kilgarriff <a class="moz-txt-link-rfc2396E" href="mailto:adam@lexmasterclass.com">&lt;adam@lexmasterclass.com&gt;</a>
<br>
Subject: Re: [Corpora-List] What is corpora and what is not?
<br>
To: Yuri Tambovtsev <a class="moz-txt-link-rfc2396E" href="mailto:yutamb@mail.ru">&lt;yutamb@mail.ru&gt;</a>
<br>
Cc: <a class="moz-txt-link-abbreviated" href="mailto:corpora@uib.no">corpora@uib.no</a>
<br>
<br>
Yuri,
<br>
<br>
A corpus is a collection of texts/speech. We call it a corpus
when we view it as an object of linguistic or literary research.
The answers to your questions are yes and yes.
<br>
<br>
Adam
<br>
<br>
========================================
<br>
Adam Kilgarriff <a class="moz-txt-link-rfc2396E" href="http://www.kilgarriff.co.uk/">&lt;http://www.kilgarriff.co.uk/&gt;</a>
<br>
<a class="moz-txt-link-abbreviated" href="mailto:adam@lexmasterclass.com">adam@lexmasterclass.com</a>
<br>
Director Lexical Computing
Ltd <a class="moz-txt-link-rfc2396E" href="http://www.sketchengine.co.uk/">&lt;http://www.sketchengine.co.uk/&gt;</a>
<br>
<br>
Visiting Research Fellow University of
Leeds <a class="moz-txt-link-rfc2396E" href="http://leeds.ac.uk">&lt;http://leeds.ac.uk&gt;</a>
<br>
<br>
*Corpora for all* with the Sketch Engine
<a class="moz-txt-link-rfc2396E" href="http://www.sketchengine.co.uk">&lt;http://www.sketchengine.co.uk&gt;</a>
<br>
<br>
*DANTE: a lexical database for
English <a class="moz-txt-link-rfc2396E" href="http://www.webdante.com">&lt;http://www.webdante.com&gt;</a>
<br>
<br>
----------------------------------------------------------------------------
<br>
<br>
Date: Tue, 2 Oct 2012 08:59:21 -0600
<br>
From: "Kevin B. Cohen" <a class="moz-txt-link-rfc2396E" href="mailto:kevin.cohen@gmail.com">&lt;kevin.cohen@gmail.com&gt;</a>
<br>
Subject: Re: [Corpora-List] What is corpora and what is not?
<br>
To: Yuri Tambovtsev <a class="moz-txt-link-rfc2396E" href="mailto:yutamb@mail.ru">&lt;yutamb@mail.ru&gt;</a>
<br>
Cc: <a class="moz-txt-link-abbreviated" href="mailto:corpora@uib.no">corpora@uib.no</a>
<br>
<br>
Hi, Yuri,
<br>
<br>
Different people have differing definitions of what constitutes
a corpus. Here are a couple of them:
<br>
<br>
Meyer:
<br>
<br>
"a collection of texts or parts of texts upon which some general
linguistic analysis can be conducted"
<br>
"a body of text made available in computer-readable form for
purposes of linguistic analysis"
<br>
<br>
McEnery and Wilson:
<br>
<br>
(i) (loosely) any body of text
<br>
(ii) (most commonly) a body of machine-readable text
<br>
(iii) (more strictly) a finite collection of machine-readable
text, sampled to be maximally representative of a language or variety
<br>
<br>
You'll notice that a common element of the definitions is the
notion of machine-readability.
<br>
<br>
Some people distinguish between a "document collection" and a
corpus. In this case, the difference is that a corpus has some sort of
annotations, while a document collection is a set of unannotated
documents. Sorry I don't have a citation for this.
<br>
<br>
Kev
<br>
<br>
--
<br>
Kevin Bretonnel Cohen, PhD
<br>
Biomedical Text Mining Group Lead, Computational Bioscience
Program,
<br>
U. Colorado School of Medicine
<br>
303-916-2417 (cell) 303-377-9194 (home)
<br>
<a class="moz-txt-link-freetext" href="http://compbio.ucdenver.edu/Hunter_lab/Cohen">http://compbio.ucdenver.edu/Hunter_lab/Cohen</a>
<br>
<br>
<br>
_______________________________________________
<br>
UNSUBSCRIBE from this page:
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/options/corpora">http://mailman.uib.no/options/corpora</a>
<br>
Corpora mailing list
<br>
<a class="moz-txt-link-abbreviated" href="mailto:Corpora@uib.no">Corpora@uib.no</a>
<br>
<a class="moz-txt-link-freetext" href="http://mailman.uib.no/listinfo/corpora">http://mailman.uib.no/listinfo/corpora</a>
<br>
<br>
</blockquote>
<br>
<br>
</blockquote>
<br>
<div class="moz-signature">-- <br>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<style type="text/css">
<!--
.Style1 {font-family: Arial, Helvetica, sans-serif}
.Style5 {font-size: 12px}
.Style6 {font-size: 14px}
-->
</style>
<p><span class="Style1"><span class="Style6"><strong><br>
Professor Geoffrey WILLIAMS. MSc, PhD
</strong><i><br>
Director of Department for Document Management, Directeur
du
Département d'Ingénierie du document<br>
LiCoRN - HCTI.
</i></span><br>
------------------------------------------------------------------------<br>
<span class="Style5">
<a class="moz-txt-link-abbreviated" href="mailto:geoffrey.williams@univ-ubs.fr">geoffrey.williams@univ-ubs.fr</a>
<br>
tél. +33 (0)2 97 87 29 20 - fax. +33 (0)2 97 87 29 31
<br>
Faculté de Lettres Langues Sciences Humaines
<br>
et Sociales (LSHS)
<br>
4 rue Jean Zay <br>
BP92113, 56321 LORIENT CEDEX<br>
UNIVERSITÉ DE BRETAGNE-SUD
<br>
<a class="moz-txt-link-abbreviated" href="http://www.univ-ubs.fr">www.univ-ubs.fr</a>
/ <a class="moz-txt-link-abbreviated" href="http://www.licorn.com">www.licorn.com</a><br>
</span></span></p>
<hr style="width: 100%; height: 2px;">
<p>New Book: European Identity: What the media say. Paul Bayley
and Geoffrey Williams (eds). Oxford: OUP<br>
<a href="http://ukcatalogue.oup.com/product/9780199602308.do">http://ukcatalogue.oup.com/product/9780199602308.do</a><br>
</p>
</div>
</body>
</html>