[Corpora-List] What is corpora and what is not?

Chris Brew christopher.brew at gmail.com
Wed Oct 3 22:53:24 UTC 2012


I love the second leg of Adam's definition, which has the lovely property
of defining a word by delineating (some of) the circumstances under which
we use it. Wittgenstein would approve.

It also crosses my mind that it is important that a corpus be a public
object. Ideally, we would want the corpus itself to be accessible to all.
But when that is not possible, we want the designers of the corpus to
provide and publish a precise, publicly accessible statement of the
defining characteristics of the corpus. Since I have no idea what is on
someone else's bookshelf, that collection doesn't serve my purposes, and I
choose to punish it slightly by declining to call it a corpus.

But I do like the Brown Corpus, which is defined to be representative of
15 broad categories of writing, all first published in 1961 and all by
native speakers of American English. And I also quite like the idea of the
SuperBrown Corpus, which is like the Brown Corpus, except that it now
contains ALL the stuff published in 1961 by native speakers of American
English and falling into one of the categories. I know I can't actually
have the SuperBrown Corpus: I don't know how to give a precise operational
definition of the boundaries of the 15 broad categories, I am not quite
sure what "first published" or "native speaker of American English" would
mean in practice, and I can't get hold of it all anyway, because some of it
has been irretrievably lost. However, in this case, it really is the
thought that counts. By articulating the principles that guided the
creation of the corpus, Kucera and Francis opened the way to the creation
of comparable corpora for other languages and other years. That is quite
something...

On Wed, Oct 3, 2012 at 12:53 PM, Jernej Vicic <jernej.vicic at upr.si> wrote:

> Dear Adam!
>
> Does it have to be an object of linguistics or literary research for a
> collection of text to be called a corpus? I would broaden the scope to any
> kind of research.
>
> Adam Kilgarriff wrote:
>
>> Yuri,
>>
>> a corpus is a collection of texts/speech.  We call it a corpus when we
>> view it as an object of linguistics or literary research.  The answers to
>> your questions are yes and yes.
>>
>> Adam
>>
>> On 2 October 2012 13:21, Yuri Tambovtsev <yutamb at mail.ru> wrote:
>>
>>     Dear corpora members, I do not understand what a corpus is and what
>>     it is not. Is the set of texts of the books by Charles Dickens a
>>     Dickens corpus? What about the books of Ernest Hemingway and other
>>     writers? Looking forward to hearing your opinion to yutamb at mail.ru
>>     Yours sincerely Yuri Tambovtsev,
>>
>>     Novosibirsk, Russia
>>
>>     _______________________________________________
>>     UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>>     Corpora mailing list
>>     Corpora at uib.no
>>     http://mailman.uib.no/listinfo/corpora
>>
>> --
>> ========================================
>> Adam Kilgarriff <http://www.kilgarriff.co.uk/>
>> adam at lexmasterclass.com
>> Director, Lexical Computing Ltd <http://www.sketchengine.co.uk/>
>> Visiting Research Fellow, University of Leeds <http://leeds.ac.uk>
>> /Corpora for all/ with the Sketch Engine <http://www.sketchengine.co.uk>
>> /DANTE: a lexical database for English/ <http://www.webdante.com>
>> ========================================



-- 
Chris Brew, Educational Testing Service