[Corpora-List] What is corpora and what is not?

Martin Mueller martin.mueller at mac.com
Mon Oct 8 19:45:01 UTC 2012


I agree with much of Patrick's analysis and add some comments of my own.
The word 'corpus' has had a quite special meaning in Linguistics, but that
meaning may be challenged by the many ways in which NLP routines are
moving into other disciplines.

It helps to remember that 'corpus' is just the Latin word for 'body.' The
minimal meaning of a 'corpus' is a collection of things that in some way
add up to more than the sum of their parts, if only in the minimal sense
that the parts can be seen in the context of some whole. That whole may be
an 'authored' identity. The Roman corpus iuris is like that, and so is the
Bible, even though its parts are the works of many hands. The corpus
vasorum (ancient vases) or corpus inscriptionum Latinarum (Roman
inscriptions) are not authored entities in that way but purposeful and
various aggregations by later scholars.

In Linguistics the standard meaning of 'corpus' seems to have been
something like 'a systematic aggregation of stretches of language, whether
written or spoken, for the special purpose of quantitatively based and
computationally assisted analysis'.  The Brown corpus is the classic
example. Such corpora are typically 'balanced' in careful ways, and they
are likely to consist of passages of roughly equal length chosen from
various sources. 

If you come from outside corpus linguistics, that definition of a corpus
is a quite intelligible restriction of the broader term to the needs of a
particular discipline at a particular time. But it shouldn't be thought of as
something canonical. If you are a literary scholar and want to practise
"corpus-wide" analysis of this or that, whether plays by Shakespeare's
contemporaries or 1001 English novels of the 19th century, you would
almost certainly not restrict your corpus to snatches of texts -- although
you might base your interpretation on individual passages (in the manner
of Auerbach's very pre-computational Mimesis).

So a 'corpus' is just a body of stuff organized according to some purpose.

On 10/8/12 10:27 AM, "Patrick Juola" <juola at mathcs.duq.edu> wrote:

>On Mon, Oct 8, 2012 at 11:07 AM, Laurence Anthony <anthony0122 at gmail.com>
>wrote:
>> On Mon, Oct 8, 2012 at 11:44 PM, Patrick Juola <juola at mathcs.duq.edu>
>>wrote:
>>> Actually, that's a pretty good illustration of why definitions are
>>> unimportant and why this whole discussion is rather silly.
>>
>> You say "definitions are unimportant" and this discussion is "silly".
>>
>> Hmm, many people have contributed. Are we all just being silly?
>
>Yes, bluntly.
>
>
>>
>>> There's a reason that scientists don't define the meanings of most of
>>> the broad terms they use.  It wastes time on unproductive inquiry.
>>
>> Can you give me an example of one of the "broad terms" that a
>> scientist (e.g. physicist) uses which is not defined?
>
>"Life."  (biology)  "Matter." (physics)  "Mind." (psychology)
>"Thought." (psychology, again)  "Illness."  (medicine)
>
>Even "sleep" is tricky to define, as any anaesthesiologist will tell
>you.    The question of exactly where and how a patient loses
>consciousness is, of course, key to this field of medicine -- but our
>simple idea of a thin bright definitional line between "sleep" and
>"waking" (or "conscious" and "unconscious") is tremendously
>oversimplified.   There are dozens of processes involved, many of
>which interact, not all of which are turned off at the same rate by
>the same process or drug.  Trying to make a definition stretch to
>cover all these phenomena is not just silly, but stupid.   Instead
>the practicing scientists focus on defining specialist vocabulary to
>describe the specific phenomena they're interested in, just as corpus
>linguists will talk about a "historical corpus" (which presumably is a
>corpus that focuses on historical variance, possibly at the expense of
>other aspects).
>
>_______________________________________________
>UNSUBSCRIBE from this page: http://mailman.uib.no/options/corpora
>Corpora mailing list
>Corpora at uib.no
>http://mailman.uib.no/listinfo/corpora


