Corpora: What is a corpus

Fri Jan 28 02:22:55 UTC 2000

I tend to agree at one level, but a corpus of proverbs is a possibility -
e.g. the Bible contains one, and the dictionaries and collections of
proverbs are corpora - not so different from the corpus of Shakespeare or
a corpus of religious or legal writings or telephone conversations or
parent-child speech, although the biblical proverbs usually take a more
extended form than our English ones (some of which come from the Bible
anyway).

But once you go below sentence level, you are bringing in the kind of
assumptions we aim to avoid in corpus work.  Even selection at 'sentence
level' is problematic, owing to the processes of context and elision,
stylistic freedom in punctuation, and the representation of clauses as
lists or as separate sentences, etc. e.g.

What time is it?  Three thirty!
I came, I saw, I conquered!
I came!  I saw!  I conquered!
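
To make the point concrete, here is a rough sketch in Python. The
splitting rule (break after '.', '!' or '?' followed by whitespace) and
the name naive_sentences are my own assumptions for illustration, not
anyone's actual segmenter. It shows that the number of 'sentences' found
in the examples above depends on punctuation style rather than on
content:

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace.

    A deliberately naive rule, assumed here only for illustration.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

examples = [
    "What time is it?  Three thirty!",
    "I came, I saw, I conquered!",
    "I came!  I saw!  I conquered!",
]

for text in examples:
    units = naive_sentences(text)
    # The same three clauses count as one unit or three, depending on
    # whether the writer chose commas or exclamation marks.
    print(len(units), units)
```

Under this rule the second and third examples yield one and three units
respectively, and the elliptical answer 'Three thirty!' is cut off from
the question that gives it meaning.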

Another tendency is for statistics about parsers to be based on sentences
restricted to fewer than X words, where X is typically around 20 and
usually less than the median sentence length of the corpus they are
extracted from.  Such practices should be deprecated except when filtering
is integral to a theory (e.g. of language acquisition - attending to only
certain types of utterance - but this doesn't alter the corpus).
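
As a rough illustration of what such a cap does, the Python sketch below
reports how much of a corpus survives the filter. The sentence lengths
are invented for the example, and the function name and the default
cutoff of 20 are assumptions of mine, not figures from any real parser
evaluation:

```python
from statistics import median

def length_filter_report(sentence_lengths, max_words=20):
    """Summarise the effect of keeping only sentences under max_words."""
    kept = [n for n in sentence_lengths if n < max_words]
    return {
        "median_length": median(sentence_lengths),
        "kept_fraction": len(kept) / len(sentence_lengths),
        "mean_length_all": sum(sentence_lengths) / len(sentence_lengths),
        # None when the filter removes every sentence.
        "mean_length_kept": sum(kept) / len(kept) if kept else None,
    }

# Hypothetical sentence lengths in words, invented for illustration.
lengths = [3, 8, 12, 18, 22, 25, 27, 31, 40, 54]
report = length_filter_report(lengths)
for name, value in report.items():
    print(name, value)
```

When the cutoff lies below the median, as it does here, more than half of
the sentences are discarded, so any accuracy figure computed on the
remainder describes the shorter sentences only and not the corpus.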

dP
-----Original Message-----
From: Susan Hays <susanh at naa.att.ne.jp>
To: CORPORA at hd.uib.no <CORPORA at hd.uib.no>
Date: Friday, January 28, 2000 9:15 AM
Subject: Corpora: What is a corpus

>Oliver has struck an important chord with my thinking. Many of the
>questions asked on this list request pre-filtered work. A corpus is a
>collection of texts, not a list of phrases, verb forms, or other
>fragments.
>
>One of the real joys of working with corpora is the excitement of finding
>something you weren't looking for. The more the input to the corpus is
>filtered by the preconceptions of the researchers, the less likelihood
>that these unexpected insights will arise. Of course, the nature of the
>storage medium necessitates that some filtering must occur, but it is
>important that these technical requirements are kept in mind when
>examining the corpora. Only by looking for things we aren't looking for
>will we gain deep insights into the nature of language.
>
>-Paul Hays (currently writing from a borrowed eddress)
>
>Oliver Mason wrote:
>
>> François Maniez writes:
>> >       I wondered whether anybody on the list knows about an online
>> >corpus available for download and consisting of English proverbs
>> >and/or set phrases. The objective is to turn the corpus into a data
>> >base that could
>> > [...]
>>
>> Andrew Harley replies:
>> > Instead of a corpus, you might want to consider using an existing
>> > dictionary which gives examples of idioms in context, e.g. the
>> > Cambridge International Dictionary of Idioms. This is available as
>> > SGML data for
>>
>> Sorry to appear pedantic, but how would a `corpus of proverbs' look
>> like?  I would think no such thing could exist, just like you couldn't
>> have a corpus of past tense sentences.  Instead, you have a corpus of,
>> say, written fiction, which you can use to compile a list/database of
>> proverbs, but that would not be a corpus, but a, erm, list or
>> database (or even a dictionary).
>>
>> My understanding of `corpus' is that it is some more or less
>> homogeneous collection of utterances, but not `filtered', ie if you
>> selected all sentences containing proverbs you would end up with a
>> list, not a (sub)corpus.
>>
>> Do other people think different/the same?
>>
>> Oliver
>>
>> --
>> //\\ computer officer | corpus research | department of english | school of  -
>> //\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
>> \\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
>> \\// mobile 07050 104504 | http://www.clg.bham.ac.uk | o.mason at bham.ac.uk\/  -