Corpora: What is a corpus

Susan Hays susanh at naa.att.ne.jp
Thu Jan 27 21:05:20 UTC 2000


Oliver has struck an important chord with my thinking. Many of the questions asked
on this list request pre-filtered work. A corpus is a collection of texts, not a
list of phrases, verb forms, or other fragments.

One of the real joys of working with corpora is the excitement of finding
something you weren't looking for. The more the input to the corpus is filtered by
the preconceptions of the researchers, the less likely it is that these unexpected
insights will arise. Of course, the nature of the storage medium means that some
filtering must occur, but it is important that these technical requirements are
kept in mind when examining the corpora. Only by looking for things we aren't
looking for will we gain deep insights into the nature of language.

-Paul Hays (currently writing from a borrowed eddress)

Oliver Mason wrote:

> François Maniez writes:
> > I wondered whether anybody on the list knows about an online corpus
> > available for download and consisting of English proverbs and/or set
> > phrases. The objective is to turn the corpus into a data base that could
> > [...]
>
> Andrew Harley replies:
> > Instead of a corpus, you might want to consider using an existing
> > dictionary which gives examples of idioms in context, e.g. the Cambridge
> > International Dictionary of Idioms. This is available as SGML data for
>
> Sorry to appear pedantic, but how would a `corpus of proverbs' look
> like?  I would think no such thing could exist, just like you couldn't
> have a corpus of past tense sentences.  Instead, you have a corpus of,
> say, written fiction, which you can use to compile a list/database of
> proverbs, but that would not be a corpus, but a, erm, list or
> database (or even a dictionary).
>
> My understanding of `corpus' is that it is some more or less
> homogeneous collection of utterances, but not `filtered', ie if you
> selected all sentences containing proverbs you would end up with a
> list, not a (sub)corpus.
>
> Do other people think different/the same?
>
> Oliver
>
> --
> //\\ computer officer | corpus research | department of english | school of  -
> //\\ humanities | university of birmingham | edgbaston | birmingham b15 2tt  -
> \\// united kingdom | phone +44-(0)121-414-6206 | fax +44-(0)121-414-5668/\  -
> \\// mobile 07050 104504 | http://www.clg.bham.ac.uk | o.mason at bham.ac.uk\/  -


