[Corpora-List] Are Corpora Too Large?

Amsler, Robert Robert.Amsler at hq.doe.gov
Wed Oct 2 12:38:18 UTC 2002



Heresy! But hear me out.

My question is really whether we're bulking up the size of corpora rather than
building them up to meet our needs.

Most of the applications of corpus data appear to me to be lexical or
grammatical, operating at the word, phrase, sentence or paragraph level. We
want examples of lexical usage, grammatical constructions, perhaps even
anaphora across multiple sentences. I haven't heard many people talk about
corpora as good ways to study the higher-level structure of documents, largely
because doing so requires whole documents, and extracts can be misleading even
when they reach 45,000 words (the upper limit on sample size in the British
National Corpus).

The main question here is this: if we are seeking lexical variety, and if the
lexicon basically consists of Large Numbers of Rare Events (LNRE), then why
aren't we collecting language data so as to maximize the variety of that type
of information, rather than following the same traditional sampling practices
as the earliest corpora?

In the beginning, there was no machine-readable text. This meant that creating
a corpus involved typing in text, and the amount of text you could put into a
corpus was limited primarily by the manual labor available to enter the data.
Because text was manually entered, one really couldn't analyze it until AFTER
it had been selected for use in the corpus. You picked samples on the basis of
their external properties and discovered their internal composition only after
including them in the corpus.

Today, we largely create corpora by obtaining electronic text and sampling
from it. This means we have the additional ability to examine a great deal of
text before selecting a subset to become part of the corpus. The external
properties of the selected text are as important as ever and should be
representative of the types of text we feel are appropriate to "balance" the
corpus, but the internal properties of the text are still taken almost
blindly, with little attention to whether a sample increases the variety of
lexical coverage or not.

The question is whether we could track the number of new terms appearing in
potential samples from a new source and select the sample that adds the most
new terms to the corpus, without biasing the end result. In my metaphor:
whether we could add muscle to the corpus rather than just fatten it up.
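
For concreteness, here is a minimal sketch of the kind of selection I have in
mind, in Python. The tokenization, the toy candidate texts, and the greedy
choice of a single sample are all illustrative assumptions on my part, not a
worked-out methodology; in practice the balancing constraints on external
properties would still have to be applied before or alongside this step.

    import re

    def tokens(text):
        # Crude tokenization: lowercase alphabetic word forms only; a real
        # corpus builder would use a proper tokenizer, perhaps lemmatization.
        return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

    def new_term_count(sample, corpus_vocab):
        # How many word types in this candidate sample are not yet in the corpus?
        return len(set(tokens(sample)) - corpus_vocab)

    def pick_sample(candidates, corpus_vocab):
        # Greedily choose the candidate that contributes the most unseen types.
        return max(candidates, key=lambda s: new_term_count(s, corpus_vocab))

    # Toy illustration: the corpus so far, plus two candidates from a new source.
    corpus_vocab = set(tokens("the cat sat on the mat"))
    candidates = [
        "the cat sat on the chair",                 # adds 1 new type
        "a heron stalked minnows in the shallows",  # adds 6 new types
    ]
    best = pick_sample(candidates, corpus_vocab)
    corpus_vocab |= set(tokens(best))               # update the corpus vocabulary
    print(best)
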

This also raises the question of why sample sizes have grown so large. The
Brown corpus created a million words from 500 samples of 2000 words each. Was
2000 words so small that everyone was complaining about how it stifled their
ability to use the corpus? Or is it merely that, given that we want 100 million
words of text, it is far easier to increase the sample sizes 20-fold than to
find 20 times as many sources from which to sample?
