[Corpora-List] Re: Are Corpora Too Large?

Ramesh Krishnamurthy ramesh at easynet.co.uk
Thu Oct 3 08:23:10 UTC 2002


Dear Robert (if I may)

Thank you for a stimulating contribution!

You raise too many interesting and complex issues for me to
reply to adequately, as I am about to leave for the airport.

However, here are one or two points which came readily to mind
(obviously, I'm going to disagree with you about corpus size, but
one does not often get a chance to reinspect the argumentation).

> We want examples of lexical usage, grammatical constructions, perhaps
> even anaphora between multiple sentences.
Pragmatics and discourse organizers, and even 
semantics, often need a substantial context to 
make it clear what is going on.

Most current corpora are - for purely technological reasons -
heavily biased towards the written word. We would obviously like to
have at least an equal amount of speech, before we can know about
features of the spoken language which may also require substantial
context.

> I haven't heard many talk about corpora as good ways to study the higher
> level structure of documents--largely because to do so requires whole
> documents and extracts can be misleading even when they have reached
> 45,000 words in size (the upper limit of samples in the British National
> Corpus).

Not all corpora use text extracts. Cobuild/Birmingham has always used entire texts
wherever possible, although the term "text" is itself problematic: do we treat a whole
issue of a newspaper as a single text, or as a collection of smaller texts, i.e. articles?
Each article has a certain unity, but so does each issue (in-house editorial policies,
the day's topics, etc.).

> The main question here is if we are seeking lexical variety, if the lexicon
> basically consists of Large Numbers of Rare Events (LNREs), then why aren't
> we collecting language data to maximize the variety of that type of
> information rather than following the same traditional sampling practices
> of the earliest corpora?
Some of us may want to test the hypothesis that the lexicon consists of
LNREs. It is often possible to group the LNREs into sets, groups, or classes
of various kinds, which share some behavioural properties.
Also, some of us may want to know in more detail about the opposite
(SNFEs? Small Numbers of Frequent Events?). A few years back, while
trying to find examples of the difference between British and American usage
of "have" and "take", I discovered the financial expression "take a bath" (unknown
to me at the time, and not recorded in any of the reference works I had access to).
So rare events may be going on undiscovered in the bulk of what we superficially
took to be a frequent event.
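
(As a rough illustration of what I mean by looking at both ends of the
distribution at once, something like the following would do; the file name
"corpus.txt" and the naive whitespace tokenisation are just placeholders.)

    # Sketch: frequency spectrum of a corpus file, showing rare and
    # frequent events side by side. "corpus.txt" is a placeholder for
    # real data; the whitespace tokenisation is deliberately naive.
    from collections import Counter

    with open("corpus.txt", encoding="utf-8") as f:
        tokens = f.read().lower().split()

    freqs = Counter(tokens)                # word type -> frequency
    spectrum = Counter(freqs.values())     # frequency -> number of types

    print("tokens:", len(tokens), "types:", len(freqs))
    print("hapaxes (types occurring once):", spectrum[1])
    print("most frequent types:", freqs.most_common(10))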

> Because text was manually entered, one really couldn't analyze it until
> AFTER it had been selected for use in the corpus. You picked samples on
> the basis of their external properties and discovered their internal
> composition after including them in the corpus.

As far as I know, most of the software I use to analyse corpus data
needs the data to be in the corpus before it can perform the analysis.
This may be easily redesigned, but that is beyond my knowledge. If
it is easy, I'm surprised that some of my enterprising software colleagues haven't 
done it already. Of course, part of the analysis consists of seeing what effect the 
arrival of new data has had on the whole corpus, which you couldn't do if you 
analysed the new data separately.

I'm sure experts on seals do not object to checking each seal that comes into
their survey area, just because they have seen seals before, or even if they have seen
the same seal many times before. It may always offer something new, or at least serve
to confirm hypotheses which are well established. Corpora also exist to confirm in
a more robust way ideas we may have had about language for centuries. This helps us
to draw a finer distinction between the invariable and the variable. I can make some
statements about English with greater confidence from a 450m word corpus
than I could from a 1m word corpus. Of course, I may also have gained some insights
during the time it has taken to increase the corpus by this amount. So there may also
be qualitative improvements in our analyses.
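
(A back-of-the-envelope illustration of the confidence point, with an
invented rate of one occurrence per million words and a crude normal
approximation, showing how the margin of error on a relative frequency
shrinks as the corpus grows:)

    # Sketch: margin of error on a relative frequency at two corpus sizes.
    # The rate of one occurrence per million words is invented purely for
    # illustration; the normal approximation is rough at such small counts.
    import math

    def ci95(count, n):
        """Approximate 95% confidence interval for a proportion count/n."""
        p = count / n
        half = 1.96 * math.sqrt(p * (1 - p) / n)
        return p - half, p + half

    for corpus_size in (1_000_000, 450_000_000):
        hits = corpus_size // 1_000_000        # ~1 hit per million words
        low, high = ci95(hits, corpus_size)
        print(f"{corpus_size:>11,} words: {hits:>3} hits, "
              f"95% CI: {low * 1e6:.2f} to {high * 1e6:.2f} per million words")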

> with little note of whether a sample increases the variety
> of lexical coverage or not.

> The question is whether we could track the number of new terms appearing
> in potential samples from a new source and optimally select the sample
> that added the most new terms to the corpus without biasing the end
> result. In my metaphor, whether we could add muscle to the corpus rather
> than just fatten it up.

You seem to be overly concerned with lexical variety and new items.
Many of us are quite happy just to know a little bit more about the old items.
Every linguistic statement deserves to be reinvestigated, especially those
that we may have taken as axiomatic in the past. The increasing size of 
corpora adds not only breadth, but also depth.
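
(For concreteness, though, here is a minimal sketch of the kind of greedy
selection Robert describes: from a pool of candidate samples, keep taking
the one that adds the most new word types to the existing vocabulary. The
toy data are placeholders, and in real use the choice would have to be
constrained by the usual balancing criteria.)

    # Sketch: greedily choose candidate samples that add the most new
    # word types to an existing corpus vocabulary. The vocabulary and
    # candidate samples below are toy placeholders.
    def pick_samples(corpus_vocab, candidates, k):
        """Return up to k candidate token lists, most-new-types first."""
        vocab = set(corpus_vocab)
        chosen = []
        pool = list(candidates)
        for _ in range(k):
            best = max(pool, key=lambda toks: len(set(toks) - vocab), default=None)
            if best is None or not set(best) - vocab:
                break                        # nothing left adds new types
            chosen.append(best)
            vocab |= set(best)
            pool = [c for c in pool if c is not best]
        return chosen

    vocab = {"the", "of", "a", "seal", "corpus"}
    samples = [["the", "seal", "swam"], ["take", "a", "bath"], ["of", "the"]]
    print(pick_samples(vocab, samples, 2))   # picks the "bath" sample first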

> This also raises the question of why have sample sizes grown so large? The
> Brown corpus created a million words from 500 samples of 2000 words each.
> Was 2000 words so small that everyone was complaining about how it stifled
> their ability to use the corpus? Or is it merely that given we want 100
> million words of text it is far easier to increase the sample sizes by
> 20-fold than find 20 more sources from which to sample.

Ideally, surely we would want to do both. Depth and breadth again.


Best
Ramesh


----- Original Message -----
From: "Amsler, Robert" <Robert.Amsler at hq.doe.gov>
To: corpora at hd.uib.no
Subject: [Corpora-List] Are Corpora Too Large?



Heresy! But hear me out.

My question is really whether we're bulking up the size of corpora vs.
building them up to meet our needs.

Most of the applications of corpus data appear to me to be lexical or
grammatical, operating at the word, phrase, sentence or paragraph level. We
want examples of lexical usage, grammatical constructions, perhaps even
anaphora between multiple sentences. I haven't heard many talk about corpora
as good ways to study the higher level structure of documents--largely
because to do so requires whole documents and extracts can be misleading
even when they have reached 45,000 words in size (the upper limit of samples
in the British National Corpus).

The main question here is if we are seeking lexical variety, if the lexicon
basically consists of Large Numbers of Rare Events (LNREs), then why aren't
we collecting language data to maximize the variety of that type of
information rather than following the same traditional sampling practices of
the earliest corpora?

In the beginning, there was no machine-readable text. This meant that
creating a corpus involved typing in text and the amount of text you could
put into a corpus was limited primarily by the manual labor available to
enter data. Because text was manually entered, one really couldn't analyze
it until AFTER it had been selected for use in the corpus. You picked
samples on the basis of their external properties and discovered their
internal composition after including them in the corpus.

Today, we largely create corpora based on obtaining electronic text and
sampling from that text. This means that we have the additional ability to
examine a lot of text before selecting a subset to become part of the
corpus. While external properties of the selected text are as important as
ever and should be representative of what types of text we feel are
appropriate to "balance" the corpus, the internal properties of the text are
still taken almost blindly, with little note of whether a sample increases
the variety of lexical coverage or not.

The question is whether we could track the number of new terms appearing in
potential samples from a new source and optimally select the sample that
added the most new terms to the corpus without biasing the end result. In my
metaphor, whether we could add muscle to the corpus rather than just fatten
it up.

This also raises the question of why have sample sizes grown so large? The
Brown corpus created a million words from 500 samples of 2000 words each.
Was 2000 words so small that everyone was complaining about how it stifled
their ability to use the corpus? Or is it merely that given we want 100
million words of text it is far easier to increase the sample sizes by
20-fold than find 20 more sources from which to sample.


