Are Corpora Too Large?

Heresy! But hear me out.

My question is really whether we are bulking up the size of corpora versus building them up
to meet our needs.

Most of the applications of corpus data appear to me to be lexical or grammatical, operating at
the word, phrase, sentence or paragraph level. We want examples of lexical usage, grammatical
constructions, perhaps even anaphora across multiple sentences. I haven't heard many people talk
about corpora as good ways to study the higher-level structure of documents, largely because doing
so requires whole documents, and extracts can be misleading even when they run to 45,000 words
(the upper limit for samples in the British National Corpus).

The main question is this: if we are seeking lexical variety, and if the lexicon basically
consists of Large Numbers of Rare Events (LNREs), then why aren't we collecting language data in a
way that maximizes the variety of exactly that kind of information, rather than following the same
traditional sampling practices as the earliest corpora?

In the beginning, there was no machine-readable text. This meant that creating a corpus involved
typing in text, and the amount of text you could put into a corpus was limited primarily by the
manual labor available to enter data. Because text was manually entered, one really couldn't
analyze it until AFTER it had been selected for use in the corpus. You picked samples on the basis
of their external properties and discovered their internal composition only after including them
in the corpus.

Today, we largely create corpora by obtaining electronic text and sampling from it. This means we
now have the ability to examine a great deal of text before selecting the subset that becomes part
of the corpus. While the external properties of the selected text are as important as ever, and
should be representative of the types of text we feel are appropriate to "balance" the corpus, the
internal properties of the text are still accepted almost blindly, with little attention to
whether a sample increases the variety of lexical coverage or not.

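Once the text is electronic, that check is cheap to automate. Here is a minimal sketch of what
"does this sample increase the variety of lexical coverage?" could look like, in hypothetical
Python, with "term" crudely approximated as a lowercased word type (the function names are my own,
purely for illustration):

    import re

    def word_types(text):
        # Crude notion of "term": the set of lowercased alphabetic word types.
        return set(re.findall(r"[a-z]+", text.lower()))

    def coverage_gain(candidate_text, corpus_texts):
        # How many word types would this candidate add that the corpus lacks?
        corpus_vocab = set()
        for text in corpus_texts:
            corpus_vocab |= word_types(text)
        return len(word_types(candidate_text) - corpus_vocab)

A serious version would want a real tokenizer and probably lemmatization, but the point is simply
that the number is available before a sample is admitted, not after.
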
The question is whether we could track the number of new terms appearing in potential samples from
a new source and select, from among them, the sample that adds the most new terms to the corpus,
without biasing the end result. In terms of my metaphor: whether we could add muscle to the corpus
rather than just fatten it up.

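To make the bookkeeping concrete, one way it might be done (again a hypothetical Python sketch
rather than anything anyone is running) is a greedy pass over the candidate samples from a source,
always taking the one that adds the most unseen types, and stopping once the word budget allotted
to that source is spent. The word_budget parameter and the crude tokenization are my own inventions
for the example:

    import re

    def word_types(text):
        # Same crude stand-in for "terms" as above: lowercased word types.
        return set(re.findall(r"[a-z]+", text.lower()))

    def pick_samples(candidates, corpus_vocab, word_budget):
        # Greedily choose the candidate samples that add the most new types,
        # stopping when the word budget allotted to this source is used up.
        vocab = set(corpus_vocab)
        chosen, words_used = [], 0
        remaining = list(candidates)
        while remaining and words_used < word_budget:
            best = max(remaining, key=lambda text: len(word_types(text) - vocab))
            if not word_types(best) - vocab:
                break  # nothing left in this source adds lexical variety
            chosen.append(best)
            vocab |= word_types(best)
            words_used += len(best.split())
            remaining.remove(best)
        return chosen, vocab

The "without biasing the end result" part is the hard bit, of course: one would still sample
candidates within whatever external categories the design calls for, and probably cap what any
single source can contribute, or the greedy step will happily favour word lists over ordinary
prose.
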
This also raises the question of why sample sizes have grown so large. The Brown corpus reached a
million words with 500 samples of 2,000 words each. Was 2,000 words so small that everyone was
complaining about how it stifled their ability to use the corpus? Or is it merely that, given we
want 100 million words of text, it is far easier to increase the sample sizes twenty-fold than to
find twenty times as many sources from which to sample?

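Just to spell out the arithmetic behind that suspicion, using only the figures already mentioned:

    Brown:   500 samples x 2,000 words  =   1,000,000 words
    Target:                               100,000,000 words
    Required growth: 100-fold in (number of samples x sample size)

    Grow the samples ~20-fold (2,000 -> ~40,000 words, near the BNC's 45,000
    cap) and only about five times as many samples are needed.
    Hold samples at 2,000 words and 50,000 samples are needed, i.e. 100 times
    as many as Brown drew.

The first route is plainly the cheaper one to administer, which is what makes me suspect that
convenience, rather than any demonstrated need for 45,000-word extracts, has driven sample sizes
up.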