<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<META NAME="Generator" CONTENT="MS Exchange Server version 5.5.2658.24">
<TITLE>RE: [Corpora-List] Brown Corpus</TITLE>
</HEAD>
<BODY>
<BR>
<BR>
<P><FONT SIZE=2>I'm somewhat surprised by Martin Wynne's comments against using fixed-size corpus samples.</FONT>
<BR><FONT SIZE=2>You have to realize that not only do the intended uses of the corpus determine which sampling strategy is appropriate, but whatever sampling strategy you employ will introduce some bias into the corpus.</FONT></P>
<P><FONT SIZE=2>If one is constructing a corpus to sample vocabulary statistics, then it would be very hard to argue that you should not use fixed-size samples. Samples of different sizes could seriously skew vocabulary statistics. Alternatively, if one is building a corpus to study narrative style, it would be hard to argue that anything other than large, whole rhetorical text units would be adequate. There is a lot of middle ground between gathering statistics on word frequency and studying narrative style, and the needs of those intermediate uses should also be brought to bear on corpus sampling strategy.</FONT></P>
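<P><FONT SIZE=2>To make the point about skewed vocabulary statistics concrete, here is a rough sketch in Python. The toy text and the function name type_token_ratio are made up for illustration; the only claim is that a simple measure such as the ratio of distinct word forms to running words depends on how long the sample is, so samples of unequal length are not directly comparable.</FONT></P>
<PRE>
```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by the number of running words."""
    return len(set(tokens)) / len(tokens)


# Toy text, invented purely for illustration.
text = (
    "the cat sat on the mat and the dog sat on the rug "
    "while the cat watched the dog and the dog watched the cat"
).split()

short_sample = text[:6]   # a short sample from the start of the text
long_sample = text        # the whole text

# The short sample looks lexically "richer" only because it is short:
# repeated words have had less chance to recur.
print("short:", len(short_sample), "tokens, ratio", type_token_ratio(short_sample))
print("long: ", len(long_sample), "tokens, ratio", type_token_ratio(long_sample))
```
</PRE>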
<P><FONT SIZE=2>I am not certain there is ONE strategy for creating samples that would please everyone. One idea might be to gather larger samples of text and provide one or more sub-corpora of samples within the larger corpus to produce more reasonable vocabulary counts. Nothing says your texts can yield only one corpus, any more than photographs must be presented exactly as they were shot rather than cropped to make other pictures.</FONT></P>
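<P><FONT SIZE=2>One possible way to derive such a sub-corpus is sketched below in Python. Everything here is an assumption for illustration: the function name fixed_size_samples, the dictionary of text ids to token lists, and the choice of one randomly placed contiguous run per text. It is not a description of how any existing corpus was sampled.</FONT></P>
<PRE>
```python
import random


def fixed_size_samples(texts, sample_size, seed=0):
    """Cut one contiguous run of sample_size tokens out of each text.

    texts maps a (hypothetical) text id to its full list of tokens.
    Texts with fewer than sample_size tokens are left out of the sub-corpus.
    """
    rng = random.Random(seed)  # fixed seed so the sub-corpus is reproducible
    samples = {}
    for text_id, tokens in texts.items():
        if sample_size > len(tokens):
            continue  # too short to supply a full-size sample
        start = rng.randint(0, len(tokens) - sample_size)
        samples[text_id] = tokens[start:start + sample_size]
    return samples


# Toy "larger corpus" of whole texts, invented for illustration.
whole_texts = {
    "novel": "one two three four five six seven eight nine ten".split(),
    "memo": "short note".split(),
}

sub_corpus = fixed_size_samples(whole_texts, sample_size=4)
for text_id, sample in sub_corpus.items():
    print(text_id, sample)
```
</PRE>
<P><FONT SIZE=2>The whole texts stay available for work on narrative style, while the equal-length cuts can be used for vocabulary counts.</FONT></P>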
<BR>
<BR>
</BODY>
</HTML>