[Corpora-List] Brown Corpus

Amsler, Robert Robert.Amsler at hq.doe.gov
Tue Jun 21 12:39:28 UTC 2005



I'm somewhat surprised by Martin Wynne's comments against using fixed-size
corpus samples. You have to realize that not only do the intended uses of
the corpus change what is an appropriate sampling strategy, but whatever
sampling strategy you employ will introduce some bias into the corpus.

If one is constructing a corpus to sample vocabulary statistics, then it
would be very hard to argue that you should not use fixed-size samples.
Samples of different sizes could seriously skew vocabulary statistics.
Alternatively, if one is building a corpus to study narrative style, it
would be hard to argue that anything other than large, whole rhetorical
text units would be adequate. There is a lot of middle ground between
gathering statistics on word frequency and studying narrative style, and
the uses that fall in that middle ground should also be brought to bear on
corpus sampling strategy.
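
As a rough illustration of how sample size alone can shift a vocabulary
statistic, here is a minimal Python sketch. It is not taken from any real
corpus: the toy text is generated from a Zipf-like distribution, and the
names make_toy_text and type_token_ratio, the vocabulary size, and the
sample sizes are all assumptions chosen for the example.

```python
import random

def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)

def make_toy_text(n_tokens, vocab_size=5000, seed=0):
    """Generate artificial tokens whose frequencies fall off as 1/rank.

    This is a stand-in for real text (an assumption for illustration only).
    """
    rng = random.Random(seed)
    vocab = [f"w{rank}" for rank in range(1, vocab_size + 1)]
    weights = [1.0 / rank for rank in range(1, vocab_size + 1)]
    return rng.choices(vocab, weights=weights, k=n_tokens)

text = make_toy_text(20000)

# The same "text", measured on samples of three different sizes.
ttr_by_size = {n: type_token_ratio(text[:n]) for n in (500, 2000, 20000)}

for size, ttr in ttr_by_size.items():
    print(f"{size:>6} tokens: type/token ratio = {ttr:.3f}")
```

The ratio drops as the sample grows, even though every sample comes from
the same source, so figures computed on samples of unequal size are not
directly comparable.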

I am not certain there is ONE strategy for creating samples that would
please everyone. One idea might be to gather larger samples of text and
provide one or more sub-corpora of samples drawn from within the larger
corpus to produce more reasonable vocabulary counts. Nothing says that only
one corpus can be made from your texts, any more than photographs must be
presented exactly as they were shot rather than cropped to make other
pictures.
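
The sub-corpus idea could be sketched roughly as follows in Python. This
is only one possible reading of it: the function name
fixed_size_subcorpus, the choice of one contiguous window per text, the
default of 2000 tokens, and the decision to skip texts that are too short
are all assumptions made for the example.

```python
import random

def fixed_size_subcorpus(texts, sample_size=2000, seed=0):
    """Cut one contiguous window of `sample_size` tokens from each text.

    `texts` maps a text name to its list of tokens. Texts shorter than
    `sample_size` are skipped (an assumption; padding or pooling short
    texts would be other options).
    """
    rng = random.Random(seed)
    subcorpus = {}
    for name, tokens in texts.items():
        if len(tokens) < sample_size:
            continue
        # Pick a random start so the window fits entirely inside the text.
        start = rng.randrange(len(tokens) - sample_size + 1)
        subcorpus[name] = tokens[start:start + sample_size]
    return subcorpus

# Toy stand-ins for whole texts held in the larger corpus.
texts = {
    "long_text": [f"tok{i}" for i in range(10000)],
    "short_text": [f"tok{i}" for i in range(300)],
}

sub = fixed_size_subcorpus(texts, sample_size=2000)
print({name: len(tokens) for name, tokens in sub.items()})
```

The whole texts stay available for work on narrative style, while the
equal-length windows give a second view of the same material for
vocabulary counts. Changing the seed gives a different "crop" of the same
texts.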


