[Corpora-List] ACL proceedings paper in the American National Corpus

Mon Sep 30 16:47:27 UTC 2002

There is clearly an issue here regarding what the American National Corpus
is trying to represent. The Brown Corpus tried to be "representative" by
extracting equal-sized samples selected from all the publications of a given
year. As has since been found, it did not adequately verify that all the
texts were written by American authors and, alas, we now know 1 million
words to be quite small (adequate only for a pocket dictionary's worth of
entries). Collegiate dictionaries require a corpus of at least 10 million
words, and unabridged dictionaries at least 100 million words (the target
of the ANC).

However, what I detect so far from the ANC literature is that they are
trying first to fill the quota of 100 million words and are only secondarily
concerned with "balancing" the corpus for genre and sample size.

Also, if I'm not mistaken, the Brown Corpus didn't JUST balance for genre;
it also tried to balance for timespan. I.e., it tried to form a closed
universe of possible publications and then sample representatively from
that universe. This involves attempting to determine all the possible
publications in that universe and then selecting a subset which represents
them in both quantity and genre. While it may seem ambitious to decide first
what is in the list of all available publications (especially if your
criterion for inclusion is merely "published after 1990"), it may be the
only way to have a universe from which a truly random sample can be
extracted.
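
To make the "closed universe" idea concrete, here is a minimal sketch of
drawing a random sample that preserves each genre's share of the universe.
It assumes the universe has already been enumerated as (title, genre)
pairs. The function name stratified_sample, the genre labels, and the
proportional allocation rule are my own illustrative assumptions; this is
not the procedure documented for the Brown Corpus or anything used by the
ANC.

```python
import random
from collections import defaultdict


def stratified_sample(universe, total_samples, seed=0):
    """Draw titles at random so each genre keeps its share of the universe.

    universe: list of (title, genre) pairs (the enumerated closed universe).
    Returns a dict mapping genre -> list of sampled titles.
    """
    n = len(universe)
    if not 0 <= total_samples <= n:
        raise ValueError("total_samples must be between 0 and len(universe)")

    by_genre = defaultdict(list)
    for title, genre in universe:
        by_genre[genre].append(title)

    # Proportional allocation in integer arithmetic: each genre first gets
    # the floor of its share, then leftover slots go to the genres with the
    # largest remainders (ties broken alphabetically by genre name).
    alloc = {g: (total_samples * len(t)) // n for g, t in by_genre.items()}
    remainder = {g: (total_samples * len(t)) % n for g, t in by_genre.items()}
    leftover = total_samples - sum(alloc.values())
    for g in sorted(by_genre, key=lambda g: (-remainder[g], g))[:leftover]:
        alloc[g] += 1

    # Seeded generator so the same universe and seed give the same sample.
    rng = random.Random(seed)
    return {g: rng.sample(by_genre[g], alloc[g]) for g in sorted(by_genre)}


# Hypothetical universe: 20 publications in three made-up genres.
universe = (
    [(f"press-{i}", "press") for i in range(10)]
    + [(f"fiction-{i}", "fiction") for i in range(6)]
    + [(f"learned-{i}", "learned") for i in range(4)]
)
sample = stratified_sample(universe, 10, seed=1961)
for genre, titles in sample.items():
    print(genre, len(titles), titles)
```

The point of the sketch is only that the sampling step is trivial once the
universe is listed; the hard part is the enumeration itself, which the code
takes as given.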

Note: Brown Corpus Manual http://www.hit.uib.no/icame/brown/bcm.html

Robert A. Amsler
robert.amsler at hq.doe.gov
(301) 903-8823
