[Corpora-List] What is corpora and what is not?

Chris Brew christopher.brew at gmail.com
Thu Oct 4 14:22:35 UTC 2012


* [Chris Brew]: "I do like the Brown Corpus, which is defined to be
representative of 15 broad categories of writing, all first published in
1961 and all by native speakers of American English...I don't know how to
give a precise operational definition of the boundaries of the 15
broad categories, and I am not quite sure what "first published" or "native
speaker of American English" would mean in practice... However, in this
case, it really is the thought that counts. By articulating the principles
that guided the creation of the corpus, Kucera and Francis opened the way
to the creation of comparable corpora for other languages and other years."
- I'm afraid you undermined your own case by referring to the categorial
problems. *

I don't think so. If anything, I undermined the case for a position that I
am not at all interested in defending. My answer to Ramesh's big question
about representativeness below is a clear "no".  In fact, I think that the
very idea of "writing as a whole" is unhelpful, and would much rather talk
about specific cases of how and why people write.

The Brown Corpus categories have obvious (and probably also non-obvious)
deficiencies, and certainly cannot be adopted wholesale forever. My rather
limited point is that it was helpful that Kucera and Francis were explicit
about their principles. As a matter of fact, many corpora were created with
the explicit goal of being comparable to the Brown Corpus. "Comparable"
here just means that the authors of the new corpus hope that it will be
scientifically useful to make comparisons, and have tried to set up their
corpus to facilitate this. Ramesh is completely entitled to question
whether this effort does or can succeed. That's part of the normal process
of scientific investigation. And he is also right to suggest that the
influence of the Brown Corpus design could be a double-edged sword,
especially when the design is adopted wholesale without much thought.


*An even bigger question is: to what extent are these 15 categories truly
"representative" of writing as a whole?*



-- 
Chris Brew, Educational Testing Service