The ANC (was: Using the BNC)

Wed Dec 18 17:20:10 UTC 2002

Anyone who is not aware that there is a consortium building an American
equivalent to the BNC should check out
http://americannationalcorpus.org/

A corpus is an invaluable tool that enhances and facilitates a
lexicographer's ability to analyze the behavior of language. It doesn't
replace citation cards or an electronic citation bank for etymological
purposes or for illustrations of the uses of words that predate the range
of the corpus. I would argue, however, that it should replace citations as
a main source of example sentences for uses that fall within the date range
represented in the corpus, particularly for the core of the language--the
most common words that are under-represented in citation banks because they
are not "interesting" as a focus of antedatings or neologisms. Corpora vary
according to their collection criteria: some are synchronic and some are
diachronic, with varying spans of years; some focus only on, say, newspaper
text, and others strive for a balanced range. The BNC (and the nascent ANC)
contains everything from highly edited text from scholarly journals to
ephemera. The ANC will contain e-mail. So it's not accurate to say that all
the texts in corpora are edited.

The key difference between a corpus and a bank of citation cards is that
citations center around a term that stuck out to the citation gatherer for
any number of reasons, and the surrounding context is usually quite short
(less than 250 words); a corpus, in contrast, contains long stretches of
uninterrupted text. This makes a corpus a superior tool for analyzing
frequency of use of common terms (invaluable for second language learners'
dictionaries), collocational behavior, and usage patterns.
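
To make the point about frequency and collocational behavior concrete,
here is a minimal sketch in Python. It is an illustration only, not the
interface of any real corpus tool: the function names (tokenize,
collocates), the window size, and the four sample sentences are all
assumptions invented for the example. It counts how often a word occurs
and which words appear immediately next to it, which is the kind of
evidence that running text supports and isolated citation slips do not.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each `node` hit.

    Sentence boundaries are ignored in this sketch, so a neighbor may
    come from an adjacent sentence.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Hypothetical stand-in for a stretch of running corpus text.
SAMPLE = (
    "She set the table before dinner. "
    "They set off at dawn. "
    "He set the alarm and set off for work. "
    "The sun set behind the hills."
)

tokens = tokenize(SAMPLE)
freq = Counter(tokens)                           # raw word frequencies
near_set = collocates(tokens, "set", window=1)   # immediate neighbors

print(freq["set"])
print(near_set.most_common(3))
```

Even on this toy sample the neighbors of "set" start to separate its
uses ("set the ...", "set off"). A real corpus of many millions of words
gives the same kind of counts at a scale where the patterns can be
ranked and trusted.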

People have argued that if a citation bank is in electronic format, it
amounts to a corpus, but I disagree: the stretches of text are simply too
short to attempt any meaningful analysis of discourse markers, argument
threads, and other rhetorical devices such as the way people set up stories
they are about to tell. It's very difficult to persuade people until they
sit down and actually USE a corpus--try, for example, to write the entry
for "set" using citation cards AND a corpus. I find that all doubt about
the value of a corpus disappears after that exercise.

Wendalyn Nichols
(advisor to the ANC consortium, and British-trained lexicographer)

At 05:24 AM 12/18/02 -0500, Frank Abate wrote:
>What Michael Q observed below, in response to Jonathon G's point, just about
>nails it, in brief, re the value of a corpus to lexos (lexicographers).  One
>could go on, but MQ has captured the essence of the value of a corpus to
>lexicography -- a corpus allows research into collocational patterns and
>other such phenomena that are at the heart of how the language works.  This
>cannot be researched otherwise, at least not in any sort of depth (with
>hats off to the BBI Combinatory Dict, etc.).
>
>Other lexos, such as Sue Atkins and Patrick Hanks, are FAR more conversant
>on the value of corpora to lexicography, so please seek them out -- or their
>published works -- for more details and effusion.  Also, you could look at
>the Intro to either the New Oxford Dictionary of English (UK), or its
>transatlantic cousin, the New Oxford American Dictionary.  Both of these are
>corpus-based, and, I think, the first general dicts to be corpus-based (the
>Collins Cobuild is not a general dict, strictly speaking).
>
>If I were to embark on a new general (not historical) dictionary project,
>working from a "blank piece of paper", and were given a choice of having
>either a good general corpus or a citation file, I would absolutely choose
>the corpus -- in a heartbeat.  One can get citational evidence from OED,
>MW3, etc., and can also do Googling and the like for specific word/sense
>research.  But if you want to look at the core of the language down into the
>nitty-gritty details, you can't beat a good, solid, contemporary corpus, as
>long as it is extensive enough -- say, 100 million words AT LEAST.  The
>bigger, the better.
>
>Frank Abate
>
>******************************
>Jonathon Green wrote:
>
> > what it does not do, however, is give a page number for the
> > material cited. It gives a page range (presumably those read for
> > the Corpus), e.g. 'pp. 62-165' and the number of 's-units' and the
> > total word count, but, as I say, no page number as such. This, for
> > my purposes, and I would imagine those of other lexicographers,
> > renders it more interesting than practically helpful.
>
>I don't use it (I don't have access to it), but I do use the cut-down
>CD-ROM version that was produced some years ago. My understanding is
>that its great value, as with other corpora, lies in the opportunity
>it gives to identify and rank collocations and to assess the relative
>importance and frequency of various forms in a balanced image of one
>regional type of English.
>
>As others have mentioned, this is something that a search of Google
>cannot so easily do, since there are all sorts of systemic biases in
>the material that it indexes. Where Google scores over corpora,
>however, is that it is a different kind of snapshot, one of current
>English that is to a significant degree free from the strictures of
>good taste and editing. I've found it immensely useful, for example,
>when trying to judge whether a form has gained wide currency as a
>folk etymology (chaise lounge, bare with me, without further adieux,
>etc).
>
>
>--
>Michael Quinion
>Editor, World Wide Words
>E-mail: <TheEditor at worldwidewords.org>
>Web: <http://www.worldwidewords.org/>