Corpora: corpora: evidence and intuition

Thu Nov 1 20:48:37 UTC 2001

Patrick (and list members), I wanted to jump into this discussion earlier, so I'm glad you have now joined it.

Your point about the possible non-salience of copula verbs (sales totaling $100) struck a chord - I still remember my first "discovery" on looking at the COBUILD corpus (circa 1982) was that "represent" often appeared as V+C in expressions like "this represents a major breakthrough" - yet none of the English pedagogical dictionaries had spotted it up to that point.

To a degree at least, some of these oddities are explained by corpus composition - totaling a car might come up in unscripted American speech (or maybe in a movie like "Clueless") but I wouldn't expect to find it in your corpus (BNC - purely British - plus Reuters and AP - news text, I assume); and conversely corpora like AP and WSJ are bound to have an awful lot of "revenues totaling $50m" etc. That'd also explain some of the other contributions (eg John Williams on radio station collocating so often with "seize" and "take over": I suspect the source here [Bank of English] is a tad overweight in journalistic texts).

But of course there's more to it than this.

The thing I wanted to add, tho, was to slightly re-phrase Sebastian's original question from 

-what makes you say "Wow, I wouldn't have thought that" to

"Wow, I wouldn't have thought OF that" (if I hadn't looked in the corpus)

-meaning: most of the time (not all, of course) the corpus reveals something we sort of already knew but could not retrieve through the unreliable process of introspection: i.e., when I saw that use of "represent" it wasn't that I'd never heard of it before (far from it) - so often, our response is more like "Of course, why didn't I think of that?!".

People doing corpus lexicography do indeed find they are subtly (and sometimes not so subtly) tweaking the description of English in their dictionaries, almost daily, to reflect insights that could only have been gleaned from a good corpus - but on the whole these insights do not actually "surprise" us (imho). 

Here's an example. It looks like CORE is now becoming an adjective (as well as a noun & verb). We're all familiar with the noun-modifier use beloved of management gurus (core business/competences/values etc) but now we're seeing even more adjective-like signs (e.g. this is absolutely core; core to this design is a sense of ...). So the evidence suggests we shd add a new word class. That's great, and I seriously doubt we could have recognised this without corpus data - but is it really a "surprise"?

In fact I'm slightly suspicious of people who claim to be continually "surprised" by what they find in corpora (of their own native languages anyway) - it suggests to me their intuitions aren't very good. (At least, as far as lexical data goes; I'm persuaded by some other contributions, e.g. John McKenny's point about "would", that we are probably not all that good at predicting the relative frequency of grammatical systems.)

I know intuition is a dirty word in some circles, but I think we need to *completely* distinguish it from introspection (i.e. where you just try to retrieve data from your own mental lexicon - this of course IS demonstrably unreliable). Could we say in this context intuition is the faculty by which humans interact with and interpret corpus data? All I know is, you don't get far without it in lexicography. Having worked with/hired/trained/been trained by maybe 150-200 lexicographers over the years, I would bet my last shirt that someone with lousy intuition, given the best linguistic resources and software in the universe, would produce a much worse dictionary than someone with great intuitions and just a modest corpus with basic software - would you agree, Patrick (and others)?

Michael Rundell