Using the BNC

Scott Sadowsky lists at SPANISHTRANSLATOR.ORG
Tue Dec 17 08:28:36 UTC 2002

Dave Wilton:
>But the page number is essential if you want to check the accuracy of the
>electronic version. One has no assurance that the text wasn't altered when it
>was digitized. Such changes happen often enough that it is not a trivial
>Depending on the importance of the quotation to your work, you may very much
>want to go back and look at it in the original just to make sure.

I guess this reflects our different criteria and goals.  Coming from a
corpus linguistics approach, I would assign virtually no importance to any
single occurrence of *anything*.  First of all because, as you point out, a
certain percentage of bad data is inevitable, and that means that you need
multiple occurrences of what you're interested in before you can even be
sure it's actually what you've found.  And secondly, there is very little
that you can reliably conclude from a single event of any sort.

From this perspective, antedating based on single examples (to take on a
favorite ADS-L pastime) is really quite pointless.  Of course, if the goal
is to assemble everyone's antedatings of a given word at some point and
work from there, then it makes some sense.  But this approach is utterly
hit-and-miss, and to my mind the only responsible reaction to reports of
such data points is to file them away until there are enough to begin to
draw at least moderately sound conclusions.

Personally, if I were an antedater I'd stop reading old texts, start OCRing
them, assemble them into corpora when I had a couple million words, and
then knock off five or ten thousand antedatings in a week or two.
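
As a rough illustration of that workflow, here is a minimal Python sketch. It assumes the OCR step is already done and that the corpus is available as (year, text) pairs; the function name find_antedatings, the known_first_dates table, and the min_hits threshold are all hypothetical names made up for this example, not part of any existing tool. It only reports a word when it turns up several times before its recorded first date, in keeping with the point above about single occurrences.

```python
import re
from collections import defaultdict

# Simple word pattern: lowercase letters, allowing internal hyphens/apostrophes.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def find_antedatings(corpus, known_first_dates, min_hits=3):
    """Return candidate antedatings from a dated corpus.

    corpus            -- iterable of (year, text) pairs (assumed already OCRed)
    known_first_dates -- dict mapping a word to its earliest recorded year
    min_hits          -- occurrences required before a word is reported

    The result maps each qualifying word to (earliest_year_seen, hit_count).
    """
    early_hits = defaultdict(list)
    for year, text in corpus:
        for token in TOKEN_RE.findall(text.lower()):
            first = known_first_dates.get(token)
            # Keep only occurrences that predate the recorded first use.
            if first is not None and year < first:
                early_hits[token].append(year)
    # Drop words with too few early occurrences to be trusted.
    return {
        word: (min(years), len(years))
        for word, years in early_hits.items()
        if len(years) >= min_hits
    }

if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    corpus = [
        (1890, "The jazz band played jazz."),
        (1895, "More jazz tonight"),
        (1900, "a blurb"),
    ]
    known = {"jazz": 1912, "blurb": 1907}
    print(find_antedatings(corpus, known))
```

With the toy data, only the word seen three times before its recorded date is reported; the single early hit is held back, which is the "file it away until there are enough" policy in code form. A real run would also need to cope with OCR errors, spelling variation, and misdated texts.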

And yes, I'm aware you weren't talking specifically about antedating...
call it a ranty tangent if you will.


Scott Sadowsky
Centro de Estudios Cognitivos, Universidad de Chile
