Corpora: Diacritics and "deviant" texts in corpora

Sat Apr 21 21:39:20 UTC 2001

I must urge Tadeusz Piotrowski **not** to standardize or
normalize Polish e-mails or news agency feeds when
adding them to his corpus.

His question (whether the corpus builder should regard an email
message which lacks diacritics as defective and should correct the
defects) seems to me very important for all corpus linguists. My
mentor and friend John Sinclair has always banged on about keeping the
corpus data "raw". This is precisely a situation where "improving" the
data at the time of data capture will lead to horrible confusion.

For many years people who used the Cobuild Bank of English corpus
moaned at me because of the "errors" in it. (Indeed, none of us is
perfect and there are some errors in the corpus!) But often the
perception that there were too many "errors" in the corpus came about
simply because the data collected did not conform to the linguists'
prior assumptions about what English text **should** look like.

Taking a practical (if somewhat flippant) example: suppose we were to
correct all non-standard uses of the English apostrophe, such as "I
love it's nutty taste" -> "I love its nutty taste".  It would become
pointless to conduct corpus investigations into the use of the
apostrophe in English, because the raw data would have been tampered
with.
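
To make the point concrete, here is a minimal sketch in Python. The three corpus lines, the tiny list of context words and the function names are all invented for illustration. A real study would use a proper corpus and something better than a two-word list to spot the possessive use.

```python
import re

# Hypothetical mini-corpus, invented for this illustration only.
raw_corpus = [
    "I love it's nutty taste",
    "It's raining again",
    "The dog wagged its tail",
]

# Crude assumption: "it's" directly before one of these words is the
# non-standard possessive use. The word list is purely illustrative.
NONSTANDARD = re.compile(r"\b[Ii]t's(?=\s+(?:nutty|own)\b)")


def count_nonstandard(lines):
    """Count non-standard possessive "it's" across all lines."""
    return sum(len(NONSTANDARD.findall(line)) for line in lines)


def correct(line):
    """'Improve' a line by removing the apostrophe from matched forms."""
    return NONSTANDARD.sub(lambda m: m.group(0).replace("'", ""), line)


# The "corrected" corpus is what you would have if the data had been
# tidied up at capture time.
corrected_corpus = [correct(line) for line in raw_corpus]

print("raw corpus:      ", count_nonstandard(raw_corpus))
print("corrected corpus:", count_nonstandard(corrected_corpus))
```

Run against the raw lines, the count finds the non-standard form. Run against the "corrected" lines, it finds nothing, and the phenomenon can no longer be studied at all.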

This notion of "raw" data is crucial: it goes to the very heart of
corpus linguistics as a distinct and innovative branch of linguistic
science. If you don't trust or believe in the data you collect, then
you might just as well invent your own sentences and study them
instead! Here's one to get you started: "Colourless green ideas sleep
furiously" -- that should keep a few people going for the next 30
years...

Jem Clear

Jem Clear Ltd
29 School Road, Moseley, Birmingham, B13 9TF, UK
Tel & Fax: +44 (0)121 689 3637
Email:     jem at jemclear.co.uk