[Corpora-List] Corpus Sanitation

Geoffrey Sampson geoffs at cogs.susx.ac.uk
Fri Nov 29 10:18:37 UTC 2002


On Zheng Zhiping's posting, to me there is an important difference between
"bad language" and individuals' names, or information that could lead to
identification of individuals.  Like Tony McEnery, I don't believe there is
any real reason to censor the "bad language"; it is important linguistic
data, and we are all grown-ups.  But I do think that before such a resource
is made public, strenuous efforts should be made to eliminate any possibility
of users identifying either the individuals who produced the material, or
any individuals or individual institutions written about.  Actually, under
various national Data Protection laws I suspect it might be illegal not to
do this, even if the material is simply held at one institution and not
circulated.  But it ought to be done anyway, for reasons that I discuss at
some length in the "ethics" section of the documentation file accompanying
my CHRISTINE1 Corpus (available via the Web from my home page,
www.grsampson.net: follow the links to downloadable research resources and then
CHRISTINE).  I discuss there what seems to me to have been inadequate
practice in this respect in the spoken section of the British National Corpus.
There are places where really damaging things are said in a quite casual
way in conversation about people or organizations that might easily
be identified by people who know them (and could probably be identified by
strangers with only minimal detective work).  The recorded speakers had no
motive to worry about this, but I believe corpus linguists have a responsibility
not to let such casual gossip about identifiable people be turned into
permanent public records.

Geoffrey Sampson


Prof. G.R. Sampson   MA   PhD   MBCS

Professor of Natural Language Computing
School of Cognitive & Computing Sciences
University of Sussex
Falmer, Brighton BN1 9QH, GB

e-mail geoffs at cogs.susx.ac.uk (no attachments please)
tel. +44 1273 678525
fax  +44 1273 671320
web http://www.grsampson.net


