[Corpora-List] Corpus Sanitation

Fri Nov 29 16:20:11 UTC 2002

Another comment on anonymisation. The problem is even worse if one wishes
to make available (for whatever purpose) the audio or video tapes from
which transcriptions have been prepared. I believe that considering this
more challenging case also clarifies issues for text-only corpora. I'll
assume video, which is the extreme point.

There are two problems with video. No amount of signal manipulation,
however small, preserves the full scientific usefulness of the data.
On the other hand, no reasonable amount of "anonymisation", however large,
really ensures anonymity.

The first point is obvious if one contemplates trying to do psycholinguistic
experiments with the data. It would for example seriously compromise a
comprehension study if proper names are bleeped out, or even replaced by
others. No psychology reviewer would ever accept that such data is as
naturalistic as untreated data.

The second point is almost as obvious, because humans are adept at inferring
personal identity from all kinds of things, including voice quality, ear
shape, gait and so on.  Therefore, whatever one does, short of blanking
everything out, it is difficult to credibly claim that the risk of unintended
identification has been avoided. Geoffrey Sampson's Christine documentation
makes the case that such identification is highly undesirable.

The only way I can see to handle this is to deal with the problem at the
outset by making completely clear to participants what will happen to their
data, and obtaining informed consent. If this is not done, the data is
effectively lost to responsible researchers, and cannot be used except
at risk of infringing participants' rights. If, as in Sampson's case,
promises have been made to participants, those promises must be
honoured. It may or may not be possible to recover data that is both
useful and distributable under these circumstances.

The same difficulties are also present for audio (the BNC audio has never
been distributed, even though it exists). The risk of identification is just
too great and the consequences of that too severe to be acceptable. Although
the visual cues to identity are absent, speaking style persists. One may
feel that the risks are less, but they still exist.

And, here's the rub, the same arguments apply, albeit more weakly, to
text-only corpora. While voice quality is now absent, substantial cues to
personal identity may persist in lexis and other idiosyncrasies, not to
mention that people are extremely adept at reconstructing material from
context. Once again, the risks are arguably less than in the other media,
but they still exist. So, notwithstanding valiant efforts to anonymise in
such a way that the scientific usefulness of the data is preserved, the
original decision to promise anonymity comes back to haunt us. I lean to the
view that there is no difference of principle between the different media.

Is even Sampson's rigorous approach to anonymisation enough in practice?
Perhaps, but that depends on a very iffy judgement call. The lesson seems
to be that great care is needed in collecting informed consent for corpus
work.

None of this addresses the additional point made in Sampson's post about
collateral damage to people and organisations not involved in the recording.
I could imagine a prosecution against both participants and corpus
distributors for defamation or slander. That would be bad. Perhaps corpus
collectors need to indemnify participants against this, or perhaps it
suffices to warn people that they are (in effect) speaking in a public
place. Or perhaps we have a duty of care to ensure that our participants do
not put themselves at risk (doubly likely since many corpora include
contributions by children). And that leaves aside the much more likely cases
where nasty stuff in the corpus evokes resentment and unhappiness, but not
enough to lead to prosecutions.

Chris

==================================================================
Dr. Chris Brew,  Assistant Professor of Computational Linguistics
Department of Linguistics, 1712 Neil Avenue, Columbus OH 43210
Tel:  +614 292 5420 Fax: +614 292 8833
Web:http://www.ling.ohio-state.edu/~cbrew Email:cbrew at ling.osu.edu
==================================================================

>
> On Zheng Zhiping's posting, to me there is an important difference between
> "bad language" and individuals' names, or information that could lead to
> identification of individuals.  Like Tony McEnery, I don't believe there is
> any real reason to censor the "bad language"; it is important linguistic
> data, and we are all grown-ups.  But I do think that before such a resource
> is made public, strenuous efforts should be made to eliminate any possibility
> of users identifying either the individuals who produced the material, or
> any individuals or individual institutions written about.  Actually, under
> various national Data Protection laws I suspect it might be illegal not to
> do this, even if the material is simply held at one institution and not
> circulated.  But it ought to be done anyway, for reasons that I discuss at
> some length in the "ethics" section of the documentation file accompanying
> my CHRISTINE1 Corpus (available via the Web, from my home page
> www.grsampson.net; follow links to downloadable research resources and then
> CHRISTINE).  I discuss there what seems to me to have been inadequate
> practice in this respect in the spoken section of the British National Corpus.
> There are places where really damaging things are said in a quite casual
> way in conversation about people, or organizations, who/which might easily
> be identified by people who know them (and could probably be identified by
> strangers with only minimal detective work).  The recorded speakers had no
> motive to worry about this, but I believe corpus linguists have a responsibility
> not to let such casual gossip about identifiable people be turned into
> permanent public records.
>
> Geoffrey Sampson
>
>
> Prof. G.R. Sampson   MA   PhD   MBCS
>
> Professor of Natural Language Computing
> School of Cognitive & Computing Sciences
> University of Sussex
> Falmer, Brighton BN1 9QH, GB
>
> e-mail geoffs at cogs.susx.ac.uk (no attachments please)
> tel. +44 1273 678525
> fax  +44 1273 671320
> web http://www.grsampson.net
