[Corpora-List] Corpus Sanitation

FIDELHOLTZ DOOCHIN JAMES LAWRENCE jfidel at siu.buap.mx
Fri Nov 29 01:49:00 UTC 2002


Hi all,
	I absolutely second Tony's post.  In fact, I have issues in
principle with anonymization, since it will obviously affect the
phonological aspects of the corpus.  Likewise, it will tend to skew the
data on proper nouns, as these are generally the items anonymized, and
these are issues which interest me particularly.  I know that people
have addressed the general issue, and the ethical questions are real,
but there must be some way around this problem.
		Jim

On Wed, 27 Nov 2002, Mcenery, Tony wrote:

>Dear All,
>
>I was interested to read in the recent posting to the list by Zhiping Zheng
>(see below) that he was uncertain as to whether he should make his corpus
>publicly available because it contained some 'uncensored words' (Zhiping's
>point 2). I guess that this means 'bad language' (I assume it does not relate
>to anonymization issues as they are covered in Zhiping's point 1). If this is
>about 'bleeping out' words in corpora, shouldn't we encourage Zhiping not to do
>this? Surely we want corpora to contain uncensored speech? The point, for me,
>of using corpora is to describe/account for language as it is, rather than
>language as we wish it to be.
>
...


Blues great and cognitive scientist Robert Johnson on the mind/brain:
"If ever I gotta bust your brains out, baby,
Hoooo, It'll make you lose your mind."

James L. Fidelholtz			e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje	tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades	fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO


