[Corpora-List] Corpus Sanitation
FIDELHOLTZ DOOCHIN JAMES LAWRENCE
jfidel at siu.buap.mx
Fri Nov 29 01:49:00 UTC 2002
Hi all,
I absolutely second Tony's post. In fact, I have issues in
principle with anonymization, since replacing names necessarily alters
the phonological makeup of the corpus. Likewise, it will tend to skew
the data on proper nouns, as these are generally the items anonymized,
and both of these issues interest me particularly. I know that
people have addressed the general issue, and the ethical questions are
real, but there must be some way around this problem.
Jim
On Wed, 27 Nov 2002, Mcenery, Tony wrote:
>Dear All,
>
>I was interested to read in the recent posting to the list by Zhiping Zheng
>(see below) that he was uncertain as to whether he should make his corpus
>publicly available because it contained some 'uncensored words' (Zhiping's
>point 2). I guess that this means 'bad language' (I assume it does not relate
>to anonymization issues as they are covered in Zhiping's point 1). If this is
>about 'bleeping out' words in corpora, shouldn't we encourage Zhiping not to do
>this? Surely we want corpora to contain uncensored speech? The point, for me,
>of using corpora is to describe/account for language as it is, rather than
>language as we wish it to be.
>
...
Blues great and cognitive scientist Robert Johnson on the mind/brain:
"If ever I gotta bust your brains out, baby,
Hoooo, It'll make you lose your mind."
James L. Fidelholtz e-mail: jfidel at siu.buap.mx
Posgrado en Ciencias del Lenguaje tel.: +(52-2)229-5500 x5705
Instituto de Ciencias Sociales y Humanidades fax: +(01-2) 229-5681
Benemérita Universidad Autónoma de Puebla, MÉXICO