[Corpora-List] Corpus Sanitation

Adam Kilgarriff Adam.Kilgarriff at itri.brighton.ac.uk
Sat Nov 30 13:02:42 UTC 2002


All,

    As academics, we would like to leave the data entirely uncorrupted,
so we'd rather not anonymise - but then ethical issues mean, for some
purposes, we have to.

    Exactly the same applies to 'bad' (taboo, as LDOCE3 marks it)
language.  I have datasets I'd like to give easy access to, for language
learners.  Do I want children/young people accessing my website/CD-ROM
to encounter taboo language?  Will I be exposed to lawsuits from shocked
parents if I do?

    Like anonymisation, it's hard.  Throwing out sentences/texts with
taboo words or strings is at least straightforward - you can find them
without exhaustive reading.  But, as with anonymisation where there are
no explicit names, there are taboo texts without taboo words.  So if you
want to be confident you are not disseminating taboo language, and
your sources aim to cover the breadth of language use, an awful lot of
reading is required.  I recently had a conversation with a dictionary
publisher facing the same predicament: yes, he did an awful lot of reading.
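
    To make the "straightforward" half concrete, here is a minimal
sketch of the word-list filter, in Python.  It is an illustration only,
not a method anyone in this thread has said they use: the blocklist
entries and the function names are hypothetical placeholders, and a real
list would need to be curated by hand.  It matches whole words only, and
by construction it cannot catch taboo texts that contain no listed word.

```python
import re

# Hypothetical placeholder blocklist (an assumption for illustration);
# a real list would be compiled and reviewed by hand.
BLOCKLIST = {"darn", "heck"}

def build_pattern(words):
    """Compile one case-insensitive pattern matching any listed word as a whole word."""
    alternatives = "|".join(re.escape(w) for w in sorted(words))
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

def split_by_blocklist(sentences, words=BLOCKLIST):
    """Return (kept, rejected); a sentence is rejected if it contains a listed word."""
    pattern = build_pattern(words)
    kept, rejected = [], []
    for sentence in sentences:
        # Route each sentence to the appropriate list based on a single search.
        (rejected if pattern.search(sentence) else kept).append(sentence)
    return kept, rejected
```

    Keeping the rejected sentences in a separate list, rather than
discarding them, means the unfiltered data stays available for purposes
where it is wanted.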

    Adam

Mcenery, Tony wrote:

>Dear All,
>
>I was interested to read in the recent posting to the list by Zhiping Zheng
>(see below) that he was uncertain as to whether he should make his corpus
>publicly available because it contained some 'uncensored words' (Zhiping's
>point 2). I guess that this means 'bad language' (I assume it does not relate
>to anonymization issues as they are covered in Zhiping's point 1). If this is
>about 'bleeping out' words in corpora, shouldn't we encourage Zhiping not to do
>this? Surely we want corpora to contain uncensored speech? The point, for me,
>of using corpora is to describe/account for language as it is, rather than
>language as we wish it to be.
>
>Best,
>
>Tony
>
>----- Original Message -----
>From: "Zhiping Zheng" <zzheng at umich.edu>
>To: <corpora at hd.uib.no>
>Sent: 21 November 2002 22:57
>Subject: Re: [Corpora-List] Looking for some corpora about why-questions,
>how-questions, and their answers.
>
>
>
>>Dear all,
>>
>>I got several responses asking if I am planning to make my question
>>list public. I think I should answer this question to the whole list.
>>
>>I am willing to make it public but I am not sure if I should do it
>>right now. Here are the reasons:
>>
>>1. Some questions ask for information about specific people, not only
>>celebrities, but also probably the questioners or other people with
>>very close relationships to the questioners. This may raise some
>>privacy issues. I would prefer to remove these questions before making
>>the question corpus public.
>>
>>2. Some questions, actually not a small number, contain some
>>uncensored words. I think these questions are improper to be in a
>>corpus.
>>
>>3. Many questions are not grammatically correct or contain spelling
>>errors. I personally think this is ok because the questions are from
>>the real world. I don't know what other researchers think about this.
>>
>>4. Different researchers may have different expectations. For example,
>>the original poster of this thread required why- and how- questions;
>>other people have asked about statistical information on specific phrase
>>groups. I would like to know if there are some common requirements
>>from most or many researchers.
>>
>>5. After I do something to the question archive and make it public, I
>>am thinking of updating the public question corpus from time to time.
>>This will take more effort and I am not sure if I have enough energy to
>>do this. I hope someone is willing to join me.
>>
>>I am waiting for your input. Especially if you are willing to help
>>with building the corpus, I am happy to work with you.
>>
>>Many thanks.
>>
>>Zhiping
>>
>>
>
>
>

--
New! MSc and Short Courses in Lexical Computing and Lexicography
http://www.itri.brighton.ac.uk/lexicom

====================================================
Adam Kilgarriff
Senior Research Fellow
ITRI                           t: +44 (0)1273 642919
University of Brighton         f: +44 (0)1273 642908
Lewes Road               e: adam at itri.brighton.ac.uk
Brighton BN2 0BL
UK
     http://www.itri.brighton.ac.uk/~Adam.Kilgarriff
====================================================


