[Corpora-List] Corpus Sanitation

Mcenery, Tony eiaamme at exchange.lancs.ac.uk
Wed Nov 27 09:08:17 UTC 2002


Dear All,

I was interested to read in the recent posting to the list by Zhiping Zheng
(see below) that he was uncertain as to whether he should make his corpus
publicly available because it contained some 'uncensored words' (Zhiping's
point 2). I guess that this means 'bad language' (I assume it does not relate
to anonymization issues as they are covered in Zhiping's point 1). If this is
about 'bleeping out' words in corpora, shouldn't we encourage Zhiping not to do
this? Surely we want corpora to contain uncensored speech? The point, for me,
of using corpora is to describe/account for language as it is, rather than
language as we wish it to be.

Best,

Tony

----- Original Message -----
From: "Zhiping Zheng" <zzheng at umich.edu>
To: <corpora at hd.uib.no>
Sent: 21 November 2002 22:57
Subject: Re: [Corpora-List] Looking for some corpora about why-questions,
how-questions, and their answers.

> Dear all,
>
> I got several responses asking if I am planning to make my question
> list public. I think I should answer this question to the whole list.
>
> I am willing to make it public but I am not sure if I should do it
> right now. Here are the reasons:
>
> 1. Some questions ask information about specific people, not only
> celebrities, but also probably the questioners or other people with
> very close relationships to the questioners. This may raise some
> privacy issues. I prefer to remove these questions before making the
> question corpus public.
>
> 2. Some questions, actually not a small number, contain some
> uncensored words. I think these questions are improper to be in a
> corpus.
>
> 3. Many questions are not grammatically correct or contain spelling
> errors. I personally think this is OK because the questions are from
> the real world. I don't know what other researchers think about this.
>
> 4. Different researchers may have different expectations. For example,
> the original poster of this thread required why- and how-questions;
> other people have asked about statistical information on specific phrase
> groups. I would like to know if there are some common requirements
> from most or many researchers.
>
> 5. After I do something to the question archive and make it public, I
> am thinking of updating the public question corpus from time to time.
> This will take more effort, and I am not sure if I have enough energy
> to do it. I hope someone is willing to join me.
>
> I am waiting for your inputs. Especially if you are willing to do
> something for building the corpus, I am happy to work with you.
>
> Many thanks.
>
> Zhiping


> On Wed, 20 Nov 2002, ZHIPING ZHENG wrote:
> >
> > Dear Tian-Zuo and others,
> >
> > I have a big corpus which contains over 40K unique questions
> > collected from real world users by my AnswerBus Question Answering
> > System (http://www.answerbus.com/). I am willing to do some research
> > based on the data together with other people who have the same
> > interest.
> >
> > Zhiping


> > On Wed, 20 Nov 2002, tzshen wrote:
> >
> > > Dear all,
> > >
> > > I am doing some work to find answer patterns
> > > to help automatically answer some complex questions, which ask for
> > > a complex answer.
> > > I first focus on why-questions and how-questions.
> > > So I am eager to find some corpora that contain a large number of
> > > these two types of questions and corresponding answers.
> > > Does anyone know where I can find this kind of corpus or related
> > > resources?
> > > Resources about other complex questions and answers beyond why- and
> > > how-questions are also welcome.
> > >
> > > THANK YOU ALL VERY MUCH.
> > >
> > > Tian-Zuo, Shen


