[Corpora-List] Corpus Sanitation - no

Christoph Neumann neumann at crosslanguage.co.jp
Mon Dec 2 03:02:50 UTC 2002


As a mostly quiet observer of Corpora discussions, I was really shocked
by the underlying tendencies in the recent threads.

Mr. Zheng is using "uncensored" with the connotation "bad, undesirable",
(inadvertently?) implying that the opposite, "censored", is something
desirable. In fact, my immediate reaction to this thread's subject
itself was a cold sweat. "Corpus" + "sanitation": the word "sanitation"
in the sense of "clean-up of unwanted members that don't fit in with
current standards" was last used as a euphemism for the killing of
disabled and homosexual people in Nazi Germany ("Volkshygiene").

I hope that we are never going to be politically, sexually, or
religiously "correct", but only scientifically correct and adequate.

Should MT systems, for instance, refuse to translate sentences like
"Let's blow up the imperialist WTC of devil America", "I think
God/Buddha/Allah is an asshole" or "Are there any nice swinger clubs in
this town?"? Will we have "parent-guided" MT sponsored by Disney, or
party-conforming IR approved by the Chinese CP? Please, no.

The fact that the lingua franca of the linguistics and NLP community is
the language of the English-speaking countries does not imply that our
scientific standards are to be adapted to the doubtful ethical
"standards" of Anglo-American society, or to any other system of values
or beliefs.

>>>
>>> 2. Some questions, actually not a small number, contain some
>>> uncensored words. I think these questions are improper to be in a
>>> corpus.
>>
--
Dr. Christoph Neumann 		neumann at crosslanguage.co.jp
R&D MT, CrossLanguage KK
Tokyo, Japan
http://www.crosslanguage.co.jp/english/index.html


