[Corpora-List] Chomsky
Mike Maxwell
maxwell at ldc.upenn.edu
Thu Oct 14 19:37:09 UTC 2004
Someone wrote:
>> I'm looking for the exact bibliographical reference where
>> we can find Chomsky's idea that a corpus presents a language
>> that is defective or corrupted.
>
Ronald J. Craig wrote:
> I don't have Aspects at hand (I think maybe I burned it?)
"A record of natural speech will show numerous false starts, deviations
from rules, changes of plan in mid-course, and so on. The problem for
the linguist, as well as for the child learning the language, is to
determine from the data of performance the underlying system of rules
that has been mastered by the speaker-hearer and that he puts to use in
actual performance." (Chomsky, Aspects of the Theory of Syntax, 1965, p. 4)
FWIW, I don't find anything in the above to disagree with. If you think
otherwise, you might want to consider your reaction to the various
"Bushisms" that are floating around :-).
Notice also that Chomsky says _speech_, i.e. he's talking about spoken
(transcribed) corpora, not prepared texts. I would say the same is
probably true of written (non-transcribed) texts, though only to a
lesser degree. I just had occasion to worry about how the name "Kim
Jong Il" was to be translated into Panjabi. (Long story.) The
translator had represented the third part of that name using Latin
letters, rather than Gurmukhi (the Panjabi writing system), as "Il"
(eye-el). When questioned, he said that it stood for "the second", and
that it didn't make sense to translate that into Panjabi. Now there are
two problems. First, he's presumably thinking of "II" (eye-eye), even
though he had written "Il" (eye-el). Second, there's an empirical
question (to use Chomsky's term): what is the last word _supposed_ to
be? Of course it's Korean (borrowed into the Korean language from
Chinese, I'm told), where it's written in a different writing system
(Hangul); so to rephrase the question, what is the appropriate
transliteration into Latin letters and semi-English spelling? If you go
on the web, Google finds 231,000 instances of "Kim Jong Il", and
6,300 instances of "Kim Jong II"--and for good measure, a couple hundred
instances of "Kim Jong ll" (el-el). So as corpus linguists, you can
rejoice that the corpus search gave the right answer, i.e. the one that
comes closest to an English spelling of the Korean name. But you also
have to ask, what is the status of the 6,300 cases where it was
misspelled--are those errors, or just different data? I think I know what
Kim Jong Il would tell you...I think he would tell you the Web was both
defective and corrupted!
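
For anyone who wants to run the same kind of comparison on their own
corpus rather than on Google, here is a minimal sketch in Python. The
helper names (count_variants, shares) and the sample text are made up for
illustration; only the two hit counts are the ones quoted above. Matching
is case-sensitive on purpose, since "Il", "II" and "ll" differ only in
letters that look alike.

```python
import re
from collections import Counter

# The three spellings discussed above: eye-el, eye-eye, el-el.
VARIANTS = ["Kim Jong Il", "Kim Jong II", "Kim Jong ll"]

def count_variants(text, variants):
    """Count case-sensitive, whole-phrase occurrences of each variant."""
    counts = Counter()
    for v in variants:
        # Lookarounds keep "Kim Jong Il" from matching inside a longer token.
        pattern = r"(?<!\w)" + re.escape(v) + r"(?!\w)"
        counts[v] = len(re.findall(pattern, text))
    return counts

def shares(counts):
    """Turn raw counts into proportions of the total."""
    total = sum(counts.values())
    return {v: (n / total if total else 0.0) for v, n in counts.items()}

# Made-up sample text, standing in for a real corpus.
sample = ("Kim Jong Il spoke. Reports on Kim Jong II vary. "
          "Kim Jong Il again. Kim Jong ll.")
sample_counts = count_variants(sample, VARIANTS)

# The two exact hit counts quoted in this message (the el-el figure was
# only "a couple hundred", so it is left out rather than guessed).
reported = {"Kim Jong Il": 231_000, "Kim Jong II": 6_300}
reported_shares = shares(reported)

if __name__ == "__main__":
    print(dict(sample_counts))
    print({v: round(s, 4) for v, s in reported_shares.items()})
```

The counts only tell you which spelling is most frequent; whether the
minority spellings are errors or data is still the linguist's call.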
--
Mike Maxwell
Linguistic Data Consortium
maxwell at ldc.upenn.edu