[Corpora-List] ACL proceedings paper in the American National Corpus

Nancy Ide ide at cs.vassar.edu
Tue Oct 1 18:27:10 UTC 2002


On Tuesday, October 1, 2002, at 04:08 AM, Michal Sulc wrote:

> I have read some remarks to the first question given by Ms Ide. But
> nowhere there was any distinction between "points of view" that I
> consider important here. Distinction between the corpus of "production"
> (where we are interested who wrote the text in the question - whether
> he or she is "really" American) and "reception" (where we are
> interested in texts that are read by Americans and has an influence on
> them).
> What do ANC-builders prefer?

I think that we "ANC-builders" are working to satisfy the "ANC-users"
;-), but this is my own take on the issue:

The idea is to have a corpus that includes data from which one can
gather information about how American English is commonly used, perhaps
in particular in various mainstream publications. Likely, you are
trying to produce some publication that will provide guidance on word
use, spelling, syntactic constructions, etc. that would best help a
reader sound like a native speaker and fully understand texts written
by and for American English speakers. Or, in the case of a
computational linguist, you want to be able to recognize or generate
lexical items or syntactic constructions that are common in, or typical
of, American English--especially those which differ from, say, British
English. Beyond this, you get into things that are correct by American
"rules" of grammar and usage, and perfectly understandable, but "just
not the way we would phrase it". This is usually the way in which even
the most proficient non-native speaker will eventually betray himself
or herself, so it is certainly of interest for ESL.

So I would say that "production" is what we should be interested in for
the ANC. While Americans may be exposed to lots of material that shows
marks of being non-native American (we are certainly exposed to a lot
of British English texts), the interest, at least for those who want to
describe, recognize/understand, or generate American English, would only
arise after the influence, if there is any, becomes evident by cropping
up significantly in texts produced by native speakers of American
English.

Footnote to the above: the plan for the ANC (dependent, of course, on
funding) is to add at least 10 million words every five years,
consisting of data produced during those five years. This would yield a
sort of "archaeological store" of American English in temporal layers
and enable consideration of the "reception" influence you mention
(albeit after the fact).



=======================================================

Nancy Ide

Professor and Chair
Department of Computer Science, Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================


