[Corpora-List] ACL proceedings paper in the American National Corpus

Nancy Ide ide at cs.vassar.edu
Sat Sep 28 15:26:56 UTC 2002


On Friday, September 27, 2002, at 04:18 PM, Simon G. J. Smith wrote:

>>> Note that this applies to papers whose authors are native speakers of
>>> American English only.
>>
>> Two questions. What is your definition of native speaker? And how are
>> you going to determine who meets your definition? This is not as
>> trivial as it may sound.
>
>
> No, not trivial at all. I presume, though, that since (surely) no
> records are kept on researchers' linguistic origins, they will simply
> have to ask everyone if they think they qualify: just as job
> applicants and others are asked to supply details of what they
> consider to be their ethnic origin, for statistical purposes.
>
> But I'm still curious as to what happens in the not uncommon case
> where a paper is jointly authored by native and non-native speakers.
> It can't depend purely on the linguistic origin of the person doing
> the presentation, because it's the written paper that's being
> archived, not the talk. The first-named author, perhaps? Or is it safe
> to assume that if *any* native speakers contributed, someone will have
> rendered the text into a style sufficiently native-like to qualify
> anyway? Tricky.
>

Very tricky, and unlikely to be entirely solvable. Perhaps we should
have asked instead for ACL authors who are native speakers of American
English to identify themselves ;-)

We can only do our best to identify papers written by people who have
spent the greater part of their lives (most notably, their younger
years) in the US. Papers with non-native-speaker co-authors are
trickier, as you point out, but in principle the first author on a
paper should be the most influential in terms of the language contained
in it. In principle.

The goal of the ANC project is to compile a massive corpus that will
reflect American English usage. It has to be massive precisely so that
we can get thousands of examples in order to have a statistically
reliable sense of how the language is being used *for the most part*. I
think even if we were able to verify that every author in the ANC is a
bona fide native speaker of American English (assuming we could define
it), we'd get plenty of variation anyway. I assume that the BNC
did not check the pedigree of every author in the corpus either, yet we
can get a good sample of British English from that data. We're hoping
the same is true of the ANC.

That said, as you point out, this opens a huge can of worms, if only to
bring up the question of what American English is. Given the diversity
and mobility in this country, the question is even more difficult to
address than it might be for other languages/locations. Maybe the ANC
will provide a source for considering it.

=======================================================

Nancy Ide

Professor and Chair
Department of Computer Science, Vassar College
Poughkeepsie, NY 12604-0520 USA
Tel: +1 845 437-5988 Fax: +1 845 437-7498
ide at cs.vassar.edu

Chercheur Associe
Equipe Langue et Dialogue, LORIA/CNRS
Campus Scientifique - BP 239
54506 Vandoeuvre-les-Nancy FRANCE
Tel: +33 (0)3 83 59 20 47 Fax: +33 (0)3 83 41 30 79
ide at loria.fr

=======================================================


