[Corpora-List] ACL proceedings paper in the American National Corpus

Martin Wynne martin.wynne at ota.ahds.ac.uk
Mon Sep 30 11:04:39 UTC 2002


Nancy's posting set off some very different alarm bells for me. I would like
to draw attention to what I think would be another problem with the
inclusion of texts from ACL proceedings in the American National Corpus.

Let me start with an interesting case which I came across some years ago.
After hearing someone repeat the well-known fact that people don't say
'powerful tea' in English, I thought it would be worth checking whether
there was empirical evidence for this. I searched for the phrase in the BNC
and got 3 hits. All are from a text source listed as follows:

 "Large vocabulary semantic analysis for text recognition.
 Rose, Tony Gerard, u.p.. Sample containing about 42217 words of
 unpublished miscellanea (domain: applied science)"

and they are discussions of exactly the same point, i.e. the fact that you
don't say 'powerful tea'.

(Incidentally, I also searched in the whole Bank of English and found no
hits for "powerful tea", and 39 hits for "weak tea", so the original point
is not disproven.)
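
The kind of check described above, counting hits for a phrase and then seeing whether they all come from a single source text, can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-text corpus is invented toy data, not BNC or Bank of English material, and the names hits_by_source and top_source_share are hypothetical helpers, not part of any corpus tool.

```python
import re
from collections import Counter


def hits_by_source(texts, phrase):
    """Count case-insensitive whole-phrase matches per source id.

    `texts` is an iterable of (source_id, text) pairs.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for source_id, text in texts:
        n = len(pattern.findall(text))
        if n:
            counts[source_id] += n
    return counts


def top_source_share(counts):
    """Return (source_id, share of all hits) for the most frequent source."""
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    source_id, n = counts.most_common(1)[0]
    return source_id, n / total


# Invented toy corpus: one text that *mentions* the phrase while discussing it,
# and two texts of ordinary usage.
toy_corpus = [
    ("linguistics_thesis",
     "We say strong tea, not powerful tea. The phrase 'powerful tea' "
     "is odd; powerful tea is unattested."),
    ("novel", "She poured a cup of weak tea and sat down."),
    ("newspaper", "A powerful car, and a cup of strong tea."),
]

counts = hits_by_source(toy_corpus, "powerful tea")
print(counts)
print(top_source_share(counts))
```

If every hit for a phrase comes from one text, as in this toy case, that is a hint that the phrase may be mentioned rather than used, and the hits are worth reading in context before drawing any conclusion.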

In ACL articles you will also get citations of made-up examples like this,
plus listings of 'ungrammatical' sentences. Basically, this problem seems to
boil down to the fact that you get a lot of 'mention' rather than 'use' of
words and phrases in academic linguistic literature, and this could have a
fairly significant effect on the results of linguistic analysis of the
corpus. If one of the main reasons for building the corpus is to enable
researchers to analyse naturally occurring American English, in order to see
what does occur and what doesn't, then letting in lots of made-up example
sentences and phrases would make it less fit for the proposed purpose.

One way of avoiding this, and many other potential problems which can be
found in specialised language, would be to adopt an inclusion criterion
that texts in the corpus should not be too technical in nature.

__
Martin Wynne
martin.wynne at ota.ahds.ac.uk
Linguistics Officer
Oxford Text Archive

Oxford University Computing Services
13 Banbury Road
Oxford
UK - OX2 6NN
Tel: +44 1865 283299
Fax: +44 1865 273275


