[Corpora-List] ACL proceedings paper in the American National Corpus

Mon Sep 30 12:14:12 UTC 2002

All,

I'll second Martin's point about the hazards of specialised text.
Once you start getting a largish quantity of specialised material in
what aspires to be a general-purpose corpus, it rapidly gets
distorted.  My bugbear, in relation to the BNC, is GUT (Journal of
Gastroenterology and Hepatology) of which there are 600,000 words (ie
just 0.6%).  0.6% might not sound too large, but it is a very
specialised text type, and means that words like

gastric a
mucosa n
colitis n

leap up into the 8000 most frequent words of English (a list that
doesn't include

pad v
regulator n
wavelength n
prejudice v
iron v
voting a
escort n
dynasty n

)

This sort of thing can only be avoided by not having too much
data of any single specialised data type.  I would recommend a limit of
0.5% for linguistics papers in general, with no subspecialism (eg
computational linguistics, or, worse still, parsing) taking more than
a quarter of that, and a limit of 10,000 words from any single
document.
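
The three limits above can be checked mechanically. What follows is only a
minimal sketch in Python: the tuple layout (doc_id, field, subfield,
n_words) and the function name check_balance are assumptions made for
illustration, not part of any existing corpus toolkit.

```python
from collections import defaultdict
from fractions import Fraction

# Limits taken from the recommendation above; exact fractions avoid
# floating-point surprises when comparing against word counts.
FIELD_LIMIT = Fraction(5, 1000)   # 0.5% of the corpus for the whole field
SUBFIELD_LIMIT = FIELD_LIMIT / 4  # no subspecialism above a quarter of that
DOC_LIMIT = 10_000                # max words from any single document


def check_balance(docs, corpus_size, field="linguistics"):
    """Return a list of messages, one per limit that is exceeded.

    docs: iterable of (doc_id, field, subfield, n_words) tuples
          (a hypothetical layout chosen for this sketch).
    corpus_size: total number of words in the whole corpus.
    """
    problems = []
    field_total = 0
    subfield_totals = defaultdict(int)

    for doc_id, doc_field, subfield, n_words in docs:
        if doc_field != field:
            continue
        field_total += n_words
        subfield_totals[subfield] += n_words
        if n_words > DOC_LIMIT:
            problems.append(f"document {doc_id}: {n_words} words > {DOC_LIMIT}")

    if field_total > FIELD_LIMIT * corpus_size:
        problems.append(f"field {field}: {field_total} words > 0.5% of corpus")

    for subfield, total in sorted(subfield_totals.items()):
        if total > SUBFIELD_LIMIT * corpus_size:
            problems.append(f"subfield {subfield}: {total} words > 0.125% of corpus")

    return problems
```

Calling check_balance(docs, corpus_size) on a candidate text selection
returns an empty list when all three limits are respected.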

BNC used a sample size from a single document of 40,000 words as its
default.  However most of these documents weren't too specialised so
it didn't cause too many problems.  It's the combination of
substantial samples with narrow text-types that is invidious.

I've only referred to distorted frequency lists in the above.  They
are the easiest effect of distortion to describe.  There will also be
distortions of all sorts of other language-model components (bigrams,
trigrams, grammars, induced lexicons etc).  The problem is that
it's hard to describe what or how, and the distortions will usually
go unnoticed, or even feature as "interesting discoveries about the
language".  That's why it's important to beware these balance issues when
building a corpus in the first place.
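
For the frequency-list case at least, one rough check is to compare a
word's rank with and without the suspect source.  A minimal sketch in
Python follows; the function names and the toy counts are invented purely
for illustration and are not BNC figures.

```python
from collections import Counter


def rank_of(word, counts):
    """1-based rank of word by descending frequency, or None if absent."""
    if counts[word] == 0:  # a Counter returns 0 for missing words
        return None
    # Ties are broken alphabetically so the ordering is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ordered.index(word) + 1


def rank_shift(word, general, specialised):
    """Return (rank without the specialised source, rank with it)."""
    combined = general + specialised
    return rank_of(word, general), rank_of(word, combined)


# Toy counts, made up for this sketch only.
general = Counter({"the": 1000, "iron": 12, "escort": 10, "gastric": 1})
specialised = Counter({"the": 60, "gastric": 40, "mucosa": 30})

for w in ("gastric", "mucosa", "iron"):
    print(w, rank_shift(w, general, specialised))
```

A word whose rank jumps sharply once the specialised source is added is a
candidate for the kind of distortion described above.  The same comparison
is much harder to set up for bigrams or induced grammars, which is the
point of the preceding paragraph.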

(And of course, "distortion" is a problem term here as it implies
there is the possibility of a non-distorted resource.  But I won't get
into that one here...)

 > One way of avoiding this, and many other potential problems which can be
 > found in specialised language, would be to apply a criterion for inclusion
 > of texts in the corpus that they should not be too technical in nature.
 >

I'm not sure I agree here.  Technical language is an important part of
language, and we are undermining an open-minded view of language if we
exclude technical language wholesale. Maybe the corpus just needs to
be much (MUCH) bigger so it can include substantial quantities of lots
of different specialist text types, with none making up more than 0.1%
of the whole (hey, I know a corpus like that, it's called the web ;-) )

     adam

NEW!! MSc and Short Courses in Lexical Computing and Lexicography
Info at

http://www.itri.brighton.ac.uk/lexicom

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Adam Kilgarriff
Senior Research Fellow                         tel: (44) 1273 642919
Information Technology Research Institute           (44) 1273 642900
University of Brighton                         fax: (44) 1273 642908
Lewes Road
Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Martin Wynne writes:
 > Nancy's posting set off some very different alarm bells for me. I would like
 > to draw attention to what I think would be another problem with the
 > inclusion of texts from ACL proceedings in the American National Corpus.
 >
 > Let me start with an interesting case which I came across some years ago.
 > After hearing someone repeat the well-known fact that people don't say
 > 'powerful tea' in English, I thought it would be worth checking for
 > empirical evidence for this. I searched for the phrase in the BNC, and got 3
 > hits. All are from a text source listed as follows:
 >
 >  "Large vocabulary semantic analysis for text recognition.
 >  Rose, Tony Gerard, u.p.. Sample containing about 42217 words of unpublished
 > miscellanea (domain: applied science)"
 >
 > and they are discussions of exactly the same point, i.e. the fact that you
 > don't say 'powerful tea'.
 >
 > (Incidentally, I also searched in the whole Bank of English and found no
 > hits for "powerful tea", and 39 hits for "weak tea", so the original point
 > is not disproven.)
 >
 > In ACL articles you will also get citations of made-up examples like this,
 > plus listings of 'ungrammatical' sentences. Basically, this problem seems to
 > boil down to the fact that you get a lot of 'mention' rather than 'use' of
 > words and phrases in academic linguistic literature, and this could have a
 > fairly significant effect on the results of linguistic analysis of the
 > corpus. If one of the main reasons for building the corpus is to enable
 > researchers to analyse naturally occurring American English, in order to see
 > what does occur and what doesn't, then letting in lots of made-up example
 > sentences and phrases would make it less fit for the proposed purpose.
 >
 > One way of avoiding this, and many other potential problems which can be
 > found in specialised language, would be to apply a criterion for inclusion
 > of texts in the corpus that they should not be too technical in nature.
 >
 > __
 > Martin Wynne
 > martin.wynne at ota.ahds.ac.uk
 > Linguistics Officer
 > Oxford Text Archive
 >
 > Oxford University Computing Services
 > 13 Banbury Road
 > Oxford
 > UK - OX2 6NN
 > Tel: +44 1865 283299
 > Fax: +44 1865 273275
 >
