[Corpora-List] ACL proceedings paper in the American National Corpus

Mon Sep 30 13:09:31 UTC 2002

The sort of distortion that Adam Kilgarriff cites has been with us from the
beginning.  Look at the Brown Corpus, 1 million words (so large in its day)
and look at the high frequency of the English word 'jabberwocky.'

This is really raising questions about the conceptual foundations of the
whole enterprise.  Have we assumed that 'English' is not simply a collective
term, representing a range of specializations and dialects that no one could
possibly learn entirely?  Have we assumed that "I speak English" has some
denotational sense?  If so, what?

Have we assumed that 'English' has a boundary, and it is our job to find it?
Probably not, but then we should avoid boundary-finding activities.

The part of our conceptual foundations that might be the most troublesome is
the latter one.  We tend to try to define 'English' as a category.  That
leads to set theory, and to seeking the boundaries of 'English' and
'speakers of English.'

It does not need to be so.  Perhaps we can see these categories in terms of
prototypes, and seek the central, most representative cases rather than the
boundaries.  It is an alternative.

I have no idea whether such an orientation would conflict with basic
assumptions that are current in Corpus Based Linguistics.  It seems
worthwhile to ask.

Bill Mann

----- Original Message -----
From: "Adam Kilgarriff" <adam.kilgarriff at itri.brighton.ac.uk>
To: "Martin Wynne" <martin.wynne at ota.ahds.ac.uk>
Cc: <corpora at hd.uib.no>
Sent: Monday, September 30, 2002 8:14 AM
Subject: RE: [Corpora-List] ACL proceedings paper in the American National Corpus

>
> All,
>
> I'll second Martin's point about the hazards of specialised text.
> Once you start getting a largish quantity of specialised material in
> what aspires to be a general-purpose corpus, it rapidly gets
> distorted.  My bugbear, in relation to the BNC, is GUT (Journal of
> Gastroenterology and Hepatology) of which there are 600,000 words (ie
> just 0.6%).  0.6% might not sound too large, but it is a very
> specialised text type, and means that words like
>
> gastric a
> mucosa n
> colitis n
>
> leap up into the top 8000 most frequent words of English (a list that
> doesn't include
>
> pad v
> regulator n
> wavelength n
> prejudice v
> iron v
> voting a
> escort n
> dynasty n
>
> )
>
> This sort of thing can only be avoided by not having too much
> data of any single specialised data type.  I would recommend a limit of
> 0.5% for linguistics papers in general, with no subspecialism (eg
> computational linguistics, or, worse still, parsing) taking more than
> a quarter of that, and a limit of 10,000 words from any single
> document.
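
The limits suggested above can be sketched as a simple check. This is only
an illustration under stated assumptions: the function name check_caps is
hypothetical, the corpus size of 100 million words is an assumption chosen
to be consistent with 600,000 words being 0.6%, and the document list is
invented.

```python
# Sketch only: the corpus size and the document list are made-up
# illustration values, not figures taken from any real corpus.

def check_caps(docs, corpus_size, domain_cap=0.005, sub_share=0.25,
               doc_cap=10_000):
    """Return a list of cap violations for one domain (e.g. linguistics).

    docs: list of (doc_id, subspecialism, word_count) tuples.
    domain_cap: maximum share of the whole corpus for the domain (0.5%).
    sub_share: maximum share of the domain limit for one subspecialism.
    doc_cap: maximum number of words sampled from a single document.
    """
    problems = []
    domain_limit = domain_cap * corpus_size
    sub_limit = sub_share * domain_limit

    domain_total = 0
    sub_totals = {}
    for doc_id, sub, words in docs:
        if words > doc_cap:
            problems.append(f"document {doc_id}: {words} words > {doc_cap}")
        domain_total += words
        sub_totals[sub] = sub_totals.get(sub, 0) + words

    if domain_total > domain_limit:
        problems.append(
            f"domain: {domain_total} words > {domain_limit:.0f}")
    for sub, total in sorted(sub_totals.items()):
        if total > sub_limit:
            problems.append(
                f"subspecialism {sub}: {total} words > {sub_limit:.0f}")
    return problems


# Invented sample: three documents from two subspecialisms.
docs = [
    ("d1", "parsing", 12_000),
    ("d2", "parsing", 9_000),
    ("d3", "phonology", 8_000),
]

big = check_caps(docs, corpus_size=100_000_000)   # only d1 is too long
small = check_caps(docs, corpus_size=1_000_000)   # every cap is exceeded
for line in big:
    print(line)
```

With the assumed 100-million-word corpus only the per-document cap is hit;
with a 1-million-word corpus the same three documents break every limit,
which is the point about proportion rather than absolute size.
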
>
> BNC used a sample size from a single document of 40,000 words as its
> default.  However most of these documents weren't too specialised so
> it didn't cause too many problems.  It's the combination of
> substantial samples with narrow text-types that is invidious.
>
> I've only referred to distorted frequency lists in the above.  They
> are the easiest effect of distortion to describe.  There will also be
> distortions of all sorts of other language-model components (bigrams,
> trigrams, grammars, induced lexicons etc)  - the problem is,
> it's hard to describe what or how and the distortions will usually
> go unnoticed, or even feature as "interesting discoveries about the
> language".  That's why it's important to beware these balance issues when
> building a corpus in the first place.
>
> (And of course, "distortion" is a problem term here as it implies
> there is the possibility of a non-distorted resource.  But I won't get
> into that one here...)
>
>  > One way of avoiding this, and many other potential problems which can
>  > be found in specialised language, would be to apply a criterion for
>  > inclusion of texts in the corpus that they should not be too technical
>  > in nature.
>  >
>
> I'm not sure I agree here.  Technical language is an important part of
> language, and we are undermining an open-minded view of language if we
> exclude technical language wholesale. Maybe the corpus just needs to
> be much (MUCH) bigger so it can include substantial quantities of lots
> of different specialist text types, with none making up more than 0.1%
> of the whole (hey, I know a corpus like that, it's called the web ;-) )
>
>      adam
>
>
> NEW!! MSc and Short Courses in Lexical Computing and Lexicography
> Info at
>
> http://www.itri.brighton.ac.uk/lexicom
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> Adam Kilgarriff
> Senior Research Fellow                         tel: (44) 1273 642919
> Information Technology Research Institute           (44) 1273 642900
> University of Brighton                         fax: (44) 1273 642908
> Lewes Road
> Brighton BN2 4GJ         email:      Adam.Kilgarriff at itri.bton.ac.uk
> UK                       http://www.itri.bton.ac.uk/~Adam.Kilgarriff
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
>
>
>
> Martin Wynne writes:
>  > Nancy's posting set off some very different alarm bells for me. I would
>  > like to draw attention to what I think would be another problem with the
>  > inclusion of texts from ACL proceedings in the American National Corpus.
>  >
>  > Let me start with an interesting case which I came across some years
>  > ago. After hearing someone repeat the well-known fact that people don't
>  > say 'powerful tea' in English, I thought it would be worth checking for
>  > empirical evidence for this. I searched for the phrase in the BNC, and
>  > got 3 hits. All are from a text source listed as follows:
>  >
>  >  "Large vocabulary semantic analysis for text recognition.
>  >  Rose, Tony Gerard, u.p.. Sample containing about 42217 words of
>  >  unpublished miscellanea (domain: applied science)"
>  >
>  > and they are discussions of exactly the same point, i.e. the fact that
>  > you don't say 'powerful tea'.
>  >
>  > (Incidentally, I also searched in the whole Bank of English and found
>  > no hits for "powerful tea", and 39 hits for "weak tea", so the original
>  > point is not disproven.)
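
The check described above, where every hit for a phrase turns out to come
from one source, can be sketched as a per-source count. This is a minimal
illustration, not the query interface of the BNC or the Bank of English:
the function name hits_by_source is hypothetical and the three toy texts
are invented.

```python
from collections import Counter


def hits_by_source(texts, phrase):
    """Count case-insensitive occurrences of `phrase` in each source.

    texts: dict mapping a source id to its full text.
    Returns a Counter holding only the sources with at least one hit.
    """
    needle = phrase.lower()
    counts = Counter()
    for source_id, text in texts.items():
        n = text.lower().count(needle)
        if n:
            counts[source_id] = n
    return counts


# Invented toy data: one source that discusses the phrase, two that do not.
texts = {
    "thesis": ("We say strong tea, not powerful tea. Powerful tea sounds "
               "odd; nobody writes powerful tea."),
    "novel": "She poured a cup of weak tea and sat down.",
    "news": "A powerful engine and a cup of strong tea.",
}

counts = hits_by_source(texts, "powerful tea")
total = sum(counts.values())
print(f"{total} hits in {len(counts)} source(s): {dict(counts)}")
```

A raw total alone would hide the fact that all hits are 'mention' within a
single text; reporting the number of sources alongside the total makes that
visible.
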
>  >
>  > In ACL articles you will also get citations of made-up examples like
>  > this, plus listings of 'ungrammatical' sentences. Basically, this
>  > problem seems to boil down to the fact that you get a lot of 'mention'
>  > rather than 'use' of words and phrases in academic linguistic
>  > literature, and this could have a fairly significant effect on the
>  > results of linguistic analysis of the corpus. If one of the main reasons
>  > for building the corpus is to enable researchers to analyse naturally
>  > occurring American English, in order to see what does occur and what
>  > doesn't, then letting in lots of made-up example sentences and phrases
>  > would make it less fit for the proposed purpose.
>  >
>  > One way of avoiding this, and many other potential problems which can
>  > be found in specialised language, would be to apply a criterion for
>  > inclusion of texts in the corpus that they should not be too technical
>  > in nature.
>  >
>  > __
>  > Martin Wynne
>  > martin.wynne at ota.ahds.ac.uk
>  > Linguistics Officer
>  > Oxford Text Archive
>  >
>  > Oxford University Computing Services
>  > 13 Banbury Road
>  > Oxford
>  > UK - OX2 6NN
>  > Tel: +44 1865 283299
>  > Fax: +44 1865 273275
>  >
>