[Corpora-List] Corpus size and accuracy of frequency listings

Serge Sharoff s.sharoff at leeds.ac.uk
Thu Apr 2 09:54:26 UTC 2009


My view is somewhat different.  It's very easy to extract every fifth
word from a corpus and observe differences in the resulting lists.  The
problem is that this tells us little about how reliable our original
list is; at best it gives some insight into the Zipfian distribution
model and its relationship to frequency lists.  Back in the 1970s there
were studies of this sort (I don't have the reference at hand, I'll try
to check a bit later; it might be one mentioned in Sinclair's 1987
COBUILD collection).  Among other things, such studies predicted that a
Brown-like corpus of one million words was good enough for a reliable
frequency list of the top 2,000 words (the figures are rough, I'll have
to check the reference).  This was based on the notion of confidence
intervals: 99% confidence that the words in the list are the same as
the words in the entire population of texts.  However, this result does
not mean that the frequency list is in any way reliable.
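
To see why, run the confidence-interval arithmetic for the kind of word
this concerns (a back-of-the-envelope sketch in Python, assuming a
simple binomial model with independent tokens, which is itself a
questionable assumption for text):

  import math

  def freq_ci(count, corpus_size, z=2.576):
      # 99% normal-approximation interval for a word's relative
      # frequency, assuming tokens are drawn independently
      p = count / corpus_size
      se = math.sqrt(p * (1 - p) / corpus_size)
      return p - z * se, p + z * se

  # a word near rank 2000 in Brown: 52 occurrences per million
  # (see the list below)
  lo, hi = freq_ci(52, 1_000_000)
  print(lo * 1e6, hi * 1e6)   # roughly 33 .. 71 per million

An observed 52 per million is compatible with anything from about 33 to
71 per million, so any two words in that band can swap ranks from
sample to sample.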
Have a look at the list from the Brown corpus (lemmatised with
TreeTagger):

rank  freq  lemma
1988    52  arthur
1989    52  stranger
1990    52  bag
1991    52  proud
1992    52  administrative
1993    52  los
1994    52  possess
1995    52  scientist
1996    52  liberty
1997    52  surround
1998    52  critic
1999    52  grin
2000    52  disappear


The problem obviously comes from the composition of the corpus: put
American texts in, and 'Los Angeles/Alamos' comes out (hence 'los');
put fiction in, and 'grin' comes out.  Thinning the original corpus
might swap 'grin' for 'bark', but on average it shouldn't change the
composition at all.
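
For anyone who wants to run the thinning experiment anyway, it is
trivial; a minimal sketch (assuming 'tokens' holds a lemmatised corpus
as a list of strings; none of this has been run against real data):

  from collections import Counter

  def top_n(tokens, n=2000):
      # the n most frequent lemmas in a token stream
      return {w for w, _ in Counter(tokens).most_common(n)}

  def overlap_after_thinning(tokens, n=2000, step=5):
      # keep every step-th token (a 1/step sample of the corpus)
      # and measure what share of the original top-n list survives
      thinned = tokens[::step]
      return len(top_n(tokens, n) & top_n(thinned, n)) / n

  # e.g. overlap_after_thinning(brown_lemmas, n=2000, step=5),
  # where brown_lemmas is whatever lemmatised stream you have

Whatever number this produces, it measures sampling noise within one
composition, not the composition problem itself.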

In my view a more qualitative approach can yield more revealing
results: take corpora with different compositions and look for
differences in distribution between them.  How does the list from
newswires differ from the list from blogs, or from fiction?  How does
ukWaC differ from a (hypothetical) usWac?  Any thoughts on this?
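
To make 'differences in distribution' concrete, a first pass could be
the log-likelihood keyness ranking commonly used for corpus comparison;
a sketch, where the two Counters stand for whichever pair of corpora
one picks:

  import math
  from collections import Counter

  def g2(a, b, n1, n2):
      # log-likelihood for one word, with counts a and b in corpora
      # of n1 and n2 tokens; larger = more distinctive of one corpus
      e1 = n1 * (a + b) / (n1 + n2)
      e2 = n2 * (a + b) / (n1 + n2)
      ll = (a * math.log(a / e1) if a else 0.0) \
         + (b * math.log(b / e2) if b else 0.0)
      return 2.0 * ll

  def keywords(freq1, freq2, top=50):
      # rank the combined vocabulary by how strongly each word
      # separates the two corpora (freq1, freq2 are Counters)
      n1, n2 = sum(freq1.values()), sum(freq2.values())
      vocab = set(freq1) | set(freq2)
      return sorted(vocab,
                    key=lambda w: g2(freq1[w], freq2[w], n1, n2),
                    reverse=True)[:top]

Run over newswires against blogs against fiction, the top of that
ranking is where the 'los' and 'grin' effects should surface.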

Serge


On Thu, 2009-04-02 at 09:48 +0100, Adam Kilgarriff wrote:

> Mark,
> 
> Nice question!
> 
> I'm pretty confident it hasn't been seriously studied yet.  A critical
> factor will relate to sample sizes (e.g. text lengths) and whether
> any action has been taken to modify (downwards) the frequencies of
> words occurring heavily in a small number of texts.  (In the Sketch
> Engine we use ARF, 'Average Reduced Frequency', for this; see also
> Stefan Gries's recent survey of dispersion measures.)
> 
> There are two ways to look at the question - empirical and
> analytical.  My hunch is that the analytical one - developing a
> (Zipfian) probability model for the corpus and exploring its
> consequences - will be the more enlightening (if tougher!): empirical
> approaches are easy to do and will give lots of data but unless they
> are compared to the predictions of a theory/model, they won't lead
> anywhere.
> 
> Adam  
> 
> 
> 2009/4/1 Mark Davies <Mark_Davies at byu.edu>
> 
>         I'm looking for studies that have considered how corpus size
>         affects the accuracy of word frequency listings.
>         
>         For example, suppose that one uses a 100 million word corpus
>         and a good tagger/lemmatizer to generate a frequency listing
>         of the top 10,000 lemmas in that corpus. If one were to then
>         take just every fifth word or every fiftieth word in the
>         running text of the 100 million word corpus (thus creating a
>         20 million or a 2 million word corpus), how much would this
>         affect the top 10,000 lemma list? Obviously it's a function of
>         the size of the frequency list as well -- things might not
>         change much in terms of the top 100 lemmas in going from a 20
>         million word to a 100 million word corpus, whereas they would
>         change much more for a 20,000 lemma list. But that's precisely
>         the type of data I'm looking for.
>         
>         Thanks in advance,
>         
>         Mark Davies
>         
>         ============================================
>         Mark Davies
>         Professor of (Corpus) Linguistics
>         Brigham Young University
>         (phone) 801-422-9168 / (fax) 801-422-0906
>         Web: davies-linguistics.byu.edu
>         
>         ** Corpus design and use // Linguistic databases **
>         ** Historical linguistics // Language variation **
>         ** English, Spanish, and Portuguese **
>         ============================================
>         
>         
> 
> 
> 
> 
> -- 
> ================================================
> Adam Kilgarriff
>  http://www.kilgarriff.co.uk              
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================