[Corpora-List] Corpus size and accuracy of frequency listings

Pascual Cantos pcantos at um.es
Thu Apr 2 09:37:34 UTC 2009


Hi Mark,

I'd suggest you have a look at Good-Turing frequency estimation. The
logic behind it is to approximate true (population) probabilities
from a given sample or samples. This might allow you to recalculate
sample probabilities, infer population probabilities, and consequently
evaluate the relation between corpus size and frequency.
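In case a concrete starting point helps, here is a minimal sketch in
Python of the basic, unsmoothed Good-Turing adjustment (not the smoothed
Simple Good-Turing of Gale and Sampson, which you would want for sparse
high counts); the dict of lemma counts is assumed to come from your own
pipeline:

    from collections import Counter

    def good_turing_adjusted_counts(freqs):
        """Basic (unsmoothed) Good-Turing: r* = (r + 1) * N_{r+1} / N_r.

        freqs: dict mapping lemma -> observed count r in the sample.
        Returns adjusted counts plus the estimated probability mass
        of unseen lemmas (N_1 / N).
        """
        N = sum(freqs.values())                 # sample size in tokens
        freq_of_freq = Counter(freqs.values())  # N_r: types occurring r times
        adjusted = {}
        for lemma, r in freqs.items():
            n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
            # fall back to the raw count where N_{r+1} is empty (high counts)
            adjusted[lemma] = (r + 1) * n_r1 / n_r if n_r1 else r
        p_unseen = freq_of_freq.get(1, 0) / N   # mass reserved for unseen types
        return adjusted, p_unseen

Dividing the adjusted counts by N gives population probability estimates
that can be compared across sample sizes.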

Warm regards from Murcia,

Pascual

> Mark,
>
> Nice question!
>
> I'm pretty confident it hasn't been seriously studied yet.  A critical
> factor will relate to sample sizes (eg text lengths) and whether any action
> has been taken to modify (downwards) frequencies of words occurring heavily
> in a small number of texts.  (In the Sketch Engine we use ARF 'Average
> Reduced Frequency' for this, see also Stefan Gries's recent survey of
> dispersion measures.)
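For readers new to ARF, a minimal Python sketch of one published
formulation (Savicky and Hlavacova's): the corpus is treated as circular
and each gap between successive occurrences is capped at the average gap
v = N/f. Whether the Sketch Engine uses exactly this variant is an
assumption here.

    def average_reduced_frequency(positions, corpus_size):
        """ARF = (1/v) * sum_i min(d_i, v), with v = corpus_size / f and
        d_i the gaps between successive occurrences of the word,
        treating the corpus as circular.

        positions: sorted token indices at which the word occurs.
        corpus_size: total number of tokens in the corpus.
        """
        f = len(positions)
        if f == 0:
            return 0.0
        v = corpus_size / f  # average gap if the word were evenly spread
        gaps = [positions[i] - positions[i - 1] for i in range(1, f)]
        gaps.append(corpus_size - positions[-1] + positions[0])  # wrap-around
        return sum(min(d, v) for d in gaps) / v

An evenly dispersed word keeps an ARF close to its raw frequency f, while
a word confined to one text drops towards 1.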
>
> There are two ways to look at the question - empirical and analytical. My
> hunch is that the analytical one - developing a (Zipfian) probability model
> for the corpus and exploring its consequences - will be the more
> enlightening (if tougher!): empirical approaches are easy to do and will
> give lots of data but unless they are compared to the predictions of a
> theory/model, they won't lead anywhere.
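To make the analytical route concrete, one rough sketch (assuming a plain
Zipf fit over the observed ranks, ignoring the Mandelbrot shift): fit the
exponent, then ask how noisy the count of the rank-r lemma would be in a
subsample; neighbouring ranks whose expected counts differ by less than a
couple of Poisson standard deviations are the ones liable to reorder.

    import math

    def zipf_fit(sorted_counts):
        """Least-squares fit of log f(r) = log C - s * log r over the
        observed ranks (descending counts); returns (C, s)."""
        xs = [math.log(r) for r in range(1, len(sorted_counts) + 1)]
        ys = [math.log(c) for c in sorted_counts]
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                 / sum((x - mx) ** 2 for x in xs))
        s = -slope
        return math.exp(my + s * mx), s

    def expected_count_in_sample(rank, C, s, sample_fraction):
        """Expected count of the rank-r lemma in a subsample of the given
        fraction, with its Poisson standard deviation."""
        mu = sample_fraction * C * rank ** (-s)
        return mu, math.sqrt(mu)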
>
> Adam
>
> 2009/4/1 Mark Davies <Mark_Davies at byu.edu>
>
>> I'm looking for studies that have considered how corpus size affects the
>> accuracy of word frequency listings.
>>
>> For example, suppose that one uses a 100 million word corpus and a good
>> tagger/lemmatizer to generate a frequency listing of the top 10,000 lemmas
>> in that corpus. If one were to then take just every fifth word or every
>> fiftieth word in the running text of the 100 million word corpus (thus
>> creating a 20 million or a 2 million word corpus), how much would this
>> affect the top 10,000 lemma list? Obviously it's a function of the size of
>> the frequency list as well -- things might not change much in terms of the
>> top 100 lemmas in going from a 20 million word to a 100 million word corpus,
>> whereas they would change much more for a 20,000 lemma list. But that's
>> precisely the type of data I'm looking for.
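The empirical side of this is easy to script once the lemmatised token
stream is in memory; a minimal sketch (variable and function names are
placeholders):

    from collections import Counter

    def top_lemmas(tokens, n):
        """Rank lemmas by raw frequency and return the top n."""
        return [lemma for lemma, _ in Counter(tokens).most_common(n)]

    def overlap_at_n(tokens, step, n):
        """Compare the top-n list from the full token stream with the
        list built from every `step`-th token, and return the proportion
        of the full-corpus top n that survives the thinning."""
        full = set(top_lemmas(tokens, n))
        thinned = set(top_lemmas(tokens[::step], n))
        return len(full & thinned) / n

    # e.g. overlap_at_n(lemmas, step=5, n=10000)   # the 20-million-word case
    #      overlap_at_n(lemmas, step=50, n=10000)  # the 2-million-word case

Plotting the overlap (or a rank correlation) as a function of n and step
gives exactly the size/accuracy curves in question.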
>>
>> Thanks in advance,
>>
>> Mark Davies
>>
>> ============================================
>> Mark Davies
>> Professor of (Corpus) Linguistics
>> Brigham Young University
>> (phone) 801-422-9168 / (fax) 801-422-0906
>> Web: davies-linguistics.byu.edu
>>
>> ** Corpus design and use // Linguistic databases **
>> ** Historical linguistics // Language variation **
>> ** English, Spanish, and Portuguese **
>> ============================================
>>
>>
>
>
>
> --
> ================================================
> Adam Kilgarriff
> http://www.kilgarriff.co.uk
> Lexical Computing Ltd                   http://www.sketchengine.co.uk
> Lexicography MasterClass Ltd      http://www.lexmasterclass.com
> Universities of Leeds and Sussex       adam at lexmasterclass.com
> ================================================
>




_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


