[Corpora-List] extrapolating to 1 million

Fri May 15 21:39:38 UTC 2009

Sorry, Jim, the law of large numbers is not relevant as it assumes independent
effects.  In language, effects are never independent (for details see "Language
is never ever ever random",
<http://kilgarriff.co.uk/Publications/2005-K-lineer.pdf>).

So the short answer to Tina's question -

"Could you tell me what the frequency would be in a corpus of 1 million if I
extrapolated from the frequency of  20 in a corpus of 300K?"

 is "no".  It all depends on the structure and composition of the 300,000
corpus, the structure and composition of the (probably hypothetical) 1m
corpus, how 'bursty' the word is, and how the two corpora relate to each
other. (For burstiness, see Ken Church's "Empirical estimates of
adaptation: the chance of two noriegas is closer to p/2 than p^2".) If the
word in question is term-like and all 20 occurrences come from one doc, then
it is likely that the frequency in the 1m corpus will be 20 (if the 1m corpus
includes that doc) or 0 (if it doesn't).
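
To make the contrast concrete, here is a minimal sketch in Python. The function names and the toy per-document counts are made up for illustration; the only figures taken from the thread are 20 occurrences, 300K and 1M. It shows the naive linear scaling and a crude check of how concentrated the occurrences are; it is not a proper burstiness measure.

```python
def scale_count(count, source_size, target_size):
    """Naive linear scaling: assumes the word has the same relative
    frequency in both corpora, which is exactly what is in doubt."""
    return count * target_size / source_size


def share_in_top_document(per_doc_counts):
    """Fraction of all occurrences found in the single richest document.
    A value near 1.0 means the word is concentrated in one document."""
    total = sum(per_doc_counts)
    if total == 0:
        return 0.0
    return max(per_doc_counts) / total


naive_estimate = scale_count(20, 300_000, 1_000_000)

# Two hypothetical ways the same 20 occurrences could be distributed.
evenly_spread = [1] * 20             # one occurrence in each of 20 docs
single_document = [20] + [0] * 19    # all 20 occurrences in one doc

print(round(naive_estimate, 1))                 # 66.7
print(share_in_top_document(evenly_spread))     # 0.05
print(share_in_top_document(single_document))   # 1.0
```

Both distributions give the same naive estimate, but only in the first case is that number a plausible guess; in the second, 20 or 0 is the more likely outcome.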

Extrapolation of frequencies from corpora is a risky business, highly
dependent on the sampling procedure for the corpus and the nature of the
term for which the frequency is being extrapolated.  It's generally safer to
extrapolate on the basis of document frequencies (eg, how many docs does the
word/term appear in) than word/term frequencies, though still, think hard
about the nature of the corpus and its claims to representativeness.
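
As a rough illustration of the difference between the two counts, the sketch below computes both for a toy corpus. The documents and the helper name are invented for the example, and each document is assumed to be already tokenised into a list of lowercase words.

```python
def term_and_document_frequency(docs, term):
    """Return (total occurrences of term, number of docs containing term)."""
    term_frequency = sum(doc.count(term) for doc in docs)
    document_frequency = sum(1 for doc in docs if term in doc)
    return term_frequency, document_frequency


# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["noriega", "said", "noriega", "would", "meet", "noriega"],
    ["the", "talks", "were", "held", "in", "panama"],
    ["the", "report", "was", "published", "today"],
    ["prices", "rose", "again", "the", "bank", "said"],
]

print(term_and_document_frequency(docs, "noriega"))  # (3, 1)
print(term_and_document_frequency(docs, "the"))      # (3, 3)
```

The two words have the same term frequency here, but the document frequencies show that one of them comes entirely from a single document, which is the case where scaling up the word count is least trustworthy.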

 Adam

2009/5/15 James L. Fidelholtz <fidelholtz at gmail.com>

> Hi, Lluis, Tina, & Al.,
>
> Firstly, the math is a little kinky (though Lluis is right--it's roughly
> OK): it should be 20 * 1M/300K, or about 66.7.
>
> The point Lluis makes about the corpus containing more rarer words as we
> augment the size of the corpus is, of course, correct. Nevertheless (here I
> haven't done much work, but I just appeal to common sense and the 'law of
> large numbers' (not sure this is relevant, but 300K is a *pretty* large
> number)), we should expect, even with more obscure words to muddy up the
> picture, that the percentage of *common* words in the 300K corpus should be
> roughly the same in a corpus of 1M words, especially (but not quite only,
> for the more common words) if the corpora are selected from similar
> universes. Naturally, different selection criteria might affect even very
> common words, and it has been shown many times that the 'rarer' the words
> are, the more variable the exact percentage can be, but I wouldn't expect a
> priori that ever bigger corpora should lower the percentages of common (or
> even necessarily of rare) words. Indeed, for the hapax legomena, say, that
> enter in the new 'complement' to the corpus, their percentage even
> *increases* from 0 to 0.0001, correspondingly more for the other new words.
>
> Of course there can always be variations in the percentages. But, equally
> always, we *expect* that our sampling of the universe will give us for a
> word W something reasonably close to its real percentage frequency. And that
> when we repeat the process (or augment it), we will again get reasonably
> close to its 'real' frequency, so that we expect both frequencies to be
> close to each other. The real world often lets us down (and don't bet the
> family farm on any of this), but I guess statisticians tend to be optimists
> in this regard. And mathematicians even more (after all, we have an edge,
> and so tend to gain 5 family farms for each one we lose). In this sense,
> think: Bell curve, which, with the appropriate tweaks, is the exact
> representation of what our expectations should be in a particular case.
>
> Jim
>
>   On 5/15/09, Lluís Padró <padro at lsi.upc.edu> wrote:
>
>>   Tina Waldman wrote:
>>
>> Dear members
>> Could you tell me what the frequency would be in a corpus of 1 million if
>> I extrapolated from the frequency of  20 in a corpus of 300K?
>>
>> Would it be 60 - 20 x 3 ?
>>
>>
>>    As a rough estimate, that may work.
>>
>>    Nevertheless, due to Zipf's law, when you go from 300K to 1M, you're
>> getting lots of previously unseen words with very low frequencies, and they
>> modify the probability distribution.
>>
>>    For this and other reasons, relative frequencies turn out to be less
>> stable than that rough estimate suggests when you move to larger corpora.
>>
>>    You can find out more about it in:
>> Baroni M., Evert S., "Words and echoes: assessing and mitigating the
>> non-randomness problem in word frequency distribution modeling".
>> In: Proceedings of ACL 2007, East Stroudsburg PA: ACL, 2007, p. 904-911.
>> Conference: "Association for Computational Linguistics (ACL)", Prague,
>> 23rd-30th June 2007.
>>
>>    best,
>>
>>
>> --
>> ------------------------------
>>  *Lluís Padró* Despatx Ω-S112 Campus Nord UPC C/ Jordi Girona 1-3 08034
>> Barcelona, Spain
>> Tel: +34 934 134 015 Fax: +34 934 137 833
>> padro at lsi.upc.edu <padro at lsi.upc.es> www.lsi.upc.edu/~padro<http://www.lsi.upc.es/~padro>
>>  ------------------------------
>> UNIVERSITAT POLITÈCNICA DE CATALUNYA Dept. Llenguatges i Sistemes
>> Informàtics <http://www.lsi.upc.es/> TALP <http://www.talp.upc.es/>Research Center
>> ------------------------------
>>
>
> _______________________________________________ Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>
>

-- 
James L. Fidelholtz
Posgrado en Ciencias del Lenguaje
Instituto de Ciencias Sociales y Humanidades
Benemérita Universidad Autónoma de Puebla, MÉXICO

 --
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================