[Corpora-List] extrapolating to 1 million

Orion Montoya orion at mdcclv.com
Sat May 16 23:04:46 UTC 2009


On May 16, 2009, at 3:16 PM, Oliver Mason wrote:
> But what's the point in extrapolating?  It's scientifically unsound...
> So the right answer should be: "No, you can't, and you should tell
> anybody who asks you to do this that this is not appropriate."

I've grown increasingly curious over the years, as people have posted
frequency queries here, about which applications of *any* frequency
data actually require a high level of accuracy -- and how a degraded
level of accuracy would even be detected.

I don't mean to question or deny the whole premise of frequency info.  
I know frequency can be very useful for all kinds of guessing and  
deciding and other NLP things. I mean to ask why, and to whom, it  
matters whether something shows up as the 10,000th-most-frequent-token  
in a corpus of a given size, rather than the 15,999th, or the  
26,000th.  Certainly, the first few tiers matter: top 10, top 100, top  
1,000, top 10,000;  but it seems to me that the farther you get down  
the list -- the farther out on this maybe-logarithmic scale --  the  
less meaningful any degree of accuracy becomes.
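
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the token frequencies follow a simple Zipf curve, count = c * N / rank, with c = 0.1; both the curve and the constant are illustrative assumptions and are not measured from any particular corpus. The function name expected_count is made up for this example.

```python
def expected_count(rank, corpus_size=1_000_000, c=0.1):
    """Expected occurrences of the token at `rank` under an assumed
    Zipf curve: count = c * corpus_size / rank.
    The constant c = 0.1 is an illustrative assumption."""
    return c * corpus_size / rank


# The three ranks mentioned above, in a 1-million-token corpus.
for rank in (10_000, 15_999, 26_000):
    print(rank, round(expected_count(rank), 1))
```

Under those assumptions, the three ranks correspond to roughly 10, 6 and 4 occurrences per million tokens, so adding or dropping a single document could easily move a token from one of these ranks to another.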

Given the arbitrariness of what might be included in any given corpus,
any overweening degree of precision seems likely to point to false (or
meaningless) conclusions about the "language," and to reflect only the
composition of the corpus. This gives all the more reason to heed
Adam's advice about preferring document frequency to raw whole-corpus
counts.
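
For anyone unsure of the distinction, a minimal sketch follows. It assumes a toy corpus of three whitespace-tokenized documents; the data and the function name raw_and_document_frequency are invented for illustration and are not taken from Adam's post.

```python
from collections import Counter


def raw_and_document_frequency(docs):
    """Return (raw_counts, doc_counts) for a list of tokenized documents.

    raw_counts[t] = total occurrences of token t in the whole corpus
    doc_counts[t] = number of documents containing t at least once
    """
    raw_counts = Counter()
    doc_counts = Counter()
    for tokens in docs:
        raw_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each token once per document
    return raw_counts, doc_counts


# Hypothetical toy corpus: one document happens to be about zebras.
docs = [
    "the zebra saw a zebra and another zebra".split(),
    "the cat sat on the mat".split(),
    "a dog chased the cat".split(),
]
raw, doc = raw_and_document_frequency(docs)
print("zebra:", raw["zebra"], "raw,", doc["zebra"], "documents")
print("cat:  ", raw["cat"], "raw,", doc["cat"], "documents")
```

By raw count "zebra" outranks "cat" here (3 against 2), but only because of one document; by document frequency "cat" comes out ahead (2 against 1).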

Is there a crucial application that I am not thinking of?

Yrs,

Orion

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


