[Corpora-List] extrapolating to 1 million
Orion Montoya
orion at mdcclv.com
Sat May 16 23:04:46 UTC 2009
On May 16, 2009, at 3:16 PM, Oliver Mason wrote:
> But what's the point in extrapolating? It's scientifically unsound...
> So the right answer should be: "No, you can't, and you should tell
> anybody who asks you to do this that this is not appropriate."
I've grown increasingly curious over the years, as people have posted
frequency queries here, about what applications people are using *any*
frequency data for, for which a high level of accuracy is important --
and how a degraded level of accuracy would be detectable.
I don't mean to question or deny the whole premise of frequency info.
I know frequency can be very useful for all kinds of guessing and
deciding and other NLP things. I mean to ask why, and to whom, it
matters whether something shows up as the 10,000th-most-frequent-token
in a corpus of a given size, rather than the 15,999th, or the
26,000th. Certainly, the first few tiers matter: top 10, top 100, top
1,000, top 10,000; but it seems to me that the farther you get down
the list -- the farther out on this maybe-logarithmic scale -- the
less meaningful any degree of accuracy becomes.
Given the arbitrariness of what might be included in any given corpus,
any overweening degree of precision seems likely to point to false (or
meaningless) conclusions about the "language," and only really be
reflective of the composition of the corpus. This gives all the more
reason to heed Adam's advice about document frequency over raw whole-
corpus counts.
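
To make that contrast concrete, here is a minimal sketch of one way to
read "document frequency" against a raw count. The toy documents, the
whitespace tokenization, and the function name are all my own
assumptions for illustration, not anything from Adam's post or from a
particular toolkit.

```python
from collections import Counter


def raw_and_document_frequency(documents):
    """Return (raw counts, document counts) for a list of token lists.

    Hypothetical helper: raw counts every occurrence of a token, while
    the document count adds at most 1 per document that contains it.
    """
    raw = Counter()
    doc_freq = Counter()
    for tokens in documents:
        raw.update(tokens)            # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per token
    return raw, doc_freq


# Toy corpus (invented): one document repeats a single rare word.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "zebra zebra zebra zebra zebra".split(),
]

raw, doc_freq = raw_and_document_frequency(docs)

# Compare the two views for a few tokens.
for token in ("the", "sat", "zebra"):
    print(token, raw[token], doc_freq[token])
```

In this toy case "zebra" has the highest raw count purely because one
document repeats it, while its document count shows that it turns up in
only one of the three. That is the kind of corpus-composition effect I
mean.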
Is there a crucial application that I am not thinking of?
Yrs,
Orion
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora