[Corpora-List] Corpus size and accuracy of frequency listings

A.DeRoeck A.Deroeck at open.ac.uk
Fri Apr 3 12:09:49 UTC 2009


Mark, Diana, Justin :-)

Words (all words, including function words, but let's just fix on lemmata) behave burstily. The effect of picking every 5th or 50th running word on the ranked list would depend on burstiness patterns, which vary across a corpus (even for very frequent words) - indeed within the same text (document size and document boundaries affect these patterns, not just corpus size). Justin's suggestion of relating this to heterogeneity measures seems right. My guess would be that the effect for a non-burstily distributed word (e.g. "energy" in the Tipster DOE dataset) would be smaller than for a bursty one (e.g. the same "energy" in the SJM dataset in Tipster). The following paper gives some of these burstiness patterns:

Avik Sarkar, Anne DeRoeck & Paul H. Garthwaite. 2005. "Term re-occurrence measures for analyzing style". In Proceedings of the Workshop on Stylistic Analysis of Text for Information Access, held at the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), Bahia, Brazil.

Does that sound like a hypothesis that could be verified?
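One very rough way to start checking it: a small simulation, entirely synthetic (0/1 toy corpus, made-up burst length and counts; nothing here is real Tipster data), comparing the stability of a frequency estimate for a bursty versus a uniformly dispersed word when the corpus is thinned to every fifth 100-token chunk.

```python
import random
import statistics

def make_corpus(n_tokens, n_target, bursty, rng, burst_len=50):
    """0/1 corpus: 1 marks an occurrence of the target word."""
    corpus = [0] * n_tokens
    if bursty:
        # place the word in contiguous bursts at random positions
        for _ in range(n_target // burst_len):
            start = rng.randrange(n_tokens - burst_len)
            for i in range(start, start + burst_len):
                corpus[i] = 1
    else:
        # scatter occurrences uniformly at random
        for pos in rng.sample(range(n_tokens), n_target):
            corpus[pos] = 1
    return corpus

def every_kth_chunk_estimate(corpus, chunk=100, k=5):
    """Keep every k-th chunk of `chunk` tokens; return relative frequency."""
    kept = []
    for i in range(0, len(corpus), chunk * k):
        kept.extend(corpus[i:i + chunk])
    return sum(kept) / len(kept)

def estimate_sd(bursty, trials=100, n=20_000, target=400):
    rng = random.Random(0 if bursty else 1)
    ests = [every_kth_chunk_estimate(make_corpus(n, target, bursty, rng))
            for _ in range(trials)]
    return statistics.stdev(ests)

sd_bursty, sd_uniform = estimate_sd(True), estimate_sd(False)
print(f"sd of estimate, bursty word:  {sd_bursty:.5f}")
print(f"sd of estimate, uniform word: {sd_uniform:.5f}")
```

With these (arbitrary) parameters the bursty word's estimate varies several times more than the uniform word's. Note the sampling unit matters: thinning by single tokens (every 5th running token) slices through bursts and largely hides the effect, which is itself part of the point about document and chunk structure.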

Anne

> -----Original Message-----
> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] 
> On Behalf Of J Washtell
> Sent: 03 April 2009 12:05
> To: corpora at uib.no
> Subject: Re: [Corpora-List] Corpus size and accuracy of 
> frequency listings
> 
> Mark, Diana,
> 
> It seems to me that the question comes from a desire to 
> quantify the error in the word probabilities inferred from a 
> sample of the language. Or, conversely, to know how large 
> one's corpus has to be before one can have a good degree of 
> confidence that a ranked list of the 20,000 most frequent 
> words calculated from it is sufficiently accurate for one's needs.
> 
> It is therefore a straightforward question of statistical 
> significance, is it not?
> 
> Assuming it is, then the size of the corpus is the single 
> fundamental factor, not any characteristic of the frequency 
> distribution of the words. If we have a 10 million word 
> corpus and we observe word X 10,000 times, and word Y only 
> once, we have still made precisely 10 million observations 
> with respect to each word (some negative and some positive), 
> and so the dependability of both estimates is the same (i.e. 
> the variances of the observed word frequencies are some 
> constant proportion of the true word frequencies, depending 
> on the sample size). I might be wrong, but a little 
> Monte Carlo experiment in Excel seemed to confirm this.
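The Excel experiment itself isn't shown, but its likely shape is easy to redo in a few lines (sample size and word probabilities below are purely illustrative): draw many simulated corpora and check that the variance of a word's observed count is roughly proportional to its true frequency.

```python
import random
import statistics

def observed_count(n, p, rng):
    """Occurrences of a word with true probability p in n running tokens."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(42)
n = 20_000  # corpus size in tokens
ratios = {}
for p in (0.01, 0.001):  # true word probabilities
    counts = [observed_count(n, p, rng) for _ in range(300)]
    # binomial sampling: Var = n*p*(1-p), so variance/mean = 1-p (close
    # to 1 for rare words) -- i.e. variance proportional to frequency
    ratios[p] = statistics.variance(counts) / statistics.fmean(counts)
    print(f"p={p}: variance/mean of observed count = {ratios[p]:.2f}")
```

The variance-to-mean ratio comes out near 1 for both the frequent and the rare word, consistent with the claim above, though the *relative* error of the rare word's estimate is of course much larger.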
> 
> All this assumes your sample is truly random, of course, 
> which, given the heterogeneity of your average corpus, is 
> probably not true (or perhaps even meaningful to try to 
> achieve?). And knowing that it isn't doesn't help you very 
> much. Maybe you could try and make some estimate of the 
> heterogeneity of the language, and therefore the reliability 
> of this assumption, by looking at dispersion within the 
> corpus. It's not immediately clear how you'd make use of it 
> if you did.
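If one did want a concrete dispersion measure for that purpose, something like Gries's DP (deviation of proportions) is simple to compute over equal-sized corpus parts; a sketch with toy counts (the counts are invented, not from any corpus):

```python
def dp(counts_per_part):
    """Gries's DP over equal-sized corpus parts:
    0 = perfectly even dispersion; approaches 1 = maximally bursty."""
    total = sum(counts_per_part)
    n = len(counts_per_part)
    return 0.5 * sum(abs(c / total - 1 / n) for c in counts_per_part)

print(dp([10, 10, 10, 10]))  # 0.0  -- evenly dispersed
print(dp([40, 0, 0, 0]))     # 0.75 -- all occurrences in one of 4 parts
```

A word with high DP is one for which the random-sampling assumption is least trustworthy.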
> 
> It is also quite possible that I've missed the point entirely :-)
> 
> Justin Washtell
> University of Leeds
> 
> 
> 
> Quoting Diana Santos <Diana.Santos at sintef.no>:
> 
> > Dear Mark,
> > I don't think your question makes much sense -- possibly because 
> > you fail to explain what the purpose of your frequency lists is. 
> > I would expect that, depending on the goal, completely different 
> > answers would be possible and would make sense.
> >
> > Apparently, you are concerned with ranking (not the actual 
> > frequency numbers but the order in the list). Is this right?
> >
> > But what would be the purpose (or usefulness) of selecting every 
> > fifth occurrence of a word in a corpus? What linguistic function 
> > would that have? Certainly not a computational one (we are no 
> > longer in a time where we have to spare computing power on 
> > counting :-)
> >
> > I would strongly suggest that if you want to reduce your corpus 
> > to a fifth, you still keep utterances that make sense -- that is, 
> > keep every fifth sentence of your corpus, not every fifth word.
> >
> > Also, the notion of word is quite fluid -- to say the least. So 
> > if you are working with lemmata (?), is "in spite of" one word, 
> > or three? Is "Mark Davies" one word, or two? I suppose you first 
> > lemmatize your corpus, then select... but you may be aware that 
> > these kinds of decisions have an enormous impact. See Santos et 
> > al. (2003) for a detailed presentation of differences in 
> > tokenization (not even lemmatization!) between different groups 
> > in Morfolimpíadas (for Portuguese), together with quantitative 
> > data.
> >
> > In any case, depending on the reason why you want the frequency 
> > lists, I would suggest different ways to go about / model your 
> > problem. Can you be more specific?
> >
> > These references (for completely different purposes) may also 
> > help you:
> >
> > Katz, Slava M. 1996. "Distribution of content words and phrases 
> > in text and language modelling". Natural Language Engineering 2 
> > (1996), pp. 15-59.
> >
> > Berber Sardinha, Tony. 2000. "Comparing corpora with WordSmith 
> > Tools: How large must the reference corpus be?". In Adam 
> > Kilgarriff & Tony Berber Sardinha (eds.), Proceedings of the 
> > Workshop on Comparing Corpora, held in conjunction with the 38th 
> > Annual Meeting of the Association for Computational Linguistics, 
> > 7 October 2000, Hong Kong University of Science and Technology 
> > (HKUST), Hong Kong. 
> > http://acl.eldoc.ub.rug.nl/mirror/W/W00/W00-0902.pdf
> >
> > Evert, Stefan. 2006. "How random is a corpus? The library 
> > metaphor". Zeitschrift für Anglistik und Amerikanistik 54 (2), 
> > pp. 177-190. http://purl.org/stefan.evert/PUB/Evert2006.pdf
> >
> >
> > Santos, Diana, Luís Costa & Paulo Rocha. 2003. "Cooperatively 
> > evaluating Portuguese morphology". In Nuno J. Mamede, Jorge 
> > Baptista, Isabel Trancoso & Maria das Graças Volpe Nunes (eds.), 
> > Computational Processing of the Portuguese Language: 6th 
> > International Workshop, PROPOR 2003, Faro, Portugal, June 2003. 
> > Berlin/Heidelberg: Springer Verlag, pp. 259-266. 
> > http://www.linguateca.pt/Diana/download/SantosCostaRochaPROPOR2003.pdf
> >
> > For Portuguese, through the AC/DC project, we have very detailed 
> > frequency lists for 22 different corpora: for forms, for lemmata 
> > per PoS, and for lemmata irrespective of PoS. You may want to 
> > consult those as well to get inspiration for your hypotheses:
> >
> > www.linguateca.pt/ACDC/ -- choose "Frequência" in the left-hand 
> > menu.
> >
> > Hope to have been of some help,
> > Greetings,
> > Diana
> >
> >
> >> -----Original Message-----
> >> From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On 
> >> Behalf Of Mark Davies
> >> Sent: 2 April 2009 00:53
> >> To: corpora at uib.no
> >> Subject: [Corpora-List] Corpus size and accuracy of frequency 
> >> listings
> >>
> >> I'm looking for studies that have considered how corpus size 
> >> affects the accuracy of word frequency listings.
> >>
> >> For example, suppose that one uses a 100 million word corpus 
> >> and a good tagger/lemmatizer to generate a frequency listing of 
> >> the top 10,000 lemmas in that corpus. If one were to then take 
> >> just every fifth word or every fiftieth word in the running text 
> >> of the 100 million word corpus (thus creating a 20 million or a 
> >> 2 million word corpus), how much would this affect the top 
> >> 10,000 lemma list? Obviously it's a function of the size of the 
> >> frequency list as well -- things might not change much in terms 
> >> of the top 100 lemmas in going from a 20 million word to a 100 
> >> million word corpus, whereas they would change much more for a 
> >> 20,000 lemma list. But that's precisely the type of data I'm 
> >> looking for.
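Absent published numbers, the comparison asked about is also easy to set up mechanically. A sketch on a synthetic Zipf-like corpus (invented tokens, i.i.d. draws, so no burstiness -- a real run would use the lemmatized corpus): how much of the full-corpus top-n list survives when the corpus is thinned to every fifth running token.

```python
import random
from collections import Counter

def top_n(tokens, n):
    """Set of the n most frequent types in a token list."""
    return {w for w, _ in Counter(tokens).most_common(n)}

def retained(tokens, stride, n):
    """Share of the full-corpus top-n list still present in the top-n
    list of the corpus thinned to every stride-th running token."""
    return len(top_n(tokens, n) & top_n(tokens[::stride], n)) / n

# toy Zipf-like corpus: 200,000 tokens over 2,000 types
rng = random.Random(0)
vocab = [f"w{i}" for i in range(1, 2001)]
weights = [1 / i for i in range(1, 2001)]
corpus = rng.choices(vocab, weights=weights, k=200_000)

for n in (100, 1000):
    print(f"top-{n} retained at one-fifth size: {retained(corpus, 5, n):.2f}")
```

As expected, the short list is far more stable than the long one: churn concentrates around the rank cutoff, where subsample counts are small. Real corpora, being bursty and heterogeneous, should churn more than this i.i.d. toy does.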
> >>
> >> Thanks in advance,
> >>
> >> Mark Davies
> >>
> >> ============================================
> >> Mark Davies
> >> Professor of (Corpus) Linguistics
> >> Brigham Young University
> >> (phone) 801-422-9168 / (fax) 801-422-0906
> >> Web: davies-linguistics.byu.edu
> >>
> >> ** Corpus design and use // Linguistic databases **
> >> ** Historical linguistics // Language variation **
> >> ** English, Spanish, and Portuguese ** 
> >> ============================================
> >>
> >>


_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora


