[Corpora-List] Surprisingly large MI scores

Tue Sep 29 13:32:20 UTC 2009

> Yes, these are MI. In Sketch Engine you can see the following columns:
>
> Freq
> T-score
> MI  <<<
> MI3
> log likelihood
> min. sensitivity
> salience
>
> See the bottom image at: http://corpus.byu.edu/collocates.asp

Right, sorry, I had overlooked your screenshots.  Comparing these
examples to the BNCweb data, the difference could very well be due to
a missing correction factor, which would make the Sketch Engine scores
higher by about log2(6) = 2.585 (plus smaller differences in joint and
marginal frequencies as well as sample size).
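
To make the arithmetic concrete, here is a minimal sketch of how a
constant factor in the expected frequency shifts MI by a fixed amount.
It assumes the factor of 6 is a collocation span size that multiplies
the expected frequency; that is only my reading of the "correction
factor", and the counts and the function name mi() are made up for
illustration.

```python
import math


def mi(joint, f1, f2, n, span=1):
    """Pointwise MI as log2(observed / expected).

    `span` is a hypothetical correction factor: the expected frequency
    is multiplied by it (assumed here to be the collocation window size).
    """
    expected = span * f1 * f2 / n
    return math.log2(joint / expected)


# Hypothetical counts, chosen only to illustrate the shift.
joint, f1, f2, n = 50, 2000, 3000, 100_000_000

uncorrected = mi(joint, f1, f2, n)         # no correction factor
corrected = mi(joint, f1, f2, n, span=6)   # expected frequency scaled by 6

# The gap does not depend on the counts: it is always log2(span).
print(round(uncorrected - corrected, 3))   # 2.585
```

Whatever counts you plug in, the two scores differ by exactly
log2(6), which is why a missing factor would show up as a roughly
constant offset between the two tools.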

Let me just add a short comment on Adam's remarks:

> Even if the corpus is held constant, grammatical analysis will help  
> you more than choice of stat.   Stefan Wermter and Udo Hahn show  
> this clearly in "Collocation Extraction Based on Modifiability
> Statistics", COLING 2004,
> http://acl.ldc.upenn.edu/C/C04/C04-1141.pdf.  If the grammatical
> analysis is right, you don't need any stats: you can list the good
> collocations by simply using raw
> frequency. The point of the stats was to rule out common grammar  
> words like (in English) "a" "the" "is" that otherwise always turn up  
> everywhere.  But any grammatical analysis will also rule them out.   
> This corresponds to requests from professional lexicographers at  
> OUP, Inst of Dutch Lexicology, Slovene Lexical Database project and  
> elsewhere: they want to see the (grammatically constrained)  
> collocations listed according to raw frequency as well as  
> 'salience' (e.g. logDice, see above).  In the Sketch Engine, word
> sketches can be sorted according to raw frequency or logDice. (See  
> options on word sketch page)

But in all the evaluation studies that I've been involved in, which
used candidate data with a specific syntactic relation, suitable
association measures performed significantly better than frequency
sorting!  Our rationale for the statistical measures was never just to
weed out function words or low-frequency chance co-occurrences.

The difference between frequency ranking and the best association  
measure may not always be large enough to be relevant for a commercial  
lexicographer, of course, but that's an entirely different matter.
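
For readers who have not seen this kind of evaluation, a rough sketch
of an n-best precision comparison follows.  Everything in it is an
assumption for illustration: the candidate pairs, their counts, the
gold labels and the sample size are invented, and t-score merely
stands in for "a suitable association measure".  The toy data were
constructed to show the mechanics only; they are not evidence for
either ranking.

```python
import math
from collections import namedtuple

# Hypothetical adjective+noun candidates with made-up counts and labels.
Candidate = namedtuple("Candidate", "pair joint f1 f2 gold")

N = 1_000_000  # hypothetical number of pair tokens in the sample

candidates = [
    Candidate("good thing", 150, 30000, 8000, False),
    Candidate("new car", 100, 20000, 4000, False),
    Candidate("big house", 80, 15000, 5000, False),
    Candidate("heavy rain", 60, 2000, 800, True),
    Candidate("strong tea", 30, 3000, 500, True),
    Candidate("bitter pill", 12, 400, 100, True),
]


def t_score(c, n=N):
    """t-score: (observed - expected) / sqrt(observed)."""
    expected = c.f1 * c.f2 / n
    return (c.joint - expected) / math.sqrt(c.joint)


def precision_at(ranked, k):
    """Share of gold-standard collocations among the k best candidates."""
    return sum(c.gold for c in ranked[:k]) / k


by_freq = sorted(candidates, key=lambda c: c.joint, reverse=True)
by_t = sorted(candidates, key=t_score, reverse=True)

print("frequency ranking:", precision_at(by_freq, 3))
print("t-score ranking:  ", precision_at(by_t, 3))
```

In a real study the gold labels come from manual annotation of the
candidate set, and precision is usually plotted over a range of n
rather than reported for a single cut-off.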

Best wishes,
Stefan
