[Corpora-List] Surprisingly large MI scores
Stefan Evert
stefan.evert at uos.de
Tue Sep 29 13:32:20 UTC 2009
> Yes, these are MI. In Sketch Engine you can see the following columns:
>
> Freq
> T-score
> MI <<<
> MI3
> log likelihood
> min. sensitivity
> salience
>
> See the bottom image at: http://corpus.byu.edu/collocates.asp
Right, sorry, I had overlooked your screenshots. Comparing these
examples to the BNCweb data, the difference could very well be due to
a missing correction factor, which would make the Sketch Engine scores
higher by about log2(6) = 2.585 (plus smaller differences due to joint
and marginal frequencies as well as sample size).
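As a rough illustration of the arithmetic: if MI is computed as
log2(O/E) and one tool multiplies the expected frequency E by a factor
of 6 while the other does not, the two scores differ by exactly
log2(6), whatever the counts are. The sketch below assumes that
standard MI definition; the counts and the `factor` parameter are
hypothetical and are not taken from either tool.

```python
import math


def mi(o, f1, f2, n, factor=1.0):
    """Pointwise MI as log2(O / E), with E = factor * f1 * f2 / N.

    `factor` is a hypothetical stand-in for the correction factor
    discussed above; factor=1.0 means no correction is applied.
    """
    expected = factor * f1 * f2 / n
    return math.log2(o / expected)


# Hypothetical counts, for illustration only:
# joint frequency, two marginal frequencies, sample size.
o, f1, f2, n = 120, 5000, 3000, 100_000_000

uncorrected = mi(o, f1, f2, n)              # no correction factor
corrected = mi(o, f1, f2, n, factor=6.0)    # expected frequency scaled by 6

difference = uncorrected - corrected
print(round(difference, 3))  # same value as round(math.log2(6), 3)
```

Because the factor only rescales E, the gap between the two scores is
constant across word pairs. Any remaining variation would have to come
from differences in the counts themselves.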
Let me just add a short comment on Adam's remarks:
> Even if the corpus is held constant, grammatical analysis will help
> you more than choice of stat. Stefan Wermter and Udo Hahn show
> this clearly in "Collocation Extraction Based on Modifiability
> Statistics", COLING 2004,
> http://acl.ldc.upenn.edu/C/C04/C04-1141.pdf. If the grammatical
> analysis is right, you don't need
> any stats: you can list the good collocations by simply using raw
> frequency. The point of the stats was to rule out common grammar
> words like (in English) "a" "the" "is" that otherwise always turn up
> everywhere. But any grammatical analysis will also rule them out.
> This corresponds to requests from professional lexicographers at
> OUP, Inst of Dutch Lexicology, Slovene Lexical Database project and
> elsewhere: they want to see the (grammatically constrained)
> collocations listed according to raw frequency as well as
> 'salience' (eg logDice, see above). In the Sketch Engine, word
> sketches can be sorted according to raw frequency or logDice. (See
> options on word sketch page)
But in all the evaluation studies that I've been involved in, which
used candidate data with a specific syntactic relation, suitable
association measures performed significantly better than frequency
sorting! Our rationale for the statistical measures was never just to
weed out function words or low-frequency chance co-occurrences.
The difference between frequency ranking and the best association
measure may not always be large enough to be relevant for a commercial
lexicographer, of course, but that's an entirely different matter.
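To make the contrast concrete, here is a minimal sketch of how
frequency sorting and an association measure can order the same
candidate list differently. It uses logDice (the measure mentioned in
the quoted message), computed as 14 + log2(2*f_xy / (f_x + f_y)). The
candidate pairs and all counts are invented for illustration, and the
sketch says nothing about which ranking would score better in an
actual evaluation.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical adjective-noun candidates: (pair, f_xy, f_x, f_y).
# All counts are made up for this example.
candidates = [
    ("good idea", 900, 80000, 30000),
    ("new way", 700, 120000, 90000),
    ("vested interest", 150, 400, 25000),
    ("torrential rain", 90, 120, 6000),
]

# Ranking by raw joint frequency, highest first.
by_freq = [c[0] for c in sorted(candidates, key=lambda c: c[1], reverse=True)]

# Ranking by logDice score, highest first.
by_logdice = [
    c[0]
    for c in sorted(candidates, key=lambda c: log_dice(*c[1:]), reverse=True)
]

print("frequency:", by_freq)
print("logDice:  ", by_logdice)
```

With these invented counts, the pair with the lowest joint frequency
moves to the top of the logDice ranking, because its component words
rarely occur outside the pair.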
Best wishes,
Stefan