[Corpora-List] Surprisingly large MI scores

Tue Sep 29 09:56:28 UTC 2009

Stefan,
> Are these really MI scores?  At least by default, the Sketch Engine
> calculates something different, which Adam calls a "salience score".

MI has never looked like a very good collocation statistic: it gives too
much prominence to rare items. We used to use a variant on it but it didn't
scale well so we changed a couple of years ago to a (scaled version of)
Dice, as Dice scored best in James Curran's extensive evaluation (in
relation to distributional thesauruses) in his PhD.
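
To make the point about rare items concrete, here is a minimal sketch
comparing bigram MI with logDice on two made-up pairs. It assumes the
commonly cited definition logDice = 14 + log2(2*f(x,y) / (f(x) + f(y)));
the word-sketch version also takes the grammatical relation into account,
which is not modelled here. The corpus size and all counts are
hypothetical.

```python
import math

def mi(n, f_xy, f_x, f_y):
    """Bigram MI: log2(N * f(x y) / (f(x) * f(y)))."""
    return math.log2(n * f_xy / (f_x * f_y))

def log_dice(f_xy, f_x, f_y):
    """logDice as commonly defined: 14 + log2(2 * f(x y) / (f(x) + f(y)))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

N = 100_000_000  # hypothetical corpus size in tokens

# Hypothetical counts: a pair seen twice vs. a well-attested pair.
rare = {"f_xy": 2, "f_x": 4, "f_y": 1_000}
common = {"f_xy": 5_000, "f_x": 20_000, "f_y": 50_000}

mi_rare, mi_common = mi(N, **rare), mi(N, **common)
ld_rare, ld_common = log_dice(**rare), log_dice(**common)

print(f"MI:      rare={mi_rare:.2f}  common={mi_common:.2f}")
print(f"logDice: rare={ld_rare:.2f}  common={ld_common:.2f}")
```

With these counts MI ranks the pair seen twice above the pair seen 5,000
times, while logDice ranks them the other way round. Note also that
logDice does not depend on N, so scores stay comparable across corpora of
different sizes.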

For bigrams (tho not for word sketches) we do offer MI as an option: go to
any concordance and click on the 'collocation' button.
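
For reference, the three MI variants quoted further down in this message
can be sketched as follows. This is only an illustration of the formulas
as written in the thread, not of any tool's actual implementation; the
function names and the counts are made up. The division by log(2) in the
BYU formula is just a change of base, so log2 is used directly.

```python
import math
from typing import Sequence

def mi_ngram(n: int, f_ngram: int, f_words: Sequence[int]) -> float:
    """n-gram MI: log2(N^(s-1) * f(x_1..x_s) / (f(x_1) * ... * f(x_s)))."""
    s = len(f_words)
    return math.log2(n ** (s - 1) * f_ngram / math.prod(f_words))

def mi_span(n: int, f_xy: int, f_x: int, f_y: int, span: int) -> float:
    """Span-based approximation: log2(N * f(x,y) / (f(x) * f(y) * S))."""
    return math.log2(n * f_xy / (f_x * f_y * span))

def mi_bigram(n: int, f_xy: int, f_x: int, f_y: int) -> float:
    """Adjacent-bigram MI: log2(N * f(x y) / (f(x) * f(y)))."""
    return math.log2(n * f_xy / (f_x * f_y))

# Hypothetical counts, for illustration only.
N, F_XY, F_X, F_Y = 100_000_000, 500, 20_000, 40_000

bigram = mi_bigram(N, F_XY, F_X, F_Y)
as_ngram = mi_ngram(N, F_XY, [F_X, F_Y])   # s = 2
as_span1 = mi_span(N, F_XY, F_X, F_Y, 1)   # S = 1
as_span8 = mi_span(N, F_XY, F_X, F_Y, 8)   # S = 8, same counts

print(bigram, as_ngram, as_span1, as_span8)
```

With s = 2 and S = 1 all three give the same number. For fixed counts,
going from S = 1 to S = 8 lowers the score by log2(8) = 3; in practice
f(x,y) also grows with the span, so the net effect depends on the data.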

Stats used in word sketches are not so transparent because we also consider
the frequency of the grammatical relation, but all is explained in the help
pages; see "Statistics used in the Sketch Engine"
<http://trac.sketchengine.co.uk/attachment/wiki/SkE/DocsIndex/ske-stat.pdf?format=raw>
at http://trac.sketchengine.co.uk/wiki/SkE/DocsIndex

I do think this question is much overrated and overresearched.  (I've
reviewed about a hundred papers on comparing stats and they're all
inconclusive.) Several things matter more than the stat you choose: above
all, how good your corpus is, in terms of size, composition and cleanliness.
 If you want better collocation lists, put some effort into
finding/building/improving the corpus instead of fussing about stats.

Even if the corpus is held constant, grammatical analysis will help you more
than choice of stat.   Stefan Wermter and Udo Hahn show this clearly in
"Collocation Extraction Based on Modifiability Statistics", COLING 2004,
http://acl.ldc.upenn.edu/C/C04/C04-1141.pdf.  If the grammatical analysis is
right, you don't need any stats: you can list the good collocations by
simply using raw frequency. The point of the stats was to rule out common
grammar words like (in English) "a" "the" "is" that otherwise always turn up
everywhere.  But any grammatical analysis will also rule them out.  This
corresponds to requests from professional lexicographers at OUP, Inst of
Dutch Lexicology, Slovene Lexical Database project and elsewhere: they want
to see the (grammatically constrained) collocations listed according to raw
frequency as well as 'salience' (eg logDice, see above).  In the Sketch
Engine, word sketches can be sorted according to raw frequency or logDice.
(See options on word sketch page)
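
A toy illustration of why raw frequency is enough once the grammar is
taken care of. The relation labels, the headword "tea" and all counts are
invented for the example; a real system would get the (relation,
collocate) pairs from a parser or a sketch grammar.

```python
from collections import Counter

# Hypothetical (relation, collocate) observations for the headword "tea".
observations = (
    [("det", "the")] * 900
    + [("det", "a")] * 400
    + [("object_of", "drink")] * 120
    + [("modifier", "green")] * 95
    + [("object_of", "make")] * 80
    + [("modifier", "herbal")] * 60
)

def surface_ranking(obs):
    """Rank collocates by raw frequency, ignoring the grammatical relation."""
    return Counter(word for _, word in obs).most_common()

def relation_ranking(obs, relation):
    """Rank collocates by raw frequency within one grammatical relation."""
    return Counter(word for rel, word in obs if rel == relation).most_common()

surface = surface_ranking(observations)
modifiers = relation_ranking(observations, "modifier")
objects = relation_ranking(observations, "object_of")

print("surface:  ", surface[:3])
print("modifier: ", modifiers)
print("object_of:", objects)
```

Without the relation, "the" and "a" head the list; within the "modifier"
or "object_of" relation they never appear, and plain frequency already
gives a usable ordering.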

Of course, evaluating statistics is a nice neat task but as I receive
another paper about it to review, I do sometimes think of Ralph Waldo
Emerson's 'a foolish consistency is the hobgoblin of little minds.'

Adam

2009/9/29 Stefan Evert <stefan.evert at uos.de>

>
>  Michael B. gave the MI formula from COLLOCATES as:
>>
>> MI = log2 ( ( N^(s-1) * f (x y) ) / ( f (x) * f (y) ) )
>>
>
> Not really.  What Michael gave was (shown here in a more explicit notation)
>
>        MI = log2 ( ( N^(s-1) * f (x1 x2 ... x_s) ) / ( f (x_1) * f (x_2) *
> ... * f(x_s) ) )
>
> which is the (AFAIC correct) MI equation for n-grams (where n = s), i.e.
> for a combination of s consecutive words.
>
>  I use (http://corpus.byu.edu):
>>
>> MI = log10 ( ( N * f (x,y) ) / ( f (x) * f (y) * S ) ) / log(2)
>> (divide by log(2), since LOG in SQL Server is base 10)
>>
>> where N = corpus size and S = span size.
>>
>
> This is a reasonable approximation of MI scores for surface collocations
> with a span size of S, i.e. combinations of _two_ lexemes which co-occur
> within a distance of at most S words.
>
>  Brett R. gives:
>>
>> MI = log2 ( ( N * f (x y) ) / ( f (x) * f (y) ) )       ( where is the
>> span ?)
>>
>
> This is the MI score for adjacent bigrams, which is compatible with both
> formulas above: in Michael's version, you have to set s=2 (for a bigram), in
> Mark's version, you have a window size of S=1 (for a 0L/1R span).
>
>  This is apparently the same or quite similar to what is used for BNCweb.
>>
>
> Yes, we use the correct mathematical model for surface collocations, which
> puts much more strain on the SQL database, but is usually close to your
> approximation.
>
>  One other question, I guess, is why Sketch Engine gives scores that are
>> 40-50% off what is going on with BNCweb and BYU-BNC. I'm not saying that one
>> is wrong and the other is right, but it's a bit disconcerting that the
>> scores are not more similar. Maybe everyone could "cough up" their formulas,
>> and we could see what's going on.
>>
>
> Are these really MI scores?  At least by default, the Sketch Engine
> calculates something different, which Adam calls a "salience score".
>
> Cheers,
> Stefan
>
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>

-- 
================================================
Adam Kilgarriff
http://www.kilgarriff.co.uk
Lexical Computing Ltd                   http://www.sketchengine.co.uk
Lexicography MasterClass Ltd      http://www.lexmasterclass.com
Universities of Leeds and Sussex       adam at lexmasterclass.com
================================================