[Corpora-List] Surprisingly large MI scores
Nick Ellis
ncellis at umich.edu
Thu Sep 10 18:01:09 UTC 2009
1) MI2 or MI10?
The MI for a 2-gram (a b) is calculated as
log2( p(a,b) / (p(a) * p(b)) )
The original application of MI to collocation by Church and Hanks used
log2; see also Oakes's Statistics for Corpus Linguistics.
AntConc calculates MI in this way. Laurence Anthony, in the AntConc
readme file, says that for MI he uses “equations described in M.
Stubbs, Collocations and Semantic Profiles, Functions of Language 2, 1
(1995)”. In that article, Mike Stubbs refers to Church and Hanks
(1990). See also http://ell.phil.tu-chemnitz.de/analysis/collocations.html
We believe that Collocate, the package we used in the Ellis & Simpson-
Vlach research, uses log2 too. Although the Collocate manual does not
describe the formula it uses, it generates the same MI values for
bigrams as AntConc (give or take a bit, probably depending on
definitions of what counts as a word, etc.).
The extension to a 3-gram (a b c) is
log2( p(a,b,c) / (p(a) * p(b) * p(c)) )
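For concreteness, here is a minimal sketch of these two calculations in
Python; the counts and corpus size are purely illustrative, not taken
from any of the corpora discussed here:

import math

def mi(ngram_count, word_counts, corpus_size, base=2):
    # MI = log_base( p(w1..wn) / (p(w1) * ... * p(wn)) )
    p_ngram = ngram_count / corpus_size
    p_independent = 1.0
    for count in word_counts:
        p_independent *= count / corpus_size
    return math.log(p_ngram / p_independent, base)

corpus_size = 1000000
# Illustrative counts for a 2-gram (a b) and a 3-gram (a b c)
print(mi(150, [2000, 3000], corpus_size))            # MI2 for (a b)
print(mi(40, [2000, 3000, 5000], corpus_size))       # MI2 for (a b c)
print(mi(150, [2000, 3000], corpus_size, base=10))   # the same 2-gram as MI10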
In our written academic corpus (Ellis & Simpson-Vlach, 2009), which
consisted of Hyland’s (2004) research article corpus (1.2 million
words) plus selected BNC files (931,000 words), our analyses generated
the following MIs for six example n-grams, shown in the first column
of data. Mark Davies’s values from the BNC (from his latest e-mail to
the list on this issue) are shown in the second column, and our values
recalculated with log10 in the third.
n-gram                 our corpus MI (log2)   Davies BNC MI   ours using log10
the content of          5.28                   2.75            1.59
is one of the           7.72                   2.18            2.32
a kind of               7.02                   3.52            2.11
the extent to which    14.81                   2.18            4.46
in other words         12.01                   4.39            3.61
a great deal of        20.39                   2.94            6.14
We believe that Mark is using log10 in his calculations. If we do the
same, we get the MI values shown in the final column. Allowing for the
different samples, we are in the same ball-park.
Can you confirm that your interfaces (corpus.byu.edu/bnc/) produce MI
calculated as log10, Mark?
If some of us are using log2 and others log10, there is no problem of
comparability within a study; across studies we need simply apply the
scaling factor of 3.3219 (= log2(10)). But there is scope for error if
we are not clear about our units (remember the Mars Orbiter).
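To make the conversion concrete, here is a one-line sketch in Python,
using the first row of the table above:

LOG2_OF_10 = 3.3219   # log2(10): multiply an MI10 value by this to get MI2
mi10 = 1.59           # "the content of" in our corpus, log10 version
mi2 = mi10 * LOG2_OF_10
print(round(mi2, 2))  # 5.28, matching the log2 value in the table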
We should be explicit in our reports. Perhaps there is reason to
standardly report either as MI2 or MI10.
2) MI is sensitive to n-gram length
MI is sensitive to the length of the string: longer formulaic
sequences are rarer (see Newell, 1990, on this). As I said in an
earlier reply to Brett Reynolds, we (Ellis, O'Donnell, Römer, Gries, &
Wulff, 2009) calculated MI for all 2-9 grams occurring 12+ times in
the whole of BNCBaby; for each N we found the median MI, resulting in:
N Median MI
2 2.234
3 6.723
4 13.085
5 20.835
6 38.925
7 53.612
8 69.046
9 79.962
We did this by formula, not using Collocate, but again log2. The
slides from this talk can be found on our Michigan Corpus Linguistics
site:
http://ctr.elicorpora.info/formulaic-language-project
Matt O’Donnell has now repeated these analyses, just to be sure, and
obtained similar results:
N MEDIAN
2 2.26475918
3 6.783017633
4 13.17090969
5 20.95527274
6 39.06738859
7 53.78959166
8 69.24835305
9 80.19228017
Word and 2-9 gram lists were generated in WordSmith with a frequency
threshold of 12+ (i.e. 3 per million), and then a Python script
calculated MI using the log2 formula.
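Roughly, such a script amounts to the following minimal sketch; the
file names, the tab-separated format, and the corpus size used here are
illustrative assumptions, not the actual WordSmith output format or the
exact BNCBaby token count:

import math
from collections import defaultdict
from statistics import median

CORPUS_SIZE = 4000000  # illustrative: the figure implied by 12+ tokens = 3 per million

def load_frequencies(path):
    # Each line: item<TAB>frequency (illustrative format)
    freqs = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            item, count = line.rstrip("\n").split("\t")
            freqs[item] = int(count)
    return freqs

word_freq = load_frequencies("wordlist.txt")    # single-word frequencies
ngram_freq = load_frequencies("ngramlist.txt")  # 2-9 gram frequencies, threshold 12+

mi_by_length = defaultdict(list)
for ngram, count in ngram_freq.items():
    words = ngram.split()
    if any(w not in word_freq for w in words):
        continue  # skip n-grams with a component word missing from the word list
    p_ngram = count / CORPUS_SIZE
    p_independent = 1.0
    for w in words:
        p_independent *= word_freq[w] / CORPUS_SIZE
    mi_by_length[len(words)].append(math.log2(p_ngram / p_independent))

for n in sorted(mi_by_length):
    print(n, round(median(mi_by_length[n]), 3))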
The marginal mean MIs (low = 3.3, medium = 6.7, high = 11) in Ellis &
Simpson-Vlach's Table 1 average over strings of length n = 3, 4, 5, so
they are greater than one might expect for bigrams. We stratified
within each length; we did not use these values as overall thresholds.
Thanks to Matt O’Donnell and Ute Römer for working this through with
me.
Hope it clarifies,
Nick Ellis
Ellis, N. C., O'Donnell, M. B., Römer, U., Gries, S. T., & Wulff, S.
(2009). Measuring the formulaicity of language. Paper presented at
AAAL 2009, the annual conference of the American Association of
Applied Linguistics, Denver, CO, March 21-24, 2009.
Ellis, N. C. & Simpson-Vlach, R. (2009). Formulaic language in native
speakers: Triangulating psycholinguistics, corpus linguistics, and
education. Corpus Linguistics and Linguistic Theory, 5, 61-78.
Newell, A. (1990). Unified theories of cognition. Cambridge, MA:
Harvard University Press.
Nick Ellis
Professor of Psychology
Research Scientist, English Language Institute
University of Michigan
Room 1011,
500 East Washington Street
Ann Arbor
MI 48104
USA
e-mail: ncellis at umich.edu
home page: Ellis
work phone: 734-647-0454
work fax : 734-763-0369
On Sep 5, 2009, at 8:48 AM, Brett Reynolds wrote:
> "Formulaic language in native speakers: Triangulating
> psycholinguistics, corpus linguistics, and education" by Nick C.
> Ellis and Rita Simpson-Vlach was recently published in _Corpus
> Linguistics and Linguistic Theory_
> <http://www.reference-global.com/doi/abs/10.1515/CLLT.2009.003>.
>
> Therein is a table of n-grams in three columns: low, medium, and
> high MI scores. I'm going from memory, but the authors consider
> roughly MI=3 as low, MI=6 as medium, and MI=12 as high.
>
> I have only a very rudimentary understanding of MI scores, but my
> understanding is that an MI of 3 indicates a strong collocation, so
> I wrote to Nick and asked him about it. He wrote back that:
>
> -They used Mike Barlow's Collocate.
> -"MI is very sensitive to length of n-gram.
>
> For example, calculating MI for all 2-9 grams in the whole of
> BNCBaby occurring 12+ times, for each N we found the median MI,
> resulting in:
> N Median MI
> 2 2.234
> 3 6.723
> 4 13.085
> 5 20.835
> 6 38.925
> 7 53.612
> 8 69.046
> 9 79.962"
>
> I've never seen MI scores of that size. Moreover, when I looked at
> some of the n-grams that appear in the paper using Mark Davies' COCA
> and BNC interfaces, I came up with much lower numbers. Here are some
> examples: the first is the MI in the entire corpus, and the second
> is the MI in the academic subcorpus.
>
> BNC
> the content of 2.99 0.34
> is one of the 2.41 -0.24
> a kind of 4.06 1.41
> the extent to which 2.41 -0.24
> in other words 4.71 2.05
> a great deal of 3.47 0.82
>
> COCA
> the content of 3.24 0.90
> is one of the 2.66 0.31
> a kind of 4.31 1.97
> the extent to which 2.66 0.31
> in other words 4.83 2.49
> a great deal of 3.73 1.38
>
> Again, the numbers in the paper are often four times those above.
> Can anybody help me understand this discrepancy?
>
> Best,
> Brett
>
> <http://english-jack.blogspot.com>
>
> -----------------------
> Brett Reynolds
> English Language Centre
> Humber College Institute of Technology and Advanced Learning
> Toronto, Ontario, Canada
> brett.reynolds at humber.ca
>
Brett,
I get quite different scores for the Academic-only queries, and these
are much more in line with what one would expect.
BNC (corpus.byu.edu/bnc)
the content of 2.75 (vs. your 0.34)
is one of the 2.18 (vs. your -0.24)
a kind of 3.52
the extent to which 2.18
in other words 4.39
a great deal of 2.94
Corpus of Contemporary American English (www.americancorpus.org)
the content of 2.94
is one of the 2.35
a kind of 3.69
the extent to which 2.35
in other words 4.45
a great deal of 3.11
Also, the MI scores from the BYU-BNC agree quite nicely with the MI
from the BNC via Sketch Engine and BNCweb. For example, for [ *
havoc ], BYU-BNC gives 16.9 for [wreak], Sketch Engine gives 17.0, and
BNCweb gives 17.1. So apparently they are all using the same MI
formula correctly. (BTW, the calculated corpus size might account for
the very small differences, since the number of "words" in the BNC
differs slightly depending on what counts as a "word").
As you've mentioned, these MI scores are much, much lower than what
Ellis et al have found. Even with a very highly idiomatic phrase like
"run amok" or "wreak havoc", MI scores are almost never above 16-17 --
certainly not up in the 60-80 range.
Feel free to email me if you need help with these.
Mark
============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906