Corpora: Log Likelihood Ratio and multi-word units

Cameron Smart ecsmart at polyu.edu.hk
Thu Dec 20 01:10:14 UTC 2001


Apologies if this question either betrays a fundamental misunderstanding on
my part or is old hat.

If one is employing the log likelihood ratio (or, similarly, chi-square) to
establish a significant difference in the use of a certain word in two corpora,
then as far as I understand it is calculated from a contingency table built,
for each corpus, from the frequency of the word, the frequency of all other
words, and the total number of words in the corpus.
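
To make the single-word case concrete, here is a minimal Python sketch of the
calculation as I understand it. The function name and the counts in the usage
note are hypothetical, and the sketch assumes a plain 2x2 table (word vs. all
other words, corpus 1 vs. corpus 2) with "frequency of other words" taken as
the corpus total minus the word's frequency.

```python
import math

def log_likelihood_ratio(freq1, total1, freq2, total2):
    """Log likelihood ratio (G2) for one word in two corpora.

    freq1, freq2   -- frequency of the word in corpus 1 and corpus 2
    total1, total2 -- total number of words in corpus 1 and corpus 2
    Assumption: "other words" = total minus the word's frequency.
    """
    # Observed 2x2 table: rows are corpora, columns are (word, other words).
    observed = [
        [freq1, total1 - freq1],
        [freq2, total2 - freq2],
    ]
    row_totals = [total1, total2]
    col_totals = [freq1 + freq2, (total1 - freq1) + (total2 - freq2)]
    grand_total = total1 + total2

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                # Expected count if the word were used equally often in both.
                e = row_totals[i] * col_totals[j] / grand_total
                g2 += o * math.log(o / e)
    return 2.0 * g2
```

For example, log_likelihood_ratio(50, 1000, 10, 1000) would be compared
against the chi-square critical value for one degree of freedom (3.84 at
p < 0.05). When the word has the same relative frequency in both corpora,
the statistic is zero.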

However, how is this applied if we want to establish a significant
difference in the use of a multi-word unit (such as a two-word prepositional
phrase) in two corpora? The frequency of the multi-word unit is easy enough,
but what does "frequency of other words" become? Indeed, can the log
likelihood ratio be used in this case? If not, what alternatives are there?


Thanks in advance for any comments.

Cameron Smart
Hong Kong Polytechnic University


