Corpora: type/token ratio

Wed Jan 16 13:04:18 UTC 2002

Dear List members,

I'm working on recurrent sequences of words in learner and native speaker 
writing (NS corpus: 106,112 words, NNS corpus: 100,575) and have a question 
regarding the use of the type/token ratio to measure word combination 
variation. As the 'standard' type/token ratio is not reliable when 
comparing corpora of different sizes, I have used the log type/token ratio 
as it is thought to remain constant for samples of different sizes (Herdan? 
1960: 26).
I have a niggling worry ... I calculated both the 'standard' type/token 
ratio and the log type/token ratio (for NS and learner 2-, 3-, 4- and 
5-word combinations) and found that the results for 5-word combinations 
didn't go in the same 'direction' (see below). Should I trust the log 
type/token ratio? Any help or suggestions would be welcome.

Results for 5-word combinations:
NS types: 46
NS tokens: 161

NNS types: 79
NNS tokens: 289

NS standard type/token ratio: 0.285714
NS log type/token ratio: 0.753461
NNS standard type/token ratio: 0.273356
NNS log type/token ratio: 0.771111
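
For anyone who wants to check the figures above, here is a minimal sketch of the two calculations. It assumes the 'standard' ratio is types divided by tokens and the log ratio is log(types) divided by log(tokens); the function and variable names are only illustrative and are not taken from any particular tool.

```python
import math


def standard_ttr(types: int, tokens: int) -> float:
    """'Standard' type/token ratio: types divided by tokens."""
    return types / tokens


def log_ttr(types: int, tokens: int) -> float:
    """Log type/token ratio: log(types) divided by log(tokens).

    The base of the logarithm cancels out, so natural log is used here.
    """
    return math.log(types) / math.log(tokens)


# 5-word combination counts reported above, as (types, tokens).
counts = {"NS": (46, 161), "NNS": (79, 289)}

# Map each corpus to (standard ratio, log ratio).
results = {
    name: (standard_ttr(types, tokens), log_ttr(types, tokens))
    for name, (types, tokens) in counts.items()
}

for name, (ttr, lttr) in results.items():
    print(f"{name}: standard = {ttr:.6f}, log = {lttr:.6f}")
```

Run on the counts above, this should give the same four values to six decimal places, with the NS value higher on the standard ratio and the NNS value higher on the log ratio.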

Thank you very much in advance.
Best wishes

Sylvie De Cock
Université catholique de Louvain
Collège Erasme
1, Place Blaise Pascal
B-1348 Louvain-la-Neuve
BELGIUM