Corpora: type/token ratio
De Cock Sylvie
decock at lige.ucl.ac.be
Wed Jan 16 13:04:18 UTC 2002
Dear List members,
I'm working on recurrent sequences of words in learner and native speaker
writing (NS corpus: 106,112 words, NNS corpus: 100,575) and have a question
regarding the use of the type/token ratio to measure word combination
variation. As the 'standard' type/token ratio is not reliable when
comparing corpora of different sizes, I have used the log type/token ratio
as it is thought to remain constant for samples of different sizes (Herdan?
1960: 26).
I have a niggling worry ... I calculated both the 'standard' type/token
ratio and the log type/token ratio (for NS and learner 2-, 3-, 4- and
5-word combinations) and found that the results for 5-word combinations
didn't go in the same 'direction' (see below). Should I trust the log
type/token ratio? Any help or suggestions would be welcome.
Results for 5-word combinations:
NS types: 46
NS tokens: 161
NNS types: 79
NNS tokens: 289
NS standard type/token ratio: 0.285714
NS log type/token ratio: 0.753461
NNS standard type/token ratio: 0.273356
NNS log type/token ratio: 0.771111
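
The figures above can be reproduced with the short sketch below. It assumes that the 'standard' ratio is types/tokens and that the log ratio is log(types)/log(tokens); the message does not spell out either formula, but both readings match the posted values to within the last decimal place. The function names `standard_ttr` and `log_ttr` are illustrative only.

```python
import math


def standard_ttr(types: int, tokens: int) -> float:
    """Plain type/token ratio (assumed definition: types / tokens)."""
    return types / tokens


def log_ttr(types: int, tokens: int) -> float:
    """Log type/token ratio (assumed definition: log(types) / log(tokens)).

    The base of the logarithm cancels out, so natural log is used here.
    """
    return math.log(types) / math.log(tokens)


# Type and token counts for 5-word combinations, as given in the message.
counts = {"NS": (46, 161), "NNS": (79, 289)}

for label, (types, tokens) in counts.items():
    print(
        f"{label}: standard = {standard_ttr(types, tokens):.6f}, "
        f"log = {log_ttr(types, tokens):.6f}"
    )
```

Under these assumptions the standard ratio comes out higher for NS than for NNS, while the log ratio comes out higher for NNS than for NS, which is the difference in 'direction' described above.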
Thank you very much in advance.
Best wishes
Sylvie De Cock
Université catholique de Louvain
Collège Erasme
1, Place Blaise Pascal
B-1348 Louvain-la-Neuve
BELGIUM