[Corpora-List] Questions about collocations and collocation extraction tools

Tue Aug 1 21:15:20 UTC 2006

Dear all,

I am a student at Strathclyde University, Graduate School of 
Informatics, and I am working on a dissertation project titled "Using 
collocation frequencies in determining the relative reading complexity 
of texts". A core part of my project is extracting collocations from a 
corpus, in this case the BNC Baby. I have some questions regarding 
collocations and I would be more than grateful if you could share your 
expertise.

  1. I wish to compare software tools that can be used for collocation 
extraction. I wanted to include QWICK and TACT, but I haven't been able 
to locate them on the Internet. Are they still publicly available? If 
not, is there another way to obtain them?

  2. I've found that the de facto standard for measuring the statistical 
association between words, in order to discover collocations, is the 
log-likelihood. Do you agree with that? Can the log-likelihood be used 
for collocations consisting of more than two words?

  3. I need to compile a collocation frequency list as general (not 
genre- or sublanguage-specific) as possible. Do you consider the BNC 
Baby to be a corpus general enough for this task or do I need to use 
another corpus?

  4. I need to specify frequency thresholds for the collocations (or the 
collocation candidates to be more precise). Is f >= 3 considered to be 
an adequate cut-off? I know that I have to filter out the hapax and dis 
legomena, but from which frequency onwards does a collocation become 
statistically significant?
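
To make questions 2 and 4 concrete, here is a minimal sketch of what I 
understand by log-likelihood scoring combined with a frequency cut-off. 
It assumes the G2 (log-likelihood ratio) statistic computed over a 2x2 
contingency table of adjacent word pairs. The function names, the 
min_freq parameter and the toy token list are my own illustration and 
are not taken from any of the tools mentioned above.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """G2 log-likelihood ratio for a 2x2 table of observed counts.

    o11: pairs (w1, w2)        o12: pairs (w1, not w2)
    o21: pairs (not w1, w2)    o22: pairs (not w1, not w2)
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def score_bigrams(tokens, min_freq=3):
    """Score adjacent word pairs, skipping candidates with f < min_freq."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    left, right = Counter(), Counter()  # marginal counts per position
    for (w1, w2), f in bigrams.items():
        left[w1] += f
        right[w2] += f
    scores = {}
    for (w1, w2), f in bigrams.items():
        if f < min_freq:  # hypothetical cut-off, f >= 3 by default
            continue
        o12 = left[w1] - f
        o21 = right[w2] - f
        o22 = n - f - o12 - o21
        scores[(w1, w2)] = log_likelihood(f, o12, o21, o22)
    return scores


# Toy input only; a real run would use the tokenised corpus.
tokens = "strong tea strong tea strong tea weak coffee".split()
for pair, g2 in score_bigrams(tokens).items():
    print(pair, round(g2, 2))
```

In this sketch the cut-off only decides which candidates are scored at 
all, while the ranking comes from the association score. Whether that 
is the right division of labour is part of what I am asking.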

I won't ask whether there is a generally accepted definition of a 
collocation, because it would be like sending flame mail to the list. :)
Please forgive any signs of ignorance in the questions; I am taking my 
first steps in the field.

Thanks in advance and kind regards,
Nicholas Anagnostou