[Corpora-List] Questions about collocations and collocation extraction tools
Nicholas Anagnostou
nanagnos at cis.strath.ac.uk
Tue Aug 1 21:15:20 UTC 2006
Dear all,
I am a student at Strathclyde University, Graduate School of
Informatics, and I am working on a dissertation project titled "Using
collocation frequencies in determining the relative reading complexity
of texts". A core part of my project is extracting collocations from a
corpus, in this case the BNC Baby. I have some questions regarding
collocations and I would be more than grateful if you could share your
expertise.
1. I wish to compare software tools that can be used for collocation
extraction. I wanted to include QWICK and TACT but I haven't been able
to locate them on the Internet. Are they still publicly available? If
not, is there another way to obtain them?
2. I've found that the de facto standard for measuring the statistical
association between words, in order to discover collocations, is the
log-likelihood. Do you agree with that? Can the log-likelihood be used
for collocations consisting of more than two words?
3. I need to compile a collocation frequency list as general (not
genre- or sublanguage-specific) as possible. Do you consider the BNC
Baby to be a corpus general enough for this task or do I need to use
another corpus?
4. I need to specify frequency thresholds for the collocations (or the
collocation candidates to be more precise). Is f >= 3 considered to be
an adequate cut-off? I know that I have to filter out the hapax and dis
legomena, but from which frequency onwards does a collocation become
statistically significant?
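
To make questions 2 and 4 concrete, here is a minimal sketch of the
log-likelihood ratio (G2) computed from a 2x2 contingency table for
adjacent word pairs, with a frequency cut-off applied to the candidates.
The function names, the min_freq parameter and the toy sentence are
assumptions made for illustration only. They do not come from any
particular tool, and the default of 3 simply mirrors the cut-off asked
about in question 4 rather than recommending it.

```python
import math
from collections import Counter


def log_likelihood(o11, o12, o21, o22):
    """G2 statistic for a 2x2 table of observed counts.

    o11: w1 followed by w2        o12: w1 followed by another word
    o21: another word before w2   o22: neither w1 first nor w2 second
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals
    cells = (
        (o11, r1 * c1 / n),
        (o12, r1 * c2 / n),
        (o21, r2 * c1 / n),
        (o22, r2 * c2 / n),
    )
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)


def score_bigrams(tokens, min_freq=3):
    """Rank adjacent word pairs by G2, skipping pairs rarer than min_freq."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(tokens[:-1])   # counts of words in first position
    second = Counter(tokens[1:])   # counts of words in second position
    n = sum(bigrams.values())
    scores = {}
    for (w1, w2), o11 in bigrams.items():
        if o11 < min_freq:
            continue
        o12 = first[w1] - o11
        o21 = second[w2] - o11
        o22 = n - o11 - o12 - o21
        scores[(w1, w2)] = log_likelihood(o11, o12, o21, o22)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    text = "strong tea is nice . strong tea is hot . strong tea was cold ."
    for pair, g2 in score_bigrams(text.split()):
        print(pair, round(g2, 2))
```

The sketch covers two-word candidates only. Whether and how the measure
extends to longer sequences is exactly what question 2 asks.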
I won't ask if there is a generally acceptable definition of a
collocation, because it would be like sending flame mail to the list. :)
Please forgive any signs of ignorance in the questions; I am taking my
first steps in the field.
Thanks in advance and kind regards
Nicholas Anagnostou