[Corpora-List] Questions about collocations and collocation extraction tools

Serge HEIDEN Slh at ens-lsh.fr
Wed Aug 2 09:08:19 UTC 2006


Nicholas,

On Tuesday, August 01, 2006 11:15 PM [GMT+1=CET],
Nicholas Anagnostou <nanagnos at cis.strath.ac.uk> wrote:

>> I am a student at Strathclyde University, Graduate School of
>> Informatics, and I am working on a dissertation project titled "Using
>> collocation frequencies in determining the relative reading
>> complexity of texts". A core part of my project is extracting
>> collocations from a corpus, in this case the BNC Baby. I have some
>> questions regarding collocations and I would be more than grateful
>> if you could share your expertise.
>>
>>   1. I wish to compare software tools that can be used for
>> collocation extraction. I wanted to include QWICK and TACT but I
>> haven't been able to locate them on the Internet. Are they publicly
>> available anymore or not? If not, is there a way to get them?

I suggest that you have a look at:
- TAPoRware (http://taporware.mcmaster.ca/), which may have been designed
as a continuation of TACT (I am not sure of that);
- http://www.collocations.de/software.html

>>   2. I've found that the de facto standard for measuring the
>> statistical association between words, in order to discover
>> collocations, is the log-likelihood. Do you agree with that? Can the
>> log-likelihood be used for collocations consisting of more than two
>> words?

Again, http://www.collocations.de/ should give you a good starting point:
it offers a panorama of the whole bestiary of available association measures.
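For concreteness, the log-likelihood measure is usually computed as Dunning's G² statistic over a 2x2 contingency table of bigram counts. A minimal sketch (my own illustration, not taken from any of the tools above):

```python
import math

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G^2) for a 2x2 bigram contingency table.

    o11 = freq(w1 w2), o12 = freq(w1 without w2),
    o21 = freq(w2 without w1), o22 = freq(neither),
    all counted over the bigram tokens of the corpus.
    """
    n = o11 + o12 + o21 + o22
    # expected counts under the independence hypothesis
    e11 = (o11 + o12) * (o11 + o21) / n
    e12 = (o11 + o12) * (o12 + o22) / n
    e21 = (o21 + o22) * (o11 + o21) / n
    e22 = (o21 + o22) * (o12 + o22) / n

    def term(obs, exp):
        # 0 * log(0) is taken as 0 by convention
        return obs * math.log(obs / exp) if obs > 0 else 0.0

    return 2 * (term(o11, e11) + term(o12, e12)
                + term(o21, e21) + term(o22, e22))

# A pair seen far more often than chance predicts scores high;
# a pair occurring exactly at chance level scores (near) zero.
print(g2(100, 900, 900, 98100))   # strongly associated
print(g2(10, 990, 990, 98010))    # exactly at independence -> 0
```

Extending this beyond word pairs is less straightforward: the 2x2 table generalizes to a 2^n contingency cube for n-grams, which is one reason most tools stop at bigrams.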

>>   3. I need to compile a collocation frequency list as general (not
>> genre- or sublanguage-specific) as possible. Do you consider the BNC
>> Baby to be a corpus general enough for this task or do I need to use
>> another corpus?

Although the BNC Baby doesn't claim to be representative of the whole
BNC, it may suffer from the same typological 'text types' bias analyzed by
David Lee in his PhD dissertation. The article http://llt.msu.edu/vol5num3/lee/default.html
should give you an idea of the way he analyzes the metadata of the BNC
texts to discuss the genre, register, text type, domain and style
representativeness of the BNC. He designed the "BNC Index" to reclassify
all the BNC texts with a didactic perspective.
This could help you design a corpus better oriented toward the
"representativeness" versus "genericity/specificity" tradeoff, especially
if your "reading complexity" analysis also has a didactic goal.
Finally, I would suggest keeping in mind that corpus compilation is time
consuming, and that your corpus design strategy should include an
"available time to complete the work" component to constrain your choices,
independently of any "soundness" principle.

>>   4. I need to specify frequency thresholds for the collocations (or
>> the collocation candidates to be more precise). Is f >= 3 considered
>> to be an adequate cut-off? I know that I have to filter out the
>> hapax and dis legomena, but from which frequency onwards does a
>> collocation become statistically significant?

I would suggest letting the reading complexity question itself guide that
kind of optimization threshold. Statistical significance is something
difficult to handle in corpus linguistics; if you use it, I would suggest
binding it to the ultimate question you are asking of the data.
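Mechanically, applying a cutoff such as f >= 3 to the candidate list is simple; a minimal sketch of counting adjacent word pairs and dropping the hapax and dis legomena (my own illustration, with a hypothetical token list):

```python
from collections import Counter

def bigram_candidates(tokens, min_freq=3):
    """Count adjacent word pairs and keep only those at or above the
    frequency cutoff (min_freq=3 removes hapax and dis legomena)."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {pair: f for pair, f in counts.items() if f >= min_freq}

tokens = ("of the " * 3 + "in a " + "of a ").split()
print(bigram_candidates(tokens))   # → {('of', 'the'): 3}
```

The harder question, as you note, is justifying the cutoff value itself, and that justification should come from your reading complexity task rather than from the counting machinery.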

>> I won't ask if there is a generally acceptable definition of a
>> collocation, because it would be like sending flame mail to the
>> list. :) Please forgive any signs of ignorance in the questions, I
>> am taking my first steps in the field.

Each application context has its own definition. I would justify this by
the fact that a "distance" between two "words" is too simple and biased a
notion (what is the significance of a distance in linguistics? what is a
word in linguistics? what is a context in which two "words" meet, in
linguistics?) to capture any particular linguistic phenomenon. It probably
captures a combination of MANY interdependent phenomena.
I would suggest using the collocation definition given by the reading
complexity measurement field you work in.

Best,

    [Serge]

_____________________________________________________________
Serge Heiden, slh at ens-lsh.fr, https://weblex.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex, tél. +33(0)622003883
