[Corpora-List] Questions about collocations and collocation extraction tools

Mark Davies Mark_Davies at byu.edu
Wed Aug 2 14:52:46 UTC 2006


Hi Nicholas,

> 3. I need to compile a collocation frequency list as general (not
> genre- or sublanguage- specific) as possible. Do you consider 
> the BNC Baby to be a corpus general enough for this task or 
> do I need to use another corpus?
> 
>   4. I need to specify frequency thresholds for the 
> collocations (or the collocation candidates to be more 
> precise). Is f >= 3 considered to be an adequate cut-off? I 
> know that I have to filter out the hapax and dis legomena, 
> but from which frequency onwards does a collocation become 
> statistically significant?

For BNC-specific information on collocations, you might look at
http://view.byu.edu.

This interface to the BNC allows you to look for collocates within a
20-word window, and to sort the collocates by raw frequency or by
something akin to a z-score. It also allows you to limit the query to
specific registers/genres in the BNC, and to specify minimum frequency
thresholds. Finally, you can compare the collocates for a given word in
two (sets of) registers, or compare the collocates of two competing
words, all with one simple query.
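
For anyone who wants to try the same kind of query on their own
tokenized text, here is a minimal sketch in Python. It is not the code
behind the interface above; the function name, the default span, and
the particular z-score formula (observed minus expected co-occurrences,
divided by an approximate standard deviation) are all assumptions made
for illustration, and the interface's own score may well be computed
differently.

```python
import math
from collections import Counter

def collocates(tokens, node, span=4, min_freq=3):
    """Hypothetical helper: score words found within `span` tokens of `node`.

    Returns (word, observed_frequency, z_score) tuples, highest z first.
    The z-score used here is one common textbook variant, chosen as an
    assumption; it is not taken from any particular tool.
    """
    n = len(tokens)
    corpus_freq = Counter(tokens)
    observed = Counter()
    window_slots = 0  # total number of positions inspected around the node

    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - span), min(n, i + span + 1)
        for j in range(lo, hi):
            if j != i:
                observed[tokens[j]] += 1
                window_slots += 1

    rows = []
    for word, obs in observed.items():
        if obs < min_freq:  # frequency cut-off for collocation candidates
            continue
        p = corpus_freq[word] / n        # chance of `word` at any position
        expected = p * window_slots      # expected hits if words were independent
        variance = expected * (1 - p)
        if variance == 0:                # degenerate case, skip rather than divide by zero
            continue
        rows.append((word, obs, (obs - expected) / math.sqrt(variance)))

    rows.sort(key=lambda r: r[2], reverse=True)
    return rows

tokens = "strong tea and strong coffee but weak tea and strong tea".split()
for word, obs, z in collocates(tokens, "tea", span=1, min_freq=2):
    print(word, obs, round(z, 2))
```

Sorting on the second field instead of the third gives the raw
frequency ordering, and running the function separately on the tokens
of two registers gives two lists that can be compared side by side.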

Any questions, please feel free to ask.

Best,

Mark Davies

=================================================

Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906

http://davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **

================================================= 


