<div dir="ltr">
<p style="margin-bottom:0in">Hello,</p>
<p style="margin-bottom:0in"><br>
</p>
I am attempting to implement the cost criteria method of collocation identification/extraction by Kita, Kato, Omoto, and Yano (1994). I would greatly appreciate feedback from anyone who is familiar with this method, or from anyone who isn't but has some insight into my problem! In short, I'm not sure how to recalculate the reduced cost of a collocation when that collocation is embedded within multiple larger collocations. The original paper, linked below, does not seem to address this issue.<div>
<br></div><div><a href="http://www.el.kyutech.ac.jp/history/oldkyutech/htdocs/nlp/abst/vol1/no1/paper2.ps">http://www.el.kyutech.ac.jp/history/oldkyutech/htdocs/nlp/abst/vol1/no1/paper2.ps</a><br><br><br>Here's the procedure:<br>
<br><br>Collocations are selected on the basis of their cost savings over smaller units. The calculation of cost savings takes into account both collocation frequency and size of collocation.<br><br><br>The procedure for collocation extraction is as follows:<br>
<br><br>1. Calculate a “reduced cost” value for each sequence “a” (up to some specified length) in a corpus using the equation:<br><br><br>K(a) = (|a|-1) x f(a)<br><br><br>where:<br><br>K(a) = reduced cost<br><br>|a| = length of the sequence in words<br>
<br>f(a) = frequency of the sequence<br><br><br>2. Rank sequences according to their K(a) values.<br><br>3. Choose the higher-ranked sequences as candidates.<br><br>4. Recalculate the reduced cost of each sequence to take into account the fact that any given sequence may actually be contained in a larger collocation. The reduced cost value is lowered accordingly, using the following (modified) equation:<br>
<br><br>K(a) = (|a|-1) x (f(a)-f(B))<br><br><br>where:<br><br>the new term f(B) is the frequency of the larger sequence B in which sequence “a” is embedded.<br><br><br>What I don't understand is what to do when sequence “a” is part of multiple larger sequences that were also selected in step 3. For example, consider the sequence “in spite.” Let's say it is embedded within three other sequences that were ranked highly in step 3: “in spite of,” “totally in spite,” and “in spite of the.”<br>
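<br>For concreteness, steps 1, 2, and 4 can be sketched in Python for the simple case where a sequence has only one enclosing candidate. This is a minimal sketch rather than the paper's implementation: the toy corpus is made up, whitespace tokenization is assumed, and the names ngram_counts and reduced_cost are placeholders, not taken from the paper.<br>

```python
from collections import Counter


def ngram_counts(tokens, max_len):
    """Count every contiguous word sequence of length 1..max_len."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def reduced_cost(seq, counts, container=None):
    """K(a) = (|a|-1) x f(a), or (|a|-1) x (f(a)-f(B)) when a single
    enclosing sequence B is passed as `container` (step 4)."""
    f_a = counts[seq]
    f_b = counts[container] if container is not None else 0
    return (len(seq) - 1) * (f_a - f_b)


# Toy corpus, invented purely for illustration.
tokens = ("he left in spite of the rain and in spite of her "
          "and totally in spite").split()
counts = ngram_counts(tokens, max_len=4)

a = ("in", "spite")
b = ("in", "spite", "of")

k_initial = reduced_cost(a, counts)       # step 1: 1 x 3 = 3
k_adjusted = reduced_cost(a, counts, b)   # step 4: 1 x (3 - 2) = 1

# Step 2: rank all sequences by their initial reduced cost.
ranked = sorted(counts, key=lambda s: reduced_cost(s, counts), reverse=True)
```

With one enclosing sequence the adjustment is unambiguous; the open question is what to pass when there are several.<br>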
<br><br>One possibility might be to subtract the frequencies of each of these “super”sequences from the frequency of “in spite” when recalculating:<br><br><br>K(in spite) = (|in spite|-1) x (f(in spite)-f(in spite of)-f(totally in spite)-f(in spite of the))<br>
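<br>To make this option concrete, here is a minimal Python sketch on a made-up toy corpus. It computes the naive adjusted frequency (subtracting every supersequence frequency) and, purely for comparison, the number of distinct occurrences of “in spite” that fall inside at least one supersequence. The second count is only an illustrative assumption for exposing the overlap; it is not taken from Kita et al., and the function names are hypothetical.<br>

```python
def count(tokens, seq):
    """Number of times the word list `seq` occurs contiguously in `tokens`."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)


def covered_occurrences(tokens, seq, supers):
    """Number of distinct occurrences of `seq` that lie inside at least one
    occurrence of any sequence in `supers`."""
    n = len(seq)
    covered = set()
    for sup in supers:
        m = len(sup)
        # Positions inside the supersequence where `seq` starts.
        offsets = [k for k in range(m - n + 1) if sup[k:k + n] == seq]
        for i in range(len(tokens) - m + 1):
            if tokens[i:i + m] == sup:
                covered.update(i + k for k in offsets)
    return len(covered)


# Toy corpus, invented purely for illustration.
tokens = ("he left in spite of the rain and in spite of her "
          "and totally in spite").split()

seq = ["in", "spite"]
supers = [
    ["in", "spite", "of"],
    ["totally", "in", "spite"],
    ["in", "spite", "of", "the"],
]

f_seq = count(tokens, seq)                                 # 3
naive_f = f_seq - sum(count(tokens, s) for s in supers)    # 3 - 2 - 1 - 1 = -1
naive_k = (len(seq) - 1) * naive_f                         # -1

n_covered = covered_occurrences(tokens, seq, supers)       # 3 distinct occurrences
distinct_k = (len(seq) - 1) * (f_seq - n_covered)          # 0
```

On this toy corpus the naive version comes out negative (-1), even though only 3 distinct occurrences of “in spite” are covered by the larger sequences.<br>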
<br><br>OK, but doesn't this seem a bit redundant? Every occurrence of “in spite of the” is also an occurrence of “in spite of,” so those occurrences would be subtracted twice.<br><br><br>Thanks in advance for your help!<br><br><br>Alex Wahl</div></div>