[Corpora-List] Kita's cost criteria

Alex Wahl awahl1 at gmail.com
Fri Apr 19 17:12:01 UTC 2013


Hello,


I am trying to implement the cost criteria, a method of collocation
identification/extraction proposed by Kita, Kato, Omoto, and Yano (1994).
I would greatly appreciate feedback from anyone who is familiar with this
method, or from anyone who isn't but has some insight into my problem!
In short: I'm not sure how to recalculate the reduced cost of a collocation
when that collocation is embedded within multiple larger collocations. The
original paper, linked below, does not seem to address this issue.

http://www.el.kyutech.ac.jp/history/oldkyutech/htdocs/nlp/abst/vol1/no1/paper2.ps


Here's the idea:


Collocations are selected on the basis of their cost savings over smaller
units. The calculation of cost savings takes into account both the
frequency of a collocation and its size in words.


The procedure for collocation extraction is as follows:


1. Calculate a “reduced cost” value for each sequence “a” (up to some
specified length) in a corpus using the equation:


K(a) = (|a|-1) x f(a)


where:

K(a) = reduced cost

|a| = length of the sequence in words

f(a) = frequency of the sequence


2. Rank sequences according to K(a) values

3. Choose the higher-ranked sequences as collocation candidates

4. Re-calculate the reduced cost of each sequence, taking into account the
fact that any given sequence may itself be contained in a larger
collocation. The reduced cost value is lowered to reflect this, according
to the following (modified) equation:


K(a) = (|a|-1) x (f(a)-f(B))


where:

f(B) = frequency of the larger sequence B within which sequence “a” is
embedded
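

To make steps 1-4 concrete, here is a minimal Python sketch of how I read
them. It is not the authors' implementation: the function names, the top_n
cutoff, and the choice to start at length 2 (a single word always has
K = 0) are my own assumptions, and step 4 only handles a single containing
sequence B, exactly as in the equation above.

```python
from collections import Counter

def ngram_counts(tokens, max_len):
    """Count every contiguous word sequence of length 2..max_len."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def reduced_cost(seq, freq):
    """Step 1: K(a) = (|a| - 1) x f(a)."""
    return (len(seq) - 1) * freq

def rank_candidates(counts, top_n):
    """Steps 2-3: rank by K(a) and keep the top_n sequences (assumed cutoff)."""
    scored = {seq: reduced_cost(seq, f) for seq, f in counts.items()}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scored, key=lambda s: (-scored[s], s))[:top_n]

def adjusted_cost(seq, counts, container):
    """Step 4: K(a) = (|a| - 1) x (f(a) - f(B)) for ONE containing sequence B."""
    return (len(seq) - 1) * (counts[seq] - counts[container])

# Toy corpus, whitespace-tokenized (an assumption; any tokenizer would do).
tokens = "in spite of the rain in spite of it".split()
counts = ngram_counts(tokens, max_len=3)
candidates = rank_candidates(counts, top_n=1)
k_in_spite = adjusted_cost(("in", "spite"), counts, ("in", "spite", "of"))
```

In this toy corpus every occurrence of "in spite" sits inside "in spite
of", so the adjusted reduced cost of "in spite" drops to zero.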


What I don't understand is what to do when sequence “a” is a part of
multiple larger sequences that were also selected in step 3. For example,
consider the sequence “in spite.” Let's say it is embedded within 3 other
sequences that were found to be highly ranked in step 3: “in spite of,”
“totally in spite,” and “in spite of the.”


One possibility might be to subtract the frequencies of each of these
“super”sequences from the frequency of “in spite” when recalculating:


K(in spite) = (|in spite|-1) x (f(in spite) - f(in spite of)
              - f(totally in spite) - f(in spite of the))


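OK, but doesn't this seem a bit redundant? “in spite of” and “in spite of
the” overlap: every occurrence of “in spite of the” is also an occurrence
of “in spite of,” so those occurrences would be subtracted twice.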
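

To illustrate the worry, here is a small Python sketch comparing the
subtraction above with one possible alternative: counting only those
occurrences of “a” that are not covered by any occurrence of a selected
supersequence. The alternative is purely my own guess at a fix and does not
come from the paper; the function names and the toy sentence are
hypothetical.

```python
def find_occurrences(tokens, seq):
    """Return the start positions of every occurrence of seq in tokens."""
    seq = tuple(seq)
    n = len(seq)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == seq]

def naive_adjusted_cost(tokens, seq, supers):
    """Subtract the frequency of every supersequence from f(a)."""
    f_a = len(find_occurrences(tokens, seq))
    f_supers = sum(len(find_occurrences(tokens, s)) for s in supers)
    return (len(seq) - 1) * (f_a - f_supers)

def covered_adjusted_cost(tokens, seq, supers):
    """Count only occurrences of seq lying outside every supersequence.

    This is an assumed alternative, not the method from the paper.
    """
    seq = tuple(seq)
    n = len(seq)
    covered = set()
    for s in supers:
        for start in find_occurrences(tokens, s):
            # Mark each position of seq that lies inside this occurrence of s.
            for i in range(start, start + len(s) - n + 1):
                if tuple(tokens[i:i + n]) == seq:
                    covered.add(i)
    free = [i for i in find_occurrences(tokens, seq) if i not in covered]
    return (n - 1) * len(free)

tokens = "in spite of the rain , totally in spite of it , in spite".split()
seq = ("in", "spite")
supers = [("in", "spite", "of"),
          ("totally", "in", "spite"),
          ("in", "spite", "of", "the")]

naive = naive_adjusted_cost(tokens, seq, supers)      # 1 x (3 - 2 - 1 - 1) = -1
covered = covered_adjusted_cost(tokens, seq, supers)  # 1 x 1 = 1
```

Here “in spite” occurs three times, two of them inside the larger
sequences. Subtracting all three supersequence frequencies gives a negative
reduced cost, while the position-based count leaves the one free
occurrence.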


Thanks in advance for your help!


Alex Wahl