[Corpora-List] Algorithm for hypernym chaining?
Steven Bethard
bethard at stanford.edu
Fri Jun 25 14:56:04 UTC 2010
On Jun 25, 2010, at 4:50 PM, Ken Litkowski wrote:
> I think that graph algorithms (search for "algorithms for directed graphs") should do the trick efficiently. The key to your problem is that each of your entries constitutes an "edge" between "vertices". You compile a list of such edges and the algorithms then go through a depth-first search to construct the full digraph. Although these algorithms are NP-complete, the number of vertices (e.g., in WordNet) is sufficiently small that your task can be accomplished in at most a matter of a few minutes. WordNet has the added advantage that there are no cycles. (I have C++ code I'm willing to share, but it might not be the most efficient for your purposes.)
WordNet actually has some cycles. For example, in WordNet, "restrain" and "inhibit" are mutually
recursive:
# S: (v) restrain, keep, keep back, hold back (keep under control; keep in check)
"suppress a smile"; "Keep your temper"; "keep your cool"
* direct hypernym / inherited hypernym / sister term
o S: (v) inhibit, bottle up, suppress (control and refrain from showing; of
emotions, desires, impulses, or behavior)
+ direct hypernym / inherited hypernym / sister term
# S: (v) restrain, keep, keep back, hold back (keep under
control; keep in check) "suppress a smile"; "Keep your temper"; "keep your cool"
Don't know why it has such cycles, but if you're working with WordNet synsets, you'll probably run into this issue eventually.
Steve
>
> Adam Funk wrote:
>> Sorry if this is a bit off-topic, but at least it's for an NLP
>> application. I have a file of pairs of WordNet synset numbers that
>> represent hypernym-hyponym pairs, for example:
>>
>> 100001930,100001740
>> 100002137,100001740
>> 100002452,100001930
>> 100002684,100001930
>> 100003553,100002684
>> 100003993,100003553
>> 100004258,100003553
>> 100004475,100004258
>> 100005787,100004475
>> 100005930,100004475
>>
>> and I want to produce all the longest possible chains of them. The
>> output for that list is as follows:
>>
>> 100002137,100001740
>> 100002452,100001930,100001740
>> 100003993,100003553,100002684,100001930,100001740
>> 100005787,100004475,100004258,100003553,100002684,100001930,100001740
>> 100005930,100004475,100004258,100003553,100002684,100001930,100001740
>>
>> I have a program that does this correctly for small samples of the data
>> I'm supposed to process, but I've estimated (using a log plot) that it
>> will take about 10^595 seconds to run over the whole list (about 89000
>> pairs), so I think I'm doing it wrong.
>>
>> I haven't been able to find an algorithm for this, because I don't know
>> what the computational problem is called. Can anyone point me in the
>> direction of something suitable?
>>
>> _______________________________________________
>> Corpora mailing list
>> Corpora at uib.no
>> http://mailman.uib.no/listinfo/corpora
>>
>>
>
> --
> Ken Litkowski TEL.: 301-482-0237
> CL Research EMAIL: ken at clres.com
> 9208 Gue Road Home Page: http://www.clres.com
> Damascus, MD 20872-1025 USA Blog: http://www.clres.com/blog
>
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora
More information about the Corpora
mailing list