[Corpora-List] Algorithm for hypernym chaining?

Fri Jun 25 15:00:08 UTC 2010

Hi, Ken & Adam.

DFS is O(|V| + |E|).  For a fully connected graph (or nearly so), O(|E|) =
O(|V|^2).  Thus DFS is worst-case O(|V|^2), which is certainly in P (not in
NP or NP-Complete).  If you start your DFS at what appears to be your "root"
node (100001740), then you will get what you wanted.  Am I missing
something?

Regards,
--Eric

-----Original Message-----
From: corpora-bounces at uib.no [mailto:corpora-bounces at uib.no] On Behalf Of
Ken Litkowski
Sent: Friday, June 25, 2010 8:50 AM
To: Adam Funk
Cc: corpora mailing list
Subject: Re: [Corpora-List] Algorithm for hypernym chaining?

I think that graph algorithms (search for "algorithms for directed 
graphs") should do the trick efficiently. The key to your problem is 
that each of your entries constitutes an "edge" between "vertices". You 
compile a list of such edges and the algorithms then go through a 
depth-first search to construct the full digraph. Although these 
algorithms are NP-complete, the number of vertices (e.g., in WordNet) is 
sufficiently small that your task can be accomplished in at most a 
matter of a few minutes. WordNet has the added advantage that there are 
no cycles. (I have C++ code I'm willing to share, but it might not be 
the most efficient for your purposes.)

Adam Funk wrote:
> Sorry if this is a bit off-topic, but at least it's for an NLP
> application.  I have a file of pairs of WordNet synset numbers that
> represent hypernym-hyponym pairs, for example:
>
> 100001930,100001740
> 100002137,100001740
> 100002452,100001930
> 100002684,100001930
> 100003553,100002684
> 100003993,100003553
> 100004258,100003553
> 100004475,100004258
> 100005787,100004475
> 100005930,100004475
>
> and I want to produce all the longest possible chains of them.  The
> output for that list is as follows:
>
> 100002137,100001740
> 100002452,100001930,100001740
> 100003993,100003553,100002684,100001930,100001740
> 100005787,100004475,100004258,100003553,100002684,100001930,100001740
> 100005930,100004475,100004258,100003553,100002684,100001930,100001740
>
> I have a program that does this correctly for small samples of the data
> I'm supposed to process, but I've estimated (using a log plot) that it
> will take about 10^595 seconds to run over the whole list (about 89000
> pairs), so I think I'm doing it wrong.
>
> I haven't been able to find an algorithm for this, because I don't know
> what the computational problem is called.  Can anyone point me in the
> direction of something suitable?
>
> _______________________________________________
> Corpora mailing list
> Corpora at uib.no
> http://mailman.uib.no/listinfo/corpora
>
>   

-- 
Ken Litkowski                     TEL.: 301-482-0237
CL Research                       EMAIL: ken at clres.com
9208 Gue Road                     Home Page: http://www.clres.com
Damascus, MD 20872-1025 USA       Blog: http://www.clres.com/blog

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora

_______________________________________________
Corpora mailing list
Corpora at uib.no
http://mailman.uib.no/listinfo/corpora