[Corpora-List] Wordnet file format

Yuri Leikind YuriLeikind at scnsoft.com
Fri Aug 23 14:01:45 UTC 2002


Hello all,


Maybe someone on the list can help me understand
how the WordNet database is organised.

Here is a typical entry in the file data.verb:

01288779 36 v 03 conduct 0 lead 0 direct 2 002 @ 01274998 v 0000 $ 01289007 v 0000 01 + 08 00 | lead, as in the performance of a musical composition; "conduct an orchestra; Bairenboim conducted the Chicago symphony for years"

This line represents a synset with the words "conduct", "lead", and "direct".

The number after each word is its so-called lex_id:

 lex_id         One digit hexadecimal integer that, when appended onto lemma, uniquely identifies a sense within a
                      lexicographer file.  lex_id numbers usually start with 0, and are incremented as additional  senses  of  the
                      word  are  added  to  the same file, although there is no requirement that the numbers be consecutive or
                      begin with 0.  Note that a value of 0 is the default, and therefore  is  not  present  in  lexicographer
                      files.

Ok, I get it - "conduct" in meaning 0, "direct" in meaning 2
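
A minimal Python sketch of this reading of the data.verb line, assuming the field order given in wndb(5WN) as I understand it (offset, lexicographer file number, synset type, a hexadecimal word count, word/lex_id pairs, a decimal pointer count, four-field pointers, then verb frames and the gloss). The function name parse_data_line and the dictionary keys are my own, not part of WordNet:

```python
def parse_data_line(line):
    """Split one data.verb line into fields (layout assumed from wndb(5WN))."""
    fields, _, gloss = line.partition(" | ")
    tok = fields.split()
    w_cnt = int(tok[3], 16)                  # word count is hexadecimal
    pos = 4
    words = []
    for _ in range(w_cnt):
        # each word is followed by its one-digit hexadecimal lex_id
        words.append((tok[pos], int(tok[pos + 1], 16)))
        pos += 2
    p_cnt = int(tok[pos])                    # pointer count is decimal
    pos += 1
    pointers = []
    for _ in range(p_cnt):
        # (pointer_symbol, synset_offset, pos, source/target)
        pointers.append(tuple(tok[pos:pos + 4]))
        pos += 4
    return {
        "offset": tok[0],
        "lex_filenum": int(tok[1]),
        "ss_type": tok[2],
        "words": words,
        "pointers": pointers,
        "frames": tok[pos:],                 # verb frame fields, left unparsed
        "gloss": gloss,
    }


line = ('01288779 36 v 03 conduct 0 lead 0 direct 2 002 '
        '@ 01274998 v 0000 $ 01289007 v 0000 01 + 08 00 '
        '| lead, as in the performance of a musical composition; '
        '"conduct an orchestra; Bairenboim conducted the Chicago symphony for years"')
entry = parse_data_line(line)
print(entry["words"])
```

The (lemma, lex_id) pairs it prints are exactly the ones discussed above; the pointer and frame fields are only split apart, not interpreted.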

But in the output of the wn program the meanings are numbered differently:

1. (40) lead, take, direct, conduct, guide -- (take somebody somewhere; "We lead him to our chief"; "can you take me to the main entrance?"; "He conducted us to the palace")
..........
10. (8) conduct, lead, direct -- (lead, as in the performance of a musical composition; "conduct an orchestra; Bairenboim conducted the Chicago symphony for years")

Here, our "lead" with lex_id 0 is sense number 10.

How these sense numbers are obtained is explained in the docs:

 Sense Numbers
       Senses  in  WordNet  are  generally  ordered from most to least frequently used, with the most common sense numbered 1.
       Frequency of use is determined by the number of times a sense is tagged in  the  various  semantic  concordance  texts.
       Senses  that  are  not  semantically  tagged  follow  the ordered senses.  The tagsense_cnt field for each entry in the
       index.pos files indicates how many of the senses in the list have been tagged.

       The cntlist(5WN) file provided with the database lists the number of times each sense is tagged in the semantic
       concordances.   The  data  from cntlist is used by grind(1WN) to order the senses of each word.
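
The ordering rule quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as the documentation states it, not how grind(1WN) is actually implemented. The function name order_senses is made up, the untagged sense is invented, and I am assuming that the parenthesised numbers 40 and 8 in the wn output are tag counts:

```python
def order_senses(tag_counts):
    """Return ({sense: sense_number}, tagsense_cnt) for one word.

    tag_counts maps a sense label to the number of times it was tagged.
    """
    tagged = [s for s, n in tag_counts.items() if n > 0]
    untagged = [s for s, n in tag_counts.items() if n == 0]
    # most frequently tagged sense first; untagged senses follow
    tagged.sort(key=lambda s: tag_counts[s], reverse=True)
    ranked = tagged + untagged
    numbering = {sense: i for i, sense in enumerate(ranked, start=1)}
    return numbering, len(tagged)


counts = {
    "hypothetical untagged sense": 0,      # invented for illustration
    "conduct an orchestra": 8,             # assumed tag count from wn output
    "take somebody somewhere": 40,         # assumed tag count from wn output
}
numbering, tagsense_cnt = order_senses(counts)
print(numbering, tagsense_cnt)
```

With only these three toy senses the orchestra sense comes out as number 2; in the real database it is number 10 because the other senses of the word are ranked in between.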

Now the questions:

1) Where can I see the so-called lexicographer files?

2) What is the default lex_id, with value 0?

3) Sense numbers are obtained via the cntlist file. I was unable to find an explanation of the format of this file:
   27 lead%2:42:12:: 3

4) If a synset can be viewed as a set of words sharing a common meaning, and each word has its own lex_id, which is
   also a unique meaning identifier within one word, then how is it possible that there are different synsets in which
   one and the same word has the same lex_id?
   For example:

   4258476  spark_advance 0 lead 1
   3077077  jumper_cable 0 jumper_lead 0 lead 1

   To me this is nonsense, or I am missing something important.


I'd be grateful if someone could enlighten me.

___
Best regards,
Yuri Leikind


To iterate is human,
to recurse is divine.


