[Lexicog] News and Offers from TshwaneDJe

Mike Maxwell maxwell at LDC.UPENN.EDU
Mon Mar 23 01:07:07 UTC 2009


Jan F. Ullrich wrote:
> This CCT also checks for consistencies in the hierarchical structure
> and we have another CCT that can actually fix the position of a field
> if it is placed improperly. But we haven’t used it extensively as the
> entry structure has been established quite early on and I think I
> have been able to work with it quite consistently.

If you have a program that you regularly run over the file, I guess it 
could be consistent.  But as you can tell, I'm skeptical.  Please don't 
take that personally, it's just that I've never seen an SFM file that is 
consistent.

I'm wondering whether you could write a regular expression that 
describes the possible fields.  Here's one that I wrote a couple years 
ago to find missing or incorrect fields in another dictionary:

(lx                #Headword (vernacular)
      t             #Tone
        (c          #Category (English)
          (d        #Definition (English)
           dfr      #Definition (French)
             (e     #Example sentence (vernacular)
              g     #Xltn of example sentence (English)
              gfr   #Xltn of example sentence (French)
             )*
             (l     #Derived form (vernacular)
              lg    #Morpheme gloss of \l field (English)
              lgfr  #Morpheme gloss of \l field (French)
             )*
             (m     #Comment (English)
              mfr   #Comment (French)
                (mi  #Idiom using vernacular word (Mawukakan)
                 mig #Translation of idiom (English)
                 migfr #Translation of idiom (French)
                )*  #Normally just one, but sometimes two
             )*
             (o     #Loanword source (English)
              ofr   #Loanword source (French)
             )?
             v*     #Variant (vernacular)
             s*     #Synonym (vernacular)
             x*     #Cross-ref
          )+
        )+
)

The '*' is the Kleene star, meaning zero or more; the '+' means one or 
more; and the '?' means zero or one.  So for example each lexical entry 
(\lx) is supposed to contain 
one or more "categories" (parts of speech), each of which contains one 
or more definitions in both English and French.  Each definition 
contains zero or more example sentences, each of which has a translation 
into both English and French, and so forth.
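
As a rough illustration, a field structure like the one above could be 
checked in Python along the following lines.  This is only a sketch under 
stated assumptions, not the checking program mentioned in this message: it 
assumes each entry has already been reduced to a list of marker names, and 
the names ENTRY_PATTERN and is_well_formed are made up for the example.

```python
import re

# The field structure from the message, rewritten as a Python regex over a
# string of marker names. Encoding assumption (not from the original): each
# marker is followed by exactly one space, so "l " cannot match part of "lg ".
ENTRY_PATTERN = re.compile(r"""
    lx\ t\                          # headword, tone
    (?:c\                           # one or more categories
      (?:d\ dfr\                    # one or more definitions (English, French)
        (?:e\ g\ gfr\ )*            # example sentences with translations
        (?:l\ lg\ lgfr\ )*          # derived forms with morpheme glosses
        (?:m\ mfr\                  # comments (English, French)
          (?:mi\ mig\ migfr\ )*     # idioms with translations
        )*
        (?:o\ ofr\ )?               # at most one loanword source
        (?:v\ )*                    # variants
        (?:s\ )*                    # synonyms
        (?:x\ )*                    # cross-references
      )+
    )+
""", re.VERBOSE)


def is_well_formed(markers):
    """Return True if the marker sequence of one entry fits the structure."""
    encoded = "".join(marker + " " for marker in markers)
    return ENTRY_PATTERN.fullmatch(encoded) is not None


# A minimal entry passes; dropping the French definition makes it fail.
print(is_well_formed(["lx", "t", "c", "d", "dfr"]))
print(is_well_formed(["lx", "t", "c", "d"]))
```

A yes/no match like this only says whether an entry conforms; it does not 
locate the offending field, which a real checking program would need to do.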

I'm not saying this is an ideal record structure, but it's what this 
particular dictionary was supposed to have.  We ran a dictionary 
checking program over the dictionary, and the program found and marked 
hundreds of errors.  I'd say that's pretty typical--of course, your 
mileage may vary!  If you could write the regular expression describing 
the desired field structure, I'd be happy to run the checking program on 
your dictionary.  (I could send you the checking program, but it's been 
several years since I've tried it, and I'm sure it's suffered from bit 
rot since then...)  Or if there were issues of proprietary data, you 
could send just the SFM markers with no data.
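
Reducing a file to its markers could be done with something like the 
following Python sketch.  It assumes every field starts on its own line 
with a backslash marker and that any line without a leading backslash is 
continuation data; the function name strip_data and the sample text are 
invented for the illustration.

```python
import re

# A marker is a backslash at the start of a line followed by non-space
# characters; everything after it on the line is treated as field data.
MARKER = re.compile(r"^\\(\S+)")


def strip_data(sfm_text):
    """Return the SFM text with all field contents removed."""
    kept = []
    for line in sfm_text.splitlines():
        match = MARKER.match(line)
        if match:                      # lines without a marker are dropped
            kept.append("\\" + match.group(1))
    return "\n".join(kept)


# Invented placeholder entry, not taken from any real dictionary.
sample = (
    "\\lx headword\n"
    "\\t tone\n"
    "\\c category\n"
    "\\d definition text\n"
    "that continues on a second line\n"
    "\\dfr French definition\n"
)
print(strip_data(sample))
```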

    Mike Maxwell
    CASL/ U MD

