[Lexicog] News and Offers from TshwaneDJe
Mike Maxwell
maxwell at LDC.UPENN.EDU
Mon Mar 23 01:07:07 UTC 2009
Jan F. Ullrich wrote:
> This CCT also checks for consistencies in the hierarchical structure
> and we have another CCT that can actually fix the position of a field
> if it is placed improperly. But we haven’t used it extensively as the
> entry structure has been established quite early on and I think I
> have been able to work with it quite consistently.
If you have a program that you regularly run over the file, I guess it
could be consistent. But as you can tell, I'm skeptical. Please don't
take that personally, it's just that I've never seen an SFM file that is
consistent.
I'm wondering whether you could write a regular expression that
describes the possible fields. Here's one that I wrote a couple years
ago to find missing or incorrect fields in another dictionary:
(lx #Headword (vernacular)
  t #Tone
  (c #Category (English)
    (d #Definition (English)
      dfr #Definition (French)
      (e #Example sentence (vernacular)
        g #Xltn of example sentence (English)
        gfr #Xltn of example sentence (French)
      )*
      (l #Derived form (vernacular)
        lg #Morpheme gloss of \l field (English)
        lgfr #Morpheme gloss of \l field (French)
      )*
      (m #Comment (English)
        mfr #Comment (French)
        (mi #Idiom using vernacular word (Mawukakan)
          mig #Translation of idiom (English)
          migfr #Translation of idiom (French)
        )* #Normally just one, but sometimes two
      )*
      (o #Loanword source (English)
        ofr #Loanword source (French)
      )?
      v* #Variant (vernacular)
      s* #Synonym (vernacular)
      x* #Cross-ref
    )+
  )+
)
The '*' is the Kleene star, meaning zero or more; the '+' means one or
more; the '?' means zero or one. So for example each lexical entry
(\lx) is supposed to contain one or more "categories" (parts of
speech), each of which contains one or more definitions in both English
and French. Each definition contains zero or more example sentences,
each of which has a translation into both English and French, and so
forth.
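To make the idea concrete, here is a minimal sketch in Python of how such an expression might be applied to one record. It is not the checking program mentioned in this message: the function names, the ';' delimiter trick, and the two sample records (with placeholder data) are all assumptions made for illustration. The schema is the same one as above with the comments left out.

```python
import re

# Field structure from the message above, comments omitted for brevity.
SCHEMA = r"""
(lx t
  (c
    (d dfr
      (e g gfr)*
      (l lg lgfr)*
      (m mfr (mi mig migfr)*)*
      (o ofr)?
      v* s* x*
    )+
  )+
)
"""

def compile_schema(schema):
    """Turn the marker-level expression into an ordinary regex.

    Each marker name becomes a token ending in ';' so that, for example,
    'l' cannot accidentally match the start of 'lg' or 'lgfr'.
    """
    no_comments = re.sub(r"#.*", "", schema)
    tokenised = re.sub(r"[a-z]+", lambda m: "(?:" + m.group(0) + ";)", no_comments)
    return re.compile(re.sub(r"\s+", "", tokenised))

def markers_of(record):
    """Return the markers of one SFM record, in order of appearance."""
    return re.findall(r"^\\([a-z]+)", record, flags=re.MULTILINE)

def is_well_formed(record, pattern):
    """True if the record's marker sequence matches the whole schema."""
    sequence = "".join(marker + ";" for marker in markers_of(record))
    return pattern.fullmatch(sequence) is not None

PATTERN = compile_schema(SCHEMA)

# Hypothetical records with placeholder data, not taken from any dictionary.
GOOD = r"""\lx xxx
\t xxx
\c xxx
\d xxx
\dfr xxx
"""

BAD = r"""\lx xxx
\t xxx
\c xxx
\d xxx
"""

print(is_well_formed(GOOD, PATTERN))
print(is_well_formed(BAD, PATTERN))
```

The second record fails because its \d field has no \dfr after it. A real checker would also want to report where in the record the mismatch occurs, which this sketch does not attempt.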
I'm not saying this is an ideal record structure, but it's what this
particular dictionary was supposed to have. We ran a dictionary
checking program over the dictionary, and the program found and marked
hundreds of errors. I'd say that's pretty typical--of course, your
mileage may vary! If you could write the regular expression describing
the desired field structure, I'd be happy to run the checking program on
your dictionary. (I could send you the checking program, but it's been
several years since I've tried it, and I'm sure it's suffered from bit
rot since then...) Or if there were issues of proprietary data, you
could send just the SFM markers with no data.
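Stripping the data could be done with something like the following Python sketch. It assumes one field per line with the marker at the start of the line, and it simply drops continuation lines; the function name is made up for this example.

```python
import re

def strip_field_data(sfm_text):
    """Keep only the leading backslash marker of each field line."""
    kept = []
    for line in sfm_text.splitlines():
        match = re.match(r"\\\S+", line)
        if match:
            kept.append(match.group(0))   # marker only, data discarded
        elif not line.strip():
            kept.append("")               # keep blank lines between records
        # any other line is wrapped field data and is dropped
    return "\n".join(kept)
```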
Mike Maxwell
CASL/ U MD