[Lexicog] preparing Shoebox lexicons for publication/ export

List Facilitator lexicography2004 at YAHOO.COM
Tue Jan 20 19:07:21 UTC 2004


----- Original Message -----
From: "Mike Maxwell" <maxwell at ldc.upenn.edu>
To: <lexicographylist at yahoogroups.com>
Sent: Friday, January 16, 2004 6:53 AM
Subject: Re: [Lexicog] preparing Shoebox lexicons for publication/ export


> --- In lexicographylist at yahoogroups.com, Koontz John E
> <john.koontz at c...> wrote:
> > My involvement with Shoebox has probably been less extensive
> > than yours, but one thing I remember doing was passing
> > databases through AWK programs to generate keys on each
> > line that would help me sort the fields in each
> > record into some canonical order.
>
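
The key-and-sort idea described above can be sketched roughly as follows. This is only an illustration in Python rather than AWK, not the original programs: it assumes a standard-format file in which every field starts with a backslash marker and every record starts with \lx, and the marker list in CANONICAL_ORDER is hypothetical.

```python
# Hypothetical canonical order; a real project would define its own list.
CANONICAL_ORDER = ["lx", "ps", "ge", "de", "xv", "xe"]


def parse_records(text):
    """Split standard-format text into records, each a list of (marker, value)."""
    records = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            # Continuation line: append it to the previous field's value, if any.
            if line.strip() and records and records[-1]:
                marker, value = records[-1][-1]
                records[-1][-1] = (marker, (value + " " + line.strip()).strip())
            continue
        marker, _, value = line[1:].partition(" ")
        # Assumption: \lx opens a new record.
        if marker == "lx" or not records:
            records.append([])
        records[-1].append((marker, value.strip()))
    return records


def sort_fields(record):
    """Reorder one record's fields; unknown markers go last, in original order."""
    rank = {m: i for i, m in enumerate(CANONICAL_ORDER)}
    # sorted() is stable, so repeated markers keep their relative order.
    return sorted(record, key=lambda field: rank.get(field[0], len(rank)))
```

Calling sort_fields on each record returned by parse_records gives the fields in the listed order. Note that this is a flat sort, so it is only safe for records without nested structure.
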
> I've used 'awk' also.  In fact, despite having been an SIL member, I
> much preferred the Unix tools to the SIL-developed tools like 'cc'.
> Personal preference (mostly :-)).
>
> But unless I'm mistaken, it's hard to use 'awk' for sorting if there
> is hierarchical structure (e.g. senses inside subentries,
> translations of example sentences under the example sentences).
> Hence my belief that XML tools would work better.  (I'm still on the
> learning curve with XML.)
>
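
To make the hierarchy point concrete, here is a minimal sketch of turning a flat list of fields into nested XML, after which fields can be reordered inside a group without detaching an example from its translation. The markers and nesting depths in GROUP_LEVEL are assumptions made for illustration, not a standard scheme, and the "-group" element names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical: markers that open a nested group, with their nesting depth.
GROUP_LEVEL = {"lx": 0, "se": 1, "sn": 2, "xv": 3}


def record_to_xml(fields):
    """Build an element tree from a flat list of (marker, value) pairs."""
    root = ET.Element("record")
    stack = [(-1, root)]  # (level, element) for each group currently open
    for marker, value in fields:
        level = GROUP_LEVEL.get(marker)
        if level is not None:
            # Close any open group at the same or a deeper level.
            while stack[-1][0] >= level:
                stack.pop()
            group = ET.SubElement(stack[-1][1], marker + "-group")
            stack.append((level, group))
        # Every field becomes a child of the innermost open group.
        child = ET.SubElement(stack[-1][1], marker)
        child.text = value
    return root
```

ET.tostring(record_to_xml(fields), encoding="unicode") then gives the record as an XML string that other tools can process.
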
> > A lesson I learned from Bob Hsu long ago...
>
> I have a printout of a draft dated 1994 by Bob on my desk,
> entitled "Methods of Language Data Processing", which I stole from
> Bill Poser.   Despite the fact that several bridges have been gone
> under by lots of water, Hsu's document still seems useful.  Did he
> ever publish it?
>
> > I looked at MDF, but it didn't seem very well suited to
> > Siouan languages.
> > I think you need something more powerful than a list of
> > atomic fieldnames to produce a one-size-fits-all scheme
> > of standard fields.
>
> Could you elaborate on this (probably to this list)?  One of the
> things I've been involved in (with colleagues in SIL) is one-size-
> fits-all schemes of standard _structures_ (with emphasis on
> hierarchical structure) for linguistics (morphology in particular,
> but lexical modeling too).  Lots of issues--the models tend to become
> baroque if they're to handle everything from Thai to Inuit.  And you
> want to stay as theory neutral as possible.  But those of us working
> on the models have a limited range of language backgrounds, so I'm
> always looking for languages that the models don't fit.
>
> > My special bugbear in this line,
> > right after getting the lexicographers to know and love
> > standardization, and preventing them from substituting
> > the "formatted for publication" version for the "formatted
> > for data management" version, was recognizing that record
> > and field structure for the latter was different
> > from the record and field structure for the former.
>
> Amen, brother, preach it!
>
> > I suppose today the publication format would be XML.
>
> I guess I would say that archival storage format would be XML;
> publication format (in the sense of something people would look at,
> or something an NLP program would use) would be some transform of
> that.  That's probably what you're saying... At any rate, the hope is
> that if you can have a standard model (and possibly a standard XML
> schema), you can create standard tools that everyone (including the
> linguist who is not a programmer) can use.
>
> > Somewhat along these lines I also used to use an SIL tool
> > that printed a census of the field names in an SF database.
>
> Easy to do with standard Unix routines.  Sh5 also has a dialog box that
> does this (the "Database Type Properties" dialog), and stores the info
> in a separate file (which is how I prefer to work on it--the dialog box
> is just too limited).
>
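
A census of field names of the kind mentioned above takes only a few lines. The sketch below is in Python rather than the Unix tools referred to, and it assumes that a field marker is whatever follows a line-initial backslash up to the first whitespace.

```python
from collections import Counter


def marker_census(text):
    """Count how often each field marker occurs in standard-format text."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("\\"):
            parts = line[1:].split(None, 1)
            if parts:  # skip a bare backslash with no marker after it
                counts[parts[0]] += 1
    return counts
```

Sorting the result with counts.most_common() puts the rarest markers at the end of the list, which is where mistyped markers tend to show up.
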
> > In working with the Siouan Archives I set up tools to
> > produce censuses of the polygraphs used to encode the
> > characters in a 64 character computer character set, and
> > look at sequences of consonants.  This was intended to
> > help me locate errors and inconsistencies in the keypunching.
>
> By "polygraphs", you mean character n-grams, right?  I was thinking
> of mentioning something like this for the possible LREC paper.  It seems
> to me I've seen this idea used somewhere as a poor man's spelling
> checker, but I can't find any citations.  Anyone?
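
The n-gram census idea can be sketched as follows. This is an illustration of the general technique only, not a reconstruction of the Siouan Archives tools: the "#" word-boundary padding and the max_count threshold are assumptions, and on a small word list almost every n-gram will be rare.

```python
from collections import Counter


def ngrams(word, n=2):
    """Return the character n-grams of a word, padded with '#' at both edges."""
    padded = "#" + word + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_census(words, n=2):
    """Count every character n-gram across a list of words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


def suspicious_words(words, n=2, max_count=1):
    """List words containing an n-gram seen at most max_count times overall."""
    counts = ngram_census(words, n)
    return [w for w in words
            if any(counts[g] <= max_count for g in ngrams(w, n))]
```

Words flagged this way are only candidates for checking by hand; a rare sequence may be a keying error or simply a rare but legitimate form.
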