[Lexicog] Re: Kirrkirr and Shoebox/Toolbox

Mike Maxwell maxwell at LDC.UPENN.EDU
Sat Aug 14 02:42:11 UTC 2004


Christopher Manning wrote:

> This is an area that I wish I knew slightly better than I do, but let me
> nevertheless offer a few thoughts.  Using a relational database is a
> passable fit if your XML data is "data-centric" XML (something like
> customer records written as XML) but if you have
> "document-centric" XML (like dictionaries, texts, etc.), it is an
> extremely poor fit.  My impression is that most programmers would
> disrecommend stuffing your XML into a relational database in these
> circumstances.

Well, that's definitely outside of my area of expertise.  Those who are
involved in the db part of the project (I was involved in the linguistic
modeling end of things for the most part) are smart, so I have to trust
what they say.

> ...While not at all involved
> in the projects, to my mind this is part of why FieldWorks has always
> seemed an overcomplex beast, whereas Shoebox is simple and clean!

Shoebox is simple, but I have never seen any clean and consistent data
in it.  Nor is it easy to change 'views' in Shoebox, e.g. between edit,
print, and html views; or e.g. having minor entries show up as
subentries vs. separate entries, etc.   I have written programs to find
and mark inconsistent data in Shoebox lexicons, and that worked fine on
one lexicon recently.  But the next lexicon has over 150 fields, and
there's no way of even wrapping my head around that number of fields,
much less making them consistent.  (Part of the complexity is that it's
a multi-dialectal lexicon, and there's no way in Shoebox of representing that
kind of variation except to use one SFM for each combination of 'slot'
and dialect.)
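
As a rough illustration of the kind of consistency check mentioned above
(a sketch only, not the program actually used), one could start by taking
an inventory of the field markers in a lexicon and flagging those that
occur suspiciously rarely.  The sample data, the function names, and the
assumption that every field starts on its own line with a backslash
marker are all hypothetical.

```python
import re
from collections import Counter

# Assumption: each field begins on its own line with a backslash marker,
# e.g. "\lx kata".  Continuation lines without a marker are ignored.
MARKER_RE = re.compile(r"^\\(\S+)")

def marker_counts(sfm_text):
    """Count how often each field marker occurs in the lexicon text."""
    counts = Counter()
    for line in sfm_text.splitlines():
        match = MARKER_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def rare_markers(counts, max_count=1):
    """Return markers used at most max_count times (possible typos)."""
    return sorted(m for m, n in counts.items() if n <= max_count)

# Hypothetical two-entry lexicon; "\gl" may be a slip for "\ge".
SAMPLE = "\\lx kata\n\\ps n\n\\ge house\n\n\\lx miru\n\\ps v\n\\gl see\n"

counts = marker_counts(SAMPLE)
print(rare_markers(counts))
```

With 150 or more markers, an inventory like this at least shows which
fields are in real use; it does nothing about the harder problem of
inconsistent content inside the fields.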

It's also not possible to substitute a different morphological parser
inside Shoebox, although that's perhaps an issue of implementation, not
necessity.

> I really don't think performance can be cited as a reason not to keep
> things as XML text in 2004.  E.g., it takes Kirrkirr 6 seconds to query
> a 10Mb XML file (I think extremely few fieldwork data sets are larger
> than this), running on a less than state of the art computer (1.1GHz
> Pentium 3M).  (Kirrkirr mainly works over a text XML file, but
> supplements text searching with a few indices.)

As I say, I keep hoping you're right :-).  Here's what I'm told by those
who know more about these things than I do: one query is one thing;
multiple queries are another.  If interlinear text files are linked to
the corresponding dictionary entries (so that changes to the dictionary
get reflected in your morpheme glossing, say), then you have hundreds,
maybe thousands of queries to display (or create) a single page of
interlinear text.  Maybe there are other ways to do it--maintaining data
integrity on multiple copies of data by synchronizing updates in the
background (in this case, reloading the parser with the revised lexicon
and re-parsing)--but it's not easy.  I did some of that in LinguaLinks,
where we wanted changes to the lexicon to be immediately reflected in
the morphological parsers, and it's tricky.  There may be work in
computer science circles on this sort of thing; I'm just ignorant of it.
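
To make the "many queries per page" point concrete, here is a minimal
sketch, assuming a toy XML lexicon whose element names (entry, form,
gloss) and ids are invented for the example.  It parses the file once,
builds an in-memory index from entry id to gloss, and then answers each
morpheme lookup from that index instead of rescanning the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon; the element and attribute names are assumptions.
LEXICON_XML = """<lexicon>
  <entry id="e1"><form>kata</form><gloss>house</gloss></entry>
  <entry id="e2"><form>-ni</form><gloss>LOC</gloss></entry>
</lexicon>"""

def build_gloss_index(xml_text):
    """Parse the lexicon once and map each entry id to its gloss."""
    root = ET.fromstring(xml_text)
    return {e.get("id"): e.findtext("gloss") for e in root.iter("entry")}

def gloss_line(morpheme_ids, index):
    """Look up the gloss for each linked morpheme; '???' if missing."""
    return [index.get(mid, "???") for mid in morpheme_ids]

index = build_gloss_index(LEXICON_XML)
# One line of interlinear text, stored as links to lexicon entries.
print(gloss_line(["e1", "e2"], index))
```

The lookups are cheap, but the index is a second copy of the data: when
a dictionary entry changes, the index has to be rebuilt or updated, which
is the synchronization problem described above.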

> I think the real reason to not want to have just a text file is the
> traditional database advantages of things like allowing concurrent
> updates, doing versioning and logging, powerful general query languages,
> etc.

Yes, there's that too!

> Between these two worlds is the world of "native XML databases", which
> includes both commercial products like Tamino:
>
>   http://www2.softwareag.com/Corporate/products/tamino/default.asp
>
> and open source efforts like eXist:
>
>   http://exist.sourceforge.net/
>
> I think that really they might be the right technology in 2004 (though
> this is where I wish I knew a bit more than I do...).

Might be, but I'm afraid I'm even less knowledgeable here than you are...
--
	Mike Maxwell
	Linguistic Data Consortium
	maxwell at ldc.upenn.edu


