[Lexicog] dictionary software

Mike Maxwell maxwell at LDC.UPENN.EDU
Fri Mar 19 03:53:58 UTC 2004


Ron Moe wrote:
> A relational database would attempt to model the multidimensional web
> of relationships in the mental dictionary.
> ...
> The LinguaLinks program attempted to follow this model with a fair
> degree of success.

Technically, LL was an object-oriented database, which is not quite the same
thing as a relational database.  The two are similar with respect to the
notion that "one piece of data is represented only once."  (The technical
term is "normalization".)  To take a simple example, the part of speech
"Noun" is only represented once in a LL database; when you want to say that
"dog" and "cat" are both nouns, what you are actually doing in LL is setting
a sort of pointer from the POS field of these two lexemes to the "Noun" POS.
This is unlike in Shoebox, where the string "Noun" appears in every noun
lexeme.
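
To make the contrast concrete, here is a minimal sketch in Python. It is not
how LL was actually implemented; the class names, field names, and record
layout are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartOfSpeech:
    name: str  # stored exactly once per part of speech


@dataclass
class Lexeme:
    form: str
    pos: PartOfSpeech  # a reference to the shared object, not a copy of the string


noun = PartOfSpeech("Noun")
dog = Lexeme("dog", noun)
cat = Lexeme("cat", noun)

# Both lexemes point at the very same object.
same_object = dog.pos is cat.pos

# Renaming the part of speech once changes it for every lexeme that points to it.
noun.name = "Nominal"
names_after_rename = [dog.pos.name, cat.pos.name]

# Flat records (hypothetical layout) repeat the string in every entry,
# so nothing stops the copies from drifting apart.
flat_records = [
    {"lexeme": "dog", "pos": "Noun"},
    {"lexeme": "cat", "pos": "noun"},
]
distinct_labels = {record["pos"] for record in flat_records}
```

In the flat records, `distinct_labels` ends up holding two different spellings
of what was meant to be one category.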

One advantage of representing each piece of data only once is that you can
avoid having typos and similar inconsistencies (which are abundant in every
Shoebox database I've ever seen: "Noun", "noun", "N", "n"...).  Another
advantage is that you (or at least the programmer) can define multiple views
of the same piece of data.  So for this Noun POS, one view might show it as
the string "Noun", another might show it as an italicized "n." (with the
period automagically supplied), another as "Sustantivo" (Spanish for
"Noun"), another as a pair of square brackets with a subscripted "N" on the
right-hand bracket (a view which I used in one of the morphological parsers
that came with LL), etc.
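
A rough sketch of the "multiple views" idea, again in Python and with invented
names (the fields `abbreviation` and `spanish` and all four view functions are
assumptions, not anything taken from LL). Plain strings cannot carry italics
or subscripts, so those are only noted in comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartOfSpeech:
    name: str
    abbreviation: str
    spanish: str


def full_view(pos: PartOfSpeech) -> str:
    return pos.name


def abbreviated_view(pos: PartOfSpeech) -> str:
    # The view supplies the period; a real display layer would also italicize.
    return pos.abbreviation + "."


def spanish_view(pos: PartOfSpeech) -> str:
    return pos.spanish


def bracket_view(pos: PartOfSpeech, content: str) -> str:
    # "_N" stands in for a subscripted N on the right-hand bracket.
    return f"[{content}]_{pos.abbreviation.upper()}"


NOUN = PartOfSpeech(name="Noun", abbreviation="n", spanish="Sustantivo")

views = [
    full_view(NOUN),
    abbreviated_view(NOUN),
    spanish_view(NOUN),
    bracket_view(NOUN, "dog"),
]
```

All four strings are computed from the one stored `NOUN` object; none of them
is stored in the lexical entries themselves.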

Relational databases and object-oriented databases (OODBs) differ in several
ways, which involve the way that links are represented internally.  That
level of difference is probably not too relevant to this list.  In fact an
OODB can be (and often is) stored as a relational DB.
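
For the curious, here is one way the same "pointer" could look when stored
relationally, sketched with Python's built-in sqlite3 module. The table and
column names are invented for the example and say nothing about how LL or
FieldWorks actually store their data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pos (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE lexeme (
        id     INTEGER PRIMARY KEY,
        form   TEXT NOT NULL,
        pos_id INTEGER NOT NULL REFERENCES pos(id)  -- the "pointer" to a POS row
    );
    INSERT INTO pos (id, name) VALUES (1, 'Noun');
    INSERT INTO lexeme (form, pos_id) VALUES ('dog', 1), ('cat', 1);
    """
)

# The string 'Noun' is stored in one row; a join recovers it for each lexeme.
rows = conn.execute(
    """
    SELECT lexeme.form, pos.name
    FROM lexeme JOIN pos ON pos.id = lexeme.pos_id
    ORDER BY lexeme.form
    """
).fetchall()
```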

SIL's more recent FieldWorks program is also an OODB.

> One problem they encountered was that the number
> of links grew to the point where the program slowed down to a snail's
> pace. It is my understanding that it takes a very fast computer to
> run LinguaLinks on a large database.

True, and with lots of memory.  But I believe both these points have become
moot in recent years, now that computers are fast :-).

> One advantage of LinguaLinks is
> that each piece of information is only entered once. If you changed
> your orthography from 'river' to 'rivr', you would change the word
> once, and it would be correct everywhere the word "occurred" in the
> database.

As I described (or attempted to describe) above, this is the goal.  The problem is
defining the extent to which this normalization goes.  A lot of effort went
into deciding what to normalize in LL, and more since then for FieldWorks.
Orthography changes can actually be more difficult for normalization than
some of the other sorts of changes, if the language has any morphology.  So
for example, if you change 'river' to 'rivr', you'll probably want to change
'rivers' and 'riverine' (although maybe not).  Essentially, the program
can't know what you want to do; it can only point you to all the places in
your morphologically analyzed text and other lexical entries where the lexeme
'river' appears.  And if you weren't able to parse 'riverine' before, the
program doesn't (can't) know that it contains 'river' (whereas the word
'driver' does not).  In sum, morphology can mess you up...
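
The difference between following links and matching strings can be sketched
like this. The data layout is hypothetical (a word's analysis is a list of
lexeme IDs, or None if the word was never parsed) and is only meant to
illustrate the point, not to mirror LL's internals.

```python
# Hypothetical lexicon: lexeme ID -> citation form.
lexicon = {1: "river", 2: "-s", 3: "drive", 4: "-er"}

# Hypothetical analyzed text: word -> list of lexeme IDs, or None if unparsed.
analyses = {
    "rivers": [1, 2],
    "driver": [3, 4],
    "riverine": None,  # never successfully parsed
}


def words_linked_to(lexeme_id):
    """Words whose stored analysis points at the given lexeme."""
    return sorted(
        word for word, parts in analyses.items() if parts and lexeme_id in parts
    )


def words_containing(substring):
    """Naive string search, which knows nothing about morphology."""
    return sorted(word for word in analyses if substring in word)


by_link = words_linked_to(1)            # misses the unparsed 'riverine'
by_string = words_containing("river")   # wrongly includes 'driver'
```

Neither list is the one you want: the links miss the unparsed word, and the
string search drags in an unrelated one.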

(Of course, there are other sorts of orthography changes, like where you
want to change "qu" to "k", along with "c" except when it appears before "i"
or "e". Normalization in LL doesn't extend to the level of individual
characters, so that kind of change has to be made some other way.)
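
A character-level change of that kind is typically done with a rewrite rule
outside the database. A minimal sketch in Python, assuming lowercase input,
assuming "qu" should be handled before "c", and ignoring digraphs such as
"ch" (the function name and the test strings are made up):

```python
import re


def respell(word: str) -> str:
    """Rewrite 'qu' as 'k', and 'c' as 'k' unless followed by 'i' or 'e'."""
    word = word.replace("qu", "k")
    # Negative lookahead: match 'c' only when the next character is not i or e.
    return re.sub(r"c(?![ie])", "k", word)


examples = {w: respell(w) for w in ["quaca", "cina", "cuqui", "ce"]}
```

A real orthography change would need the rule checked against the language's
actual spelling conventions before being run over a whole lexicon.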

I should also point out that LL has been accused of being inflexible, and
that is to some extent true.  That is the boundary and the price of
normalization. The normalization was designed in when the database was
built. In other words, you must choose, but choose wisely; since that is not
something that would be easy for most end users to do, it was done by the
designers.  (It's not easy for the designers to do--I speak from
experience.)

OTOH, I've seen some bad choices for the fields used in Shoebox lexical
databases; just because it's flexible doesn't mean you won't shoot yourself
in the foot with that flexibility.

    Mike Maxwell
    Linguistic Data Consortium
    maxwell at ldc.upenn.edu


