[Lexicog] Sorting

Mike Maxwell maxwell at LDC.UPENN.EDU
Mon Mar 22 18:54:49 UTC 2004


Koontz John E wrote:
> This would be necessary if some such transformations were arbitrary,
> but if the folding you mention is always done for an-, it would be
> nice to have that rule-based as well.  I don't know if anything
> supports that kind of rule, though I think it would be a very useful
> thing to have along with being able to define collating orders,
> internally ordered equivalence classes, invisible characters, etc.

I haven't followed this discussion closely enough to know just what sort of
rule you need here, but it would be hard to beat the sorts of finite-state
rules available in the Xerox finite-state toolkit.  It comes on a CD
with the following book:

        Beesley, Kenneth R.; and Lauri Karttunen. 2003.
        Finite State Morphology: CSLI Studies in
        Computational Linguistics. Chicago: University of
        Chicago Press.

It costs about $40.  The software runs under (newer versions of) Windows,
Linux, Solaris, and Mac OS X.

> ...One could imagine using a regular
> expression to define a folding point, but it might be more practical
> in many cases to be able to define an invisible character (space, #,
> etc.) that marks a folding point.

Either way would work, if I understand what you're doing.  Of course, since
it's finite state, there are limits to what you can do.  I still use my
toaster to make toast, but for everything else, the Xerox tool is hard to
beat.
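
To make the two options concrete, here is a minimal sketch in plain Python
(not the Xerox toolkit). It assumes "#" is the invisible fold marker and that
a fold simply moves the material before the fold point behind the rest of the
form for sorting purposes. The function names, the default prefix pattern,
and every word in the usage note except iga#se are made up for illustration.

```python
import re

FOLD_MARK = "#"  # assumed marker character; any otherwise unused symbol would do


def key_from_marker(form):
    """Sort key for a form with a hand-placed fold marker.

    The text after the last marker is the primary key and the text
    before it (with any other markers stripped) is the secondary key.
    A form without a marker sorts on the whole form.
    """
    head, _, tail = form.rpartition(FOLD_MARK)
    return (tail, head.replace(FOLD_MARK, ""))


def key_from_pattern(form, prefix_pattern=r"an"):
    """Sort key for a rule-based fold: the fold point is wherever a
    regular expression for the prefix stops matching.

    The default pattern is only a placeholder for whatever rule the
    language actually needs.
    """
    m = re.match(prefix_pattern, form)
    if m and m.end() < len(form):
        return (form[m.end():], m.group(0))
    return (form, "")
```

With these definitions, sorted(["iga#se", "sabe", "se", "an#ta"],
key=key_from_marker) puts iga#se directly after se, and
key_from_pattern("anta") gives the same key as key_from_marker("an#ta").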

> Marking multiple fold points would facilitate including parts of
> entries in different places for different purposes.  You might even
> want flavors of foldpoints, e.g., "root", "instrumental", etc.  This
> amounts to tagged morphological parsing, or making aspects of such a
> parsing accessible to the sorting algorithm.  (So perhaps this is not
> the most elegant way to do some of this?)

Unless you have a very unambiguous morphology, I would guess that you would
either have to do morphological parsing, or hand-code all your fold points.
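
If you do hand-code them, tagged fold points might look something like the
sketch below. The {tag} notation, the tag names, and the segmentation of the
example form are all assumptions made for illustration, not a claim about any
real analysis or any existing tool.

```python
import re

# Hypothetical notation: a tag in braces marks a fold point, e.g.
# "i{instrumental}ga{root}se" has two fold points in the form "igase".
TAG = re.compile(r"\{(\w+)\}")


def parse_fold_points(coded):
    """Return the plain form and a dict mapping each tag to its offset
    in the plain form."""
    pieces = []
    points = {}
    pos = 0      # position in the coded string
    length = 0   # length of the plain form built so far
    for m in TAG.finditer(coded):
        chunk = coded[pos:m.start()]
        pieces.append(chunk)
        length += len(chunk)
        points[m.group(1)] = length
        pos = m.end()
    pieces.append(coded[pos:])
    return "".join(pieces), points


def sort_key(coded, tag):
    """Fold the form at the fold point carrying the given tag.
    Forms without that tag sort on the whole plain form."""
    plain, points = parse_fold_points(coded)
    offset = points.get(tag, 0)
    return (plain[offset:], plain[:offset])
```

The same coded entry can then be filed in different places for different
purposes: sort_key(entry, "root") and sort_key(entry, "instrumental") give
different keys for the same form.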

> An important practical point to keep in mind here is that this marking
> must be invisible to searching, at least when that is desirable.  If
> you search for gase you don't want the search to fail because the
> form is encoded as iga#se.

Given a Shoebox-type database, you could put the foldable forms with these
special characters in a separate field (initialized from the lexeme field).
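
A rough sketch of that arrangement, with each record modelled as a Python
dict rather than a real Shoebox file. The field name "fold" is hypothetical,
"lx" stands in for the lexeme field, and "#" is again the assumed fold
marker.

```python
FOLD_MARK = "#"  # assumed fold marker


def init_fold_field(record):
    """Give the record a 'fold' field, copied from the lexeme field
    unless a hand-edited value is already present."""
    record.setdefault("fold", record["lx"])
    return record


def search(records, query):
    """Substring search on the lexeme field only, so fold markers
    never get in the way of a match."""
    return [r for r in records if query in r["lx"]]


def sort_key(record):
    """Sort on the fold field: text after the marker first, then the
    text before it."""
    head, _, tail = record["fold"].rpartition(FOLD_MARK)
    return (tail, head)
```

Searching for gase then matches a record whose lexeme field is igase even
though its fold field holds iga#se, while sorted(records, key=sort_key)
still files that record under se.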

    Mike Maxwell
    Linguistic Data Consortium
    maxwell at ldc.upenn.edu


