[Lexicog] Sorting
Mike Maxwell
maxwell at LDC.UPENN.EDU
Mon Mar 22 18:54:49 UTC 2004
Koontz John E wrote:
> This would be necessary if some such transformations were arbitrary,
> but if the folding you mention is always done for an-, it would be
> nice to have that rule-based as well. I don't know if anything
> supports that kind of rule, though I think it would be a very useful
> thing to have along with being able to define collating orders,
> internally ordered equivalence classes, invisible characters, etc.
I haven't followed this discussion closely enough to know just what sort of
rule you need here, but it would be hard to beat the sorts of finite-state
based rules available in the Xerox finite state toolkit. This comes as a CD
with the following book:
Beesley, Kenneth R.; and Lauri Karttunen. 2003.
Finite State Morphology: CSLI Studies in
Computational Linguistics. Chicago: University of
Chicago Press.
It costs about $40. The software runs under (newer versions of) Windows,
Linux, Solaris, and Mac OS X.
> ...One could imagine using a regular
> expression to define a folding point, but it might be more practical
> in many cases to be able to define an invisible character (space, #,
> etc.) that marks a folding point.
Either way would work, if I understand what you're doing. Of course, since
it's finite state, there are limits to what you can do. I still use my
toaster to make toast, but for everything else, the Xerox tool is hard to
beat.
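
To make the invisible-character option concrete, here is a minimal sketch in plain Python (not the Xerox toolkit, whose rule syntax is not shown here). It assumes "#" is the fold marker, as in the iga#se example elsewhere in this thread, and that an entry should be filed under whatever follows the first marker. The function name fold_key and every form other than iga#se are made up for illustration.

```python
FOLD = "#"  # assumed fold-point marker, as in the iga#se example


def fold_key(form, marker=FOLD):
    """Build a sort key that files a form under the part after its first marker.

    Forms without a marker sort on the whole form. The folded-away prefix is
    kept as a secondary key so that a bare stem sorts before its prefixed forms.
    """
    prefix, sep, rest = form.partition(marker)
    if not sep:
        # No fold point: sort on the form as written.
        return (form, "")
    # Drop any further markers so they stay invisible to the collation.
    return (rest.replace(marker, ""), prefix)


# "iga#se" comes from the thread; the other strings are invented test data.
entries = ["iga#se", "sabe", "gaxe", "a#gaxe"]
sorted_entries = sorted(entries, key=fold_key)
print(sorted_entries)
```

A regular expression could supply the split point instead of a stored marker; the key function would stay the same apart from how prefix and rest are found.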
> Marking multiple fold points would facilitate including parts of
> entries in different places for different purposes. You might even
> want flavors of foldpoints, e.g., "root", "instrumental", etc. This
> amounts to tagged morphological parsing, or making aspects of such a
> parsing accessible to the sorting algorithm. (So perhaps this is not
> the most elegant way to do some of this?)
Unless you have a very unambiguous morphology, I would guess that you would
either have to do morphological parsing, or hand-code all your fold points.
> An important practical point to keep in mind here is that this marking
> must be invisible to searching, at least when that is desirable. If
> you search for gase you don't want the search to fail because the
> form is encoded as iga#se.
Given a Shoebox-type database, you could put the foldable forms with these
special characters in a separate field (initialized from the lexeme field).
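
A rough sketch of that separate-field arrangement, again in plain Python rather than Shoebox itself. Records are modelled as dictionaries; "lx" stands in for the lexeme field and "fold" is a hypothetical name for the extra field. The form igase/iga#se is the one from the thread, the second record is invented.

```python
# Each record starts with only a lexeme field.
records = [
    {"lx": "igase"},
    {"lx": "sabe"},   # invented entry for illustration
]

# Initialize the fold field as a copy of the lexeme field.
for rec in records:
    rec.setdefault("fold", rec["lx"])

# Fold points are then added by hand (or by a parser) in the fold field only.
records[0]["fold"] = "iga#se"


def search(recs, query):
    """Substring search against the unmarked lexeme field."""
    return [rec for rec in recs if query in rec["lx"]]


hits = search(records, "gase")
print([rec["lx"] for rec in hits])
```

Because searching only ever reads the lexeme field, the markers cannot make a search fail, while sorting can read the fold field.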
Mike Maxwell
Linguistic Data Consortium
maxwell at ldc.upenn.edu