[Lexicog] Sorting

Koontz John E john.koontz at COLORADO.EDU
Mon Mar 22 18:08:43 UTC 2004


On Mon, 22 Mar 2004, David Frank wrote:
> It looks like what you called a sorting handle is what I was calling an
> alphabetic key when applied to a dictionary record.

Precisely.  I think I missed the implication of what you said.

> My practice and my proposal was to keep an alphabetic key as part of each
> entry, but it would be a nonprinting field and used only for sorting.

This works well if the program lets you sort on auxiliary keys.  I think
some programs more or less merge the concepts of sorting key and headword.
I forget how Shoebox handles this.
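
To make the auxiliary-key idea concrete, here is a minimal Python sketch.
The record layout and the field names "headword" and "sortkey" are
assumptions for illustration, not the format of Shoebox or any other
program, and "baba" is just filler data.

```python
# Hypothetical records: "headword" is what gets printed, "sortkey" is a
# non-printing field used only for ordering.
entries = [
    {"headword": "bagay", "sortkey": "BAGAY"},
    {"headword": "an bagay", "sortkey": "BAGAY AN"},
    {"headword": "baba", "sortkey": "BABA"},  # filler entry
]


def sort_entries(records):
    """Order records by the auxiliary key, leaving headwords untouched."""
    return sorted(records, key=lambda r: r["sortkey"])


ordered_headwords = [r["headword"] for r in sort_entries(entries)]
```

With these keys "an bagay" lands directly after "bagay", while the printed
headword keeps its original form.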

> An advantage of keeping it as part of the entry is that you could, for
> example, manually convert "an bagay" to BAGAY AN if you want it to sort
> after "bagay", but keep the default order in other cases, depending on
> which word in the phrase you want to use as the basis for sorting.

Manual keys would be necessary if some such transformations were arbitrary,
but if the folding you mention is always done for an-, it would be nice to
have that rule-based as well.  I don't know if anything supports that kind
of rule, though I think it would be a very useful thing to have along with
being able to define collating orders, internally ordered equivalence
classes, invisible characters, etc.
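
As a rough sketch of what a rule-based version might look like, the
following Python function derives the key from the headword instead of
storing it by hand.  The function name fold_key and the particles
parameter are made up for this example; a real tool would presumably also
allow a manual override for the arbitrary cases.

```python
def fold_key(headword, particles=("an",)):
    """Build a sort key, moving a leading particle to the end of the phrase.

    "particles" is an assumed, user-supplied list of words to fold.
    """
    words = headword.split()
    # Only fold when something follows the particle.
    if len(words) > 1 and words[0].lower() in particles:
        words = words[1:] + words[:1]
    return " ".join(words).upper()


headwords = ["an bagay", "bagay", "baba"]  # "baba" is filler data
ordered = sorted(headwords, key=fold_key)
```

Here fold_key("an bagay") yields "BAGAY AN", matching the manually entered
key in the quoted example.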

I've done things with program-based folding, e.g., extracting Siouan
instrumental roots by extracting all forms which had the right syllables
in the right place in the initial prefixes, "folding" them right after
that, and sorting together the resulting list.  E.g., i'gase => se, i'ga-,
in which ga is the instrumental, which can be preceded by a locative i-,
too, so that both ga- and i'ga- are "instrumental sequences."  One could
imagine using a regular expression to define a folding point, but it might
be more practical in many cases to be able to define an invisible
character (space, #, etc.) that marks a folding point.
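
A minimal Python sketch of the marker approach, assuming "#" is the
folding character and that the folded form is written "rest, prefix-" as
in the example above.  The function name is invented for illustration.

```python
FOLD = "#"  # assumed folding character; any otherwise unused symbol would do


def fold_at_marker(form, marker=FOLD):
    """Rotate a form at its first fold marker: "i'ga#se" -> "se, i'ga-"."""
    prefix, sep, rest = form.partition(marker)
    if not sep:
        return form  # no fold point marked, keep the form as it is
    return f"{rest}, {prefix}-"


forms = ["i'ga#se", "ga#se"]
folded = sorted(fold_at_marker(f) for f in forms)
```

Sorting the folded strings brings forms that share a root together
regardless of which instrumental sequence precedes it.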

Marking multiple fold points would facilitate including parts of entries
in different places for different purposes.  You might even want flavors
of foldpoints, e.g., "root", "instrumental", etc.  This amounts to tagged
morphological parsing, or making aspects of such a parsing accessible to
the sorting algorithm.  (So perhaps this is not the most elegant way to do
some of this?)
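
One way to picture "flavors" of fold points is sketched below in Python.
The {tag} notation and the tag names "inst" and "root" are purely
hypothetical, chosen only to show the same entry being folded at different
places for different purposes.

```python
import re

TAG = re.compile(r"\{(\w+)\}")  # hypothetical notation: {inst}, {root}, ...


def fold_on(form, tag):
    """Fold a tagged form at the fold point named by "tag".

    All other tags are dropped from the result.
    """
    head, sep, tail = form.partition("{" + tag + "}")
    if not sep:
        return TAG.sub("", form)  # this flavor of fold point is not marked
    head, tail = TAG.sub("", head), TAG.sub("", tail)
    return f"{tail}, {head}-" if head else tail


entry = "i'{inst}ga{root}se"
by_root = fold_on(entry, "root")
by_instrumental = fold_on(entry, "inst")
```

Folding on "root" gives "se, i'ga-", while folding on "inst" gives
"gase, i'-", so one encoded entry can feed two differently sorted lists.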

An important practical point to keep in mind here is that this marking
must be invisible to searching, at least when that is desirable.  If you
search for gase you don't want the search to fail because the form is
encoded as iga#se.
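
A simple way to get that behavior is to strip the markers from both the
query and the stored forms before comparing.  The Python below is only a
sketch; the function names and the choice of "#" as the sole marker are
assumptions.

```python
def strip_markers(form, markers="#"):
    """Remove fold markers so that they are invisible to comparison."""
    return "".join(ch for ch in form if ch not in markers)


def search(query, forms, markers="#"):
    """Return the stored forms that contain the query, ignoring markers."""
    q = strip_markers(query, markers)
    return [f for f in forms if q in strip_markers(f, markers)]


hits = search("gase", ["iga#se", "ga#se", "se"])
```

The stored forms are returned with their markers intact, so the sorting
information is not lost by searching.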

This approach can be used to handle things like sorting Gaelic surnames,
with O and M(a)c disregarded in the sort.  I gather this is common in some
applications, like phone books.
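
For the surname case the fold point can be found by rule rather than
marked by hand.  The Python sketch below uses a regular expression of my
own devising, not one taken from any phone-book standard: it drops a
leading O', Mc or Mac only when a capital letter follows, so that a name
like Macy is left alone.  The sample names are arbitrary.

```python
import re

# Assumed rule: optional O' / O / Mc / Mac prefix, followed by a capital.
PREFIX = re.compile(r"^(?:O[' ]\s?|Ma?c\s?)(?=[A-Z])")


def surname_key(name):
    """Sort key that disregards O and M(a)c, with the full name as tie-break."""
    return (PREFIX.sub("", name).lower(), name.lower())


names = ["O'Brien", "MacDonald", "Burke", "McCarthy", "Daly", "Macy"]
ordered_names = sorted(names, key=surname_key)
```

The capital-letter check is a crude heuristic and will misfile surnames
written as Macdonald, which is one reason an explicit invisible marker may
be the more practical option.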



