[Lexicog] Sorting

Peter Kirk peterkirk at QAYA.ORG
Sun Mar 21 19:45:00 UTC 2004


On 21/03/2004 11:24, Koontz John E wrote:

> On Sat, 20 Mar 2004, Mike Maxwell wrote:
> > This kind of sort program is dependent on the order of the code points
> > in your encoding being standard.
>
> Cases where the collating order of a character set encoding matches the
> desired sorting order are probably more the exception than the rule, though
> character sets attempt within reason to match some sort of default order.
>
> There are probably various ways to handle sorting, but Bob Hsu used to
> discuss it in terms of sorting handles, which are transformations of the
> sorted elements into character strings for which the collating sequence
> does match the desired sorting order.  For example, if you want upper and
> lower case to be treated the same, convert upper case to lower case in the
> handle.  If you want a-acute to be treated like a, convert a-acute to a in
> the handle.  If you want a-acute to be treated like a, except that where
> two words differ in that respect, a-acute follows a, then convert a-acute
> to a, but append a 1 to the end of the word for each a and a 2 for each
> a-acute.  If you want ch to be treated as a single letter following c, map
> a to a, b to b, c to c, ch to d, d to e, etc.
>
> Ideally the sorting program will generate these handles on the fly as it
> needs them, based on your sorting rules, but, if you don't have access to
> a clever sorting program you can always create the handles manually
> yourself and make sure the sorting program uses them to sort with rather
> than the nominal key.  You have to delete them from some kinds of output,
> of course.
>
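
The sorting handles described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Bob
Hsu's implementation: the alphabet (Latin letters plus "ch" after "c"),
the function name make_handle, and the NUL separator placed before the
appended digits are all choices made for this example.

```python
import unicodedata

# Hypothetical alphabet: "ch" counts as one letter that sorts after "c".
LETTERS = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")

# Each letter maps to one character whose code point reflects its rank:
# a->a, b->b, c->c, ch->d, d->e, and so on.
RANK_CHAR = {letter: chr(ord("a") + i) for i, letter in enumerate(LETTERS)}


def make_handle(word):
    """Turn a word into a string whose code-point order is the wanted order."""
    # Fold case so upper and lower case are treated the same.
    text = unicodedata.normalize("NFC", word).lower()
    primary = []   # letters, with a-acute folded into a
    tiebreak = []  # "1" per a, "2" per a-acute
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            primary.append(RANK_CHAR["ch"])
            i += 2
            continue
        ch = text[i]
        if ch == "á":
            primary.append(RANK_CHAR["a"])
            tiebreak.append("2")
        elif ch == "a":
            primary.append(RANK_CHAR["a"])
            tiebreak.append("1")
        else:
            # Characters outside the alphabet pass through unchanged
            # (an assumption of this sketch).
            primary.append(RANK_CHAR.get(ch, ch))
        i += 1
    # Assumption: a NUL separator keeps the digits from being compared
    # against letters of a longer word.
    return "".join(primary) + "\x00" + "".join(tiebreak)


words = ["chico", "dama", "cosa", "Curva", "mas", "más", "casa"]
print(sorted(words, key=make_handle))
# ['casa', 'cosa', 'Curva', 'chico', 'dama', 'mas', 'más']
```

The handles exist only inside the key function here, so nothing has to be
deleted from the output afterwards.
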
You all might be interested in the Unicode Collation Algorithm; see
http://www.unicode.org/unicode/reports/tr10/. It provides a framework
for sorting data in multiple languages, encoded in Unicode. There is
an implementation of it in IBM's ICU; see
http://oss.software.ibm.com/icu/userguide/Collate_Intro.html. ICU is
a code library to be called from a program, though, not a stand-alone
sorting program. The Unicode document also refers to Java
implementations of the collation algorithm, and there is a Perl module
at http://search.cpan.org/~jhi/perl-5.8.0/lib/Unicode/Collate.pm.
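
To give a rough feel for the multi-level comparison the collation
algorithm is built on, here is a small Python sketch using only the
standard unicodedata module. It is not the Unicode Collation Algorithm
itself: it has no weight table and no language-specific tailoring, and the
function name collation_key is made up for this example. It only shows
base letters being compared first, accents second, and case third.

```python
import unicodedata


def collation_key(word):
    """Three-level key: base letters, then accents, then case."""
    primary, secondary, tertiary = [], [], []
    # NFD splits an accented letter into base letter + combining mark(s).
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            # Attach the combining mark to the preceding base letter.
            secondary[-1] += ch
        else:
            primary.append(ch.casefold())
            secondary.append("")
            # Lower case sorts before upper case at the third level.
            tertiary.append(1 if ch.isupper() else 0)
    return ("".join(primary), tuple(secondary), tuple(tertiary))


words = ["Role", "rôle", "role", "roles", "rule"]
print(sorted(words, key=collation_key))
# ['role', 'Role', 'rôle', 'roles', 'rule']
```

For real data one of the implementations mentioned above is the better
choice, since they use the full weight tables from the Unicode document.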

--
Peter Kirk
peter at qaya.org (personal)
peterkirk at qaya.org (work)
http://www.qaya.org/



