[Lexicog] Sorting

Sun Mar 21 03:39:10 UTC 2004

Rudolph C Troike wrote:
> For anyone still able to access MS-DOS (if you have Windows 98 or
> earlier), DOS was able to do a simple alphabetic or numerical sort for
> material in ASCII. I found it very helpful in compiling my
> Bibliography of Bibliographies of the Languages of the World, after
> having investigated a number of other specialized programs.

This kind of sort program simply compares code points, so it depends on the
characters in your encoding being laid out in the order you want.  That is,
'a' comes before 'b', which comes before 'c' etc.  (Or if it's doing numeric
sorting, something analogous.)
Most such programs are also smart enough to be able to ignore upper/lower
case, if you ask them nicely.
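To make the point concrete, here is a minimal sketch in Python (my choice of language for illustration; the DOS sort command is not involved, and the word list is made up).  It contrasts a plain code-point sort with one that ignores case:

```python
words = ["banana", "Apple", "cherry", "apple", "Banana"]

# Plain code-point order: every upper-case ASCII letter precedes every
# lower-case one, so 'Banana' lands before 'apple'.
by_code_point = sorted(words)

# Case-insensitive order: compare the case-folded forms, and use the
# original spelling only to break ties.
ignoring_case = sorted(words, key=lambda w: (w.casefold(), w))

print(by_code_point)
print(ignoring_case)
```

The first list puts both capitalized words ahead of everything else; the second interleaves them with their lower-case counterparts.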

Unfortunately, this may only work for ASCII characters, and then only if you
adhere to a simple alphabet, without e.g. digraphs.  It used to be the case,
for example, that in Spanish, the digraph 'ch' was considered to be a
separate letter of the alphabet, to be sorted after all other instances of
the letter 'c' (including 'cu').  Most simple sort programs could not handle
that.  (I heard a while back that Spanish no longer treats 'ch' (or 'll')
like that; I have no idea whether the difficulty in sorting that way using
simple-minded computer programs had anything to do with that change.)
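For what it's worth, handling digraphs is not hard once the program splits words into letters of the alphabet rather than characters.  The following is only a sketch under my own assumptions: the alphabet list, the names RANK and spanish_key, and the sample words are all invented for illustration, and accented vowels are left out to keep it short.

```python
# Traditional Spanish alphabet, with 'ch' and 'll' as letters in their own
# right.  (Illustrative only; accented vowels are omitted.)
TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL_SPANISH)}


def spanish_key(word):
    """Turn a word into a list of letter ranks, matching digraphs first."""
    word = word.lower()
    ranks = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            ranks.append(RANK[pair])
            i += 2
        else:
            # Characters outside the alphabet sort after all known letters.
            ranks.append(RANK.get(word[i], len(RANK)))
            i += 1
    return ranks


words = ["chico", "cuna", "dama", "llama", "luz", "casa"]
simple_order = sorted(words)
traditional_order = sorted(words, key=spanish_key)
print(simple_order)
print(traditional_order)
```

With the simple sort, 'chico' falls between 'casa' and 'cuna'; with the digraph-aware key it follows 'cuna', and 'llama' follows 'luz'.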

Many simple sorting programs also cannot handle sorting accented characters
together with (or immediately after) the corresponding unaccented
characters--you may get acute accented lower case 'a' sorting after 'z' (and
accented upper case 'A', if you use that, sorting somewhere else).
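Here is a small Python sketch of both the problem and one possible workaround.  The workaround (decompose each character and drop the combining marks before comparing) is my own assumption about what "sorting together" should mean; it would be wrong for a language where an accented letter counts as a separate letter, since it would, for instance, fold 'ñ' into 'n'.  The function names and word list are made up.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_aware_key(word):
    # Primary key: unaccented, case-folded form.  Secondary key: the word
    # itself, so that otherwise identical words keep a fixed relative order.
    return (strip_accents(word).casefold(), word)


# Force precomposed (NFC) characters, which is how a legacy 8-bit encoding
# such as Latin-1 would store these words.
words = [unicodedata.normalize("NFC", w)
         for w in ["zorro", "árbol", "arco", "Ávila", "avena"]]

code_point_order = sorted(words)
accent_aware_order = sorted(words, key=accent_aware_key)
print(code_point_order)
print(accent_aware_order)
```

In the code-point order both accented words come after 'zorro', with the upper-case one first; with the accent-aware key they sort in among the other 'a' words.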

I don't know what kind of attention to sorting was paid by the Unicode
people.  My suspicion is that _standard_ orthographies (such as Hindi, or
Tamil, or...), where the writing system in question occupies a discrete
block of characters, will sort correctly.  OTOH, a simple sort
program (like the one in Win2k's cmd prompt) may throw up its hands when you
feed it Unicode, for all I know.

Some work has been done under the rubric of 'internationalization' in
allowing for sort orders to be defined on a language-particular basis.
Perhaps someone on the list can enlighten us?

Finally, if your alphabet doesn't follow one of the standards (as is likely
to be true if you're creating an orthography for a previously unwritten
language), you may be even further up the creek when it comes to using
standard sort routines.  All of which is why most lexicon programs
(including Shoebox and LinguaLinks) allow you to define language-particular
sort orders.  In addition to allowing you to sort accented and unaccented
characters the way you want them, they generally allow you to ignore
specific characters (like hyphens or apostrophes or...) for sorting, define
your own upper/lower case correspondences, etc.
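To give a feel for what such a user-defined sort order involves, here is a rough Python sketch.  It is emphatically not how Shoebox or LinguaLinks do it; the function make_sort_key, its parameters, and the toy orthography (in which 'ng' is a letter sorting after 'n', and hyphens and apostrophes are ignored) are all hypothetical.

```python
def make_sort_key(order, ignore=""):
    """Build a key function from a user-defined sort order.

    order  -- list of groups; all spellings in one group sort alike at the
              primary level (e.g. upper and lower case of the same letter)
    ignore -- characters skipped entirely when comparing
    """
    rank = {}
    for position, group in enumerate(order):
        for spelling in group:
            rank[spelling] = position
    longest = max(len(s) for s in rank)

    def key(word):
        ranks = []
        i = 0
        while i < len(word):
            if word[i] in ignore:
                i += 1
                continue
            # Try the longest spelling first so multigraphs win over
            # single letters.
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if len(chunk) == size and chunk in rank:
                    ranks.append(rank[chunk])
                    i += size
                    break
            else:
                # Characters with no defined position sort last.
                ranks.append(len(order))
                i += 1
        return (ranks, word)  # the word itself breaks ties

    return key


# Hypothetical orthography: 'ng' is a single letter that follows 'n'.
order = [["a", "A"], ["e", "E"], ["k", "K"], ["n", "N"],
         ["ng", "Ng"], ["o", "O"], ["t", "T"]]
toy_key = make_sort_key(order, ignore="-'")

words = ["nota", "ngato", "Kena", "ka-to", "k'ane", "ate"]
custom_order = sorted(words, key=toy_key)
print(custom_order)
```

Grouping spellings into one rank is what lets you declare your own upper/lower case correspondences, and the ignore string covers the hyphens and apostrophes.  A real lexicon program would presumably also want secondary distinctions (say, unaccented before accented), which this sketch leaves out.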

    Mike Maxwell
    Linguistic Data Consortium
    maxwell at ldc.upenn.edu
