[Corpora-List] Sorting upper-ASCII chars in Unix

Serge HEIDEN slh at ens-lsh.fr
Mon Nov 24 21:27:59 UTC 2003


Dear William,

| I have been trying to use the Unix sort function to sort files which
| contain upper-ASCII characters (i.e. ASCII code > 127) on a machine with
| locale, language and charset set to US English.  Lower-ASCII characters
| and some upper-ASCII characters sort fine, but some upper-ASCII
| characters (specifically some non-alphanumeric ones) are left in
| semi-random order.
|
| How should the relevant environmental variables be set to permit sorting
| files in straight ASCII order?

A typical Unix manual will tell you that "lines are ordered according to the
collating sequence of the current locale".
A locale is defined by a language AND a charset. For example, on my Unix box
I have:
- en_GB.ISO8859-1
- en_GB.ISO8859-15
- en_GB.ISO8859-15@euro
- ...
Each locale defines its own collating sequence (and a lot of other things).
A collating sequence defines how single character codes, or groups of
character codes, are ordered.
Suppose you select a locale which associates en_US (the American English
lexical collating sequence) with a charset containing codes above 127. The
question is then: "What collating order did the person who designed the
locale give to the character codes above 127?"
Said differently: "What collating order did he give to characters that are
USUALLY not used in that language?"
There are several possible answers to this question. One is that no specific
collating order has been defined for codes above 127 in the en_US.* locales,
which looks like what you have on your Unix box. The result is that the order
of those characters depends on their initial order in the input and on the sort
algorithm used (usually quicksort). You should check the locale definition on
your Unix box.
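
To make the "semi-random order" more concrete, here is a small Python sketch.
It is only a toy model and not how any real locale is implemented: the
functions toy_weight and toy_key are hypothetical names, and I simply assume
that every code above 127 gets the same collating weight. Characters that tie
in this way come out in whatever order they came in.

```python
# Toy model only: a hypothetical collation in which codes 0..127 have
# their own weight and every code above 127 shares one weight.
def toy_weight(ch):
    # ASCII codes collate by their own value; all other codes tie at 128.
    return min(ord(ch), 128)


def toy_key(line):
    # A line collates as the sequence of the weights of its characters.
    return [toy_weight(ch) for ch in line]


# U+00A7 is the section sign, U+00B6 is the pilcrow sign.
first = sorted(["\u00a7", "\u00b6", "a"], key=toy_key)
second = sorted(["\u00b6", "\u00a7", "a"], key=toy_key)

# Same three lines, two different results: the tied lines keep their
# input order because Python's sorted() is stable.
print(first)
print(second)
```

With an unstable algorithm such as quicksort, the tied lines could come out in
yet another order, which would look even more random.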

I propose four solutions:
- buy a Unix where locale design and definition are precisely documented (I don't know of any)
and pray for a coherent locale definition for codes above 127 in en_US;
- use a locale for a language other than en_US which USES the character codes
above 127 that you use AND does not use a collating sequence different from the English one.
For example, fr_FR uses characters up to 255 in the ISO-Latin1 charset;
- design your own locale: any Unix should help you to do so;
- use a sort implementation that doesn't use any locale library and knows how to deal
with your charset (8-bit, 16-bit, etc.).
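
As a rough sketch of the last solution, the following Python lines sort
strictly by character code, without consulting any locale. The function name
sort_bytewise and its charset parameter are my own assumptions for the
example, and I assume an 8-bit charset such as ISO-8859-1, where each
character is one byte.

```python
def sort_bytewise(lines, charset="iso-8859-1"):
    """Sort text lines by the byte values of their encoded form."""
    # Comparing bytes objects compares the numeric byte values, so no
    # locale collating sequence is involved.
    return sorted(lines, key=lambda line: line.encode(charset))


# U+00E9 is e-acute, U+00A7 is the section sign.
words = ["\u00e9t\u00e9", "zebra", "\u00a7 note", "Apple", "apple"]
result = sort_bytewise(words)
print(result)
```

For a whole file, reading the lines in binary mode and sorting them directly
should give the same kind of ordering without having to name the charset.
With many sort implementations, setting LC_ALL=C in the environment before
calling sort is reported to give this byte-value order too, but you should
check that on your own system.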

Cheers,

    [slh]

_____________________________________________________________________
Serge Heiden, slh at ens-lsh.fr, https://weblex.ens-lsh.fr
ENS-LSH/CNRS - ICAR UMR5191, Institut de Linguistique Française
15, parvis René Descartes 69342 Lyon BP7000 Cedex 07, tél. +33 4 37 37 63 12, fax. +33 4 37 37 62 65


