[Corpora-List] sorting OHG (non-ASCII) in PERL

Jan Strunk strunk at linguistics.ruhr-uni-bochum.de
Tue Feb 4 16:09:41 UTC 2003


Hi,

if you want it quick and dirty, you can define your own sorting routine for
Perl's sort function.
I wrote an example. You can use the subs "mysort" and "initialize" as they are in a
Perl program, provided you declare the two global variables @order and %sorthash.
@order should contain the exact ordering of letters (including both upper-case and lower-case letters).
%sorthash is needed by the two subs.
Then you need to call initialize() once before doing any sorting.
When you want to sort, you have to use "sort mysort @list".
For correct sorting it is very important that every character that ever occurs in anything
you want to sort is included in the ordering, i.e. the list @order.
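
One way to guard against a character missing from @order is to check the input before sorting. The following is only a minimal sketch: the helper name missing_chars and the sample word "zuo" are made up for illustration and are not part of the program in this post, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                                   # assumes this source file is saved as UTF-8

# Hypothetical helper (not part of the original program): returns the characters
# that occur in the given strings but are missing from the ordering.
sub missing_chars {
    my ($order, @strings) = @_;
    my %known = map { $_ => 1 } @$order;    # characters that have a place in the ordering
    my %missing;
    foreach my $string (@strings) {
        foreach my $char (split(//, $string)) {
            $missing{$char} = 1 unless $known{$char};
        }
    }
    my @result = sort keys %missing;
    return @result;
}

my @order   = ("a", "A", "â", "Â", "b", "e", "ê", "z");
my @strings = ("abe", "êza", "zuo");        # "u" and "o" are not in @order

binmode(STDOUT, ":encoding(UTF-8)");
my @missing = missing_chars(\@order, @strings);
print "Not in \@order: @missing\n" if @missing;
```

If the list that comes back is not empty, those characters would have to be added to @order at the right position before sorting.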

As I am not a real Perl hacker myself, there may well be a more efficient
way, and there may even be a bug in the program, but it seemed to work.

Best,

Jan Strunk
strunk at linguistics.ruhr-uni-bochum.de





An example is the following code:

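use strict;
use warnings;
use utf8;                                   # Assumes this file is saved as UTF-8, so that "â" is read as one character
binmode(STDOUT, ":encoding(UTF-8)");        # Print the non-ASCII characters as UTF-8

my @order=("a", "A", "â", "Â", "b", "e", "ê", "z");                       # Has to contain a list of all ordered characters

my %sorthash;                      # For quicker sorting the sub initialize() puts the list @order into a hash.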

my @strings=("a", "â", "e", "âbe", "êz", "abe", "êza");   # Things you want to sort.

initialize();   # Puts the ordering into a hash of the format ("a" => 1, "A" => 2, "â" => 3, "Â" => 4, ...)

my $string;
foreach $string (sort mysort @strings) {      # Normal way of sorting in perl, but sort now calls "mysort" for getting the right ordering
    print $string."\n";
}


sub mysort {                                             # Compares the two elements $a and $b that sort passes in
    my $word1=$a;
    my $word2=$b;

    return 0 if ($word1 eq $word2);

    my @word1=split(//, $word1);                         # Split both words into single characters
    my @word2=split(//, $word2);

    while ((@word1 > 0) and (@word2 > 0)) {
        my $char1=shift @word1;
        my $char2=shift @word2;

        # Compare the positions of the two characters in @order
        my $compare=($sorthash{$char1}<=>$sorthash{$char2});

        return $compare if ($compare != 0);
    }

    # One word is a prefix of the other: the shorter word comes first
    return scalar(@word1) <=> scalar(@word2);
}


sub initialize {
    my $i=1;
    my $entry;
    foreach $entry (@order) {
        $sorthash{$entry}=$i;
        $i++;
    }
}
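
A possibly more efficient variation, offered only as a sketch: instead of splitting both words on every comparison, each word can be turned into a sort key once, and the keys compared with the ordinary string comparison. The names %rank and sort_key are made up for this illustration. The sketch assumes the file is saved as UTF-8 and that @order has at most 65535 entries, because each rank is packed into 16 bits.

```perl
use strict;
use warnings;
use utf8;                                   # assumes this source file is saved as UTF-8

my @order = ("a", "A", "â", "Â", "b", "e", "ê", "z");

# Same idea as %sorthash above: "a" => 1, "A" => 2, "â" => 3, ...
my %rank;
@rank{@order} = (1 .. scalar(@order));

# Hypothetical helper: packs the ranks of a word's characters into a byte string
# (16 bits per character, big-endian), so that "cmp" on two keys compares the
# words rank by rank and puts a prefix before the longer word.
sub sort_key {
    my ($word) = @_;
    my @ranks;
    foreach my $char (split(//, $word)) {
        die "Character '$char' is not in \@order\n" unless exists $rank{$char};
        push @ranks, $rank{$char};
    }
    return pack("n*", @ranks);
}

my @strings = ("a", "â", "e", "âbe", "êz", "abe", "êza");

# Compute each key once, sort on the keys, then drop the keys again.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ sort_key($_), $_ ] } @strings;

binmode(STDOUT, ":encoding(UTF-8)");
print "$_\n" foreach @sorted;
```

For the sample list this should print a, abe, â, âbe, e, êz, êza, one per line. Unlike mysort, this version stops with an error when it meets a character that is not in @order.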


  ----- Original Message ----- 
  From: Henning Reetz 
  To: corpora at hd.uib.no 
  Sent: Tuesday, February 04, 2003 3:56 PM
  Subject: [Corpora-List] sorting OHG (non-ASCII) in PERL


  Hi,


  stupid question but perhaps the freaks can help me:


  we're building a database of Old High German words. Obviously, there are some characters that are not in ASCII (diacritics like stress marks ' and carets ^) and chars that do not follow the 'normal' sorting order (like 'uu' for 'w'). One possibility would be to recode these chars (e.g. get rid of the diacritics for sorting and put them back on in the output), but is there a more elegant and general way (e.g. in case one would like to have a long 'e' after the short 'e' etc.) so that one could use it for other scripts as well (UTF puts chars in an order that does not necessarily reflect the 'intuitive' sequence in a language)? Is there a module to tell PERL which sorting sequence one would like to use, or do I have to program it myself?


  Thanx for any hints.


  Henning Reetz