<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>sorting OHG (non-ASCII) in PERL</TITLE>
<META content="text/html; charset=iso-8859-1" http-equiv=Content-Type>
<STYLE type=text/css>BLOCKQUOTE {
PADDING-BOTTOM: 0px; PADDING-TOP: 0px
}
DL {
PADDING-BOTTOM: 0px; PADDING-TOP: 0px
}
UL {
PADDING-BOTTOM: 0px; PADDING-TOP: 0px
}
OL {
PADDING-BOTTOM: 0px; PADDING-TOP: 0px
}
LI {
PADDING-BOTTOM: 0px; PADDING-TOP: 0px
}
</STYLE>
<META content="MSHTML 5.00.3103.1000" name=GENERATOR></HEAD>
<BODY bgColor=#ffffff>
<DIV><FONT face=Arial size=2>Hi,</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>if you want it quick and dirty, you can define your
own sorting routine for the</FONT></DIV>
<DIV><FONT face=Arial size=2>perl sort function.</FONT></DIV>
<DIV><FONT face=Arial size=2>I wrote an example. You can use the subs "mysort"
and "initialize" as they are in a</FONT></DIV>
<DIV><FONT face=Arial size=2>Perl program, provided you also define the two global
variables @order and %sorthash.</FONT></DIV>
<DIV><FONT face=Arial size=2>@order should contain the exact ordering of letters
(including capitalized and non-capitalized letters).</FONT></DIV>
<DIV><FONT face=Arial size=2>%sorthash will be needed by the two
subs.</FONT></DIV>
<DIV><FONT face=Arial size=2>Then you need to call initialize() once before
doing any sorting.</FONT></DIV>
<DIV><FONT face=Arial size=2>When you want to sort, you have to use "sort mysort
@list".</FONT></DIV>
<DIV><FONT face=Arial size=2>It is very important for the correct sorting that
every character that ever occurs in anything</FONT></DIV>
<DIV><FONT face=Arial size=2>you want to sort is included in the ordering, i.e.
the list @order.</FONT></DIV>
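<DIV><FONT face=Arial size=2>A minimal sketch of how one might check this
requirement before sorting. The helper missing_chars is hypothetical (it is not
part of the example program), and the sketch assumes the script is saved as
UTF-8. Characters without an entry in %sorthash have no rank, so mysort cannot
order them reliably.</FONT></DIV>
<PRE>
use strict;
use warnings;
use utf8;   # assumption: this script is saved as UTF-8

my @order   = ("a", "A", "â", "Â", "b", "e", "ê", "z");
my @strings = ("abe", "êza", "axe");

# Same mapping that initialize() builds: character => rank.
my %sorthash;
my $i = 1;
foreach my $entry (@order) {
    $sorthash{$entry} = $i;
    $i++;
}

# Hypothetical helper: returns every character of the given words
# that has no entry in %sorthash.
sub missing_chars {
    my @words = @_;
    my %missing;
    foreach my $word (@words) {
        foreach my $char (split("", $word)) {
            $missing{$char} = 1 unless exists $sorthash{$char};
        }
    }
    return sort keys %missing;
}

binmode(STDOUT, ":encoding(UTF-8)");
my @missing = missing_chars(@strings);
print "Not in \@order: @missing\n" if @missing;
</PRE>
<DIV><FONT face=Arial size=2>Here the "x" in "axe" is reported, because it does
not occur in @order.</FONT></DIV>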
<DIV> </DIV>
<DIV><FONT face=Arial size=2>As I am not a real Perl hacker myself, it may well
be that there is a more</FONT></DIV>
<DIV><FONT face=Arial size=2>efficient way, or maybe there is even a bug in
the program, but it seemed to work.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Best,</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>Jan Strunk</FONT></DIV>
<DIV><FONT face=Arial size=2><A
href="mailto:strunk@linguistics.ruhr-uni-bochum.de">strunk@linguistics.ruhr-uni-bochum.de</A></FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>An example is the following code:</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>my @order=("a", "A", "â", "Â", "b", "e", "ê",
"z");
# Has to contain a list of all ordered characters</FONT></DIV>
<DIV><FONT face=Arial
size=2> <BR>my
%sorthash;
# For quicker sorting the sub initialize() puts the list @order into a
hash.</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>my @strings=("a", "â", "e", "âbe", "êz", "abe",
"êza"); # Things you want to sort.</FONT></DIV>
<DIV><FONT face=Arial size=2><BR>initialize(); # Puts the ordering
into a hash of the format ("a" => 1, "A" => 2, "â" => 3, "Â" => 4,
...)</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2>my $string;<BR>foreach $string (sort mysort
@strings) { # Normal way of sorting in perl, but
sort now calls "mysort" for getting the right ordering<BR>
print $string."\n";<BR>}</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2><BR>sub mysort
{ #
Compares the two elements $a and $b supplied by sort<BR> my
$word1=$a;<BR> my $word2=$b;</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2> return 0 if ($word1 eq
$word2);</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2> my @word1=split("",
$word1);<BR> my @word2=split("", $word2);</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2> while ((@word1 > 0) and
(@word2 > 0)) {<BR> my $char1=shift @word1;<BR> my $char2=shift
@word2;</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2> my
$compare=($sorthash{$char1}<=>$sorthash{$char2});</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2> return $compare if ($compare !=
0);<BR> }</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2> # All compared characters were equal, so the shorter word sorts first<BR> return
(scalar(@word1) <=> scalar(@word2));<BR>}</FONT></DIV>
<DIV> </DIV>
<DIV><FONT face=Arial size=2><BR>sub initialize {<BR> my
$i=1;<BR> my $entry;<BR> foreach $entry
(@order) {<BR> $sorthash{$entry}=$i;<BR> $i++;<BR>
}<BR>}<BR></FONT></DIV>
<DIV> </DIV>
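<DIV><FONT face=Arial size=2>Regarding a possibly more efficient way: the
following is only a sketch of one alternative, not a tested replacement for
mysort. It turns every word into a key in which each character is replaced by
chr(rank), so that Perl's built-in "cmp" can do the comparison. The helper
sortkey and the hash %key are hypothetical names, and the sketch assumes the
script is saved as UTF-8.</FONT></DIV>
<PRE>
use strict;
use warnings;
use utf8;   # assumption: this script is saved as UTF-8

my @order   = ("a", "A", "â", "Â", "b", "e", "ê", "z");
my @strings = ("a", "â", "e", "âbe", "êz", "abe", "êza");

# character => rank, as built by initialize()
my %sorthash;
my $i = 1;
foreach my $entry (@order) {
    $sorthash{$entry} = $i;
    $i++;
}

# Hypothetical helper: replace every character by chr(rank), so that the
# plain string comparison "cmp" orders the keys the way @order does.
# Characters missing from @order get rank 0 (an arbitrary choice).
sub sortkey {
    my ($word) = @_;
    return join("", map { chr($sorthash{$_} || 0) } split("", $word));
}

# Compute each key once, then sort the words by their keys.
my %key;
foreach my $word (@strings) {
    $key{$word} = sortkey($word);
}
my @sorted = sort { $key{$a} cmp $key{$b} } @strings;

binmode(STDOUT, ":encoding(UTF-8)");
print "$_\n" foreach @sorted;
</PRE>
<DIV><FONT face=Arial size=2>The keys are computed once per word, instead of
splitting both words into characters on every single comparison. A word that is
a prefix of another one still sorts first, as in mysort.</FONT></DIV>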
<BLOCKQUOTE
style="BORDER-LEFT: #000000 2px solid; MARGIN-LEFT: 5px; MARGIN-RIGHT: 0px; PADDING-LEFT: 5px; PADDING-RIGHT: 0px">
<DIV style="FONT: 10pt arial">----- Original Message ----- </DIV>
<DIV
style="BACKGROUND: #e4e4e4; FONT: 10pt arial; font-color: black"><B>From:</B>
<A href="mailto:henning.reetz@uni-konstanz.de"
title=henning.reetz@uni-konstanz.de>Henning Reetz</A> </DIV>
<DIV style="FONT: 10pt arial"><B>To:</B> <A href="mailto:corpora@hd.uib.no"
title=corpora@hd.uib.no>corpora@hd.uib.no</A> </DIV>
<DIV style="FONT: 10pt arial"><B>Sent:</B> Tuesday, February 04, 2003 3:56
PM</DIV>
<DIV style="FONT: 10pt arial"><B>Subject:</B> [Corpora-List] sorting OHG
(non-ASCII) in PERL</DIV>
<DIV><BR></DIV>
<DIV>Hi,</DIV>
<DIV><BR></DIV>
<DIV>stupid question but perhaps the freaks can help me:</DIV>
<DIV><BR></DIV>
<DIV>we're building a database of Old High German words. Obviously, there are
some characters that are<B> not</B> in ASCII (diacritics like stress marks '
and carets ^) and chars that do not follow the 'normal' sorting order (like
'uu' for 'w'). One possibility would be to recode these chars (e.g. get rid
of the diacritics for sorting and put them back on in the output), but is
there a more elegant and general way (e.g. in case one would like to have a
long 'e' after the short 'e' etc.) so that one could use it for other scripts
as well (UTF puts chars in an order that does not necessarily reflect the
'intuitive' sequence in a language)? Is there a module to tell PERL which
sorting sequence one would like to use or do I have to program it
myself?</DIV>
<DIV><BR></DIV>
<DIV>Thanx for any hints.</DIV>
<DIV><BR></DIV>
<DIV>Henning Reetz</DIV></BLOCKQUOTE></BODY></HTML>