[Corpora-List] grep or perl concordancer?
Martin Kay
kay at csli.Stanford.EDU
Thu Jul 25 22:50:26 UTC 2002
On Thu, 25 Jul 2002, Tony Berber Sardinha wrote:
> Dear colleagues
>
> I wonder if anyone has a bit of perl or unix (grep, etc) script that can
> generate KWIC concordances from plain text? I found some awk script for this in
> Ken Church's Unix for Poets.
>
> thanks very much in advance
>
> cheers
> tony.
> -------------------------------------
> Dr Tony Berber Sardinha
> LAEL, PUC/SP
> (Catholic University of Sao Paulo, Brazil)
> tony4 at uol.com.br
> http://lael.pucsp.br/~tony
> [New website]
>
You might try something like this. No guarantees given or implied!
--Martin Kay
--------------------------------------
#!/usr/bin/perl
$help_text=<<EOF;
------------------------------------------------------------------------
kwic [switches] pattern files
Search for a pattern given by a Perl regular expression in a file and
show the matches in key-word-in-context (KWIC) format. Use a pattern
like "/s/S+/s" to get all words. The program makes sure that there is
a space before the first word on each line to make this easy.
switches -l: length of left context.
-r: length of right context.
-i: length of id at the beginning of each line. None if the
switch is omitted.
-s: sort by key and right context. No sort if switch is
omitted.
The value that goes with a switch follows it with no intervening space.
------------------------------------------------------------------------
EOF
$leftlen=30;
$rightlen=30;
$id_width=0;
$output=STDOUT;
while($_=@ARGV[0], /^-/)
{ shift;
/^-l(.*)/ && ($leftlen=$1);
/^-r(.*)/ && ($rightlen=$1);
/^-i(.*)/ && ($id_width=$1);
/^-s/ && ($do_sort=1)
}
if(@ARGV>=1)
{ $pattern=shift @ARGV;
if($do_sort)
{ $key_loc=$id_width+$leftlen;
$output = open(OUT, "| sort -t '' +0.$key_loc");
$output = OUT
}
while(<>)
{ chomp;
s/\s+ / /g; # Remove multiple white space
if($id_width)
{ $id = substr($_, 0, $id_width);
$_ = " " . substr($_, $id_width)
}
$left="";
while(/$pattern/g)
{ $left .= $`;
$right=$&.$';
if(($len = length($left))<$leftlen)
{ $left=" " x ($leftlen-$len).$left; # Pad on left
}
elsif($len>$leftlen) # or truncate left
{ $left=substr($left, -$leftlen);
}
if(($len=length($right)) > $rightlen) # Truncate right
{ $right=substr($right, 0, $rightlen);
};
print $output "$id" if($id_width);
print $output "$left$right\n";
$left .= $&;
}
};
if($do_sort)
{ close($output)
}
}
else
{ print "Usage:\n$help_text";
}
More information about the Corpora
mailing list