[Corpora-List] grep or perl concordancer?

Martin Kay kay at csli.Stanford.EDU
Thu Jul 25 22:50:26 UTC 2002


On Thu, 25 Jul 2002, Tony Berber Sardinha wrote:

> Dear colleagues
>
> I wonder if anyone has a bit of perl or unix (grep, etc) script that can
> generate KWIC concordances from plain text? I found some awk script for this in
> Ken Church's Unix for Poets.
>
> thanks very much in advance
>
> cheers
> tony.
> -------------------------------------
> Dr Tony Berber Sardinha
> LAEL, PUC/SP
> (Catholic University of Sao Paulo, Brazil)
> tony4 at uol.com.br
> http://lael.pucsp.br/~tony
> [New website]
>

You might try something like this.  No guarantees given or implied!

--Martin Kay

--------------------------------------
#!/usr/bin/perl
$help_text=<<EOF;
------------------------------------------------------------------------
kwic [switches] pattern files

Search for a pattern given by a Perl regular expression in a file and
show the matches in key-word-in-context (KWIC) format. Use a pattern
like "/s/S+/s" to get all words.  The program makes sure that there is
a space before the first word on each line to make this easy.

switches -l: length of left context.
         -r: length of right context.
         -i: length of id at the beginning of each line.  None if the
             switch is omitted.
         -s: sort by key and right context.  No sort if switch is
             omitted.

The value that goes with a switch follows it with no intervening space.

------------------------------------------------------------------------

EOF

$leftlen=30;
$rightlen=30;
$id_width=0;
$output=STDOUT;
while($_=@ARGV[0], /^-/)
  { shift;
    /^-l(.*)/ && ($leftlen=$1);
    /^-r(.*)/ && ($rightlen=$1);
    /^-i(.*)/ && ($id_width=$1);
    /^-s/ && ($do_sort=1)
  }

if(@ARGV>=1)
  { $pattern=shift @ARGV;
  if($do_sort)
    { $key_loc=$id_width+$leftlen;
      $output = open(OUT, "| sort -t '' +0.$key_loc");
      $output = OUT
    }

  while(<>)
    { chomp;
      s/\s+ / /g;                      # Remove multiple white space
        if($id_width)
	  { $id = substr($_, 0, $id_width);
            $_ = " " . substr($_, $id_width)
	  }
        $left="";
        while(/$pattern/g)
          { $left .= $`;
            $right=$&.$';
            if(($len = length($left))<$leftlen)
              { $left=" " x ($leftlen-$len).$left;   # Pad on left
              }
            elsif($len>$leftlen)                     # or truncate left
              { $left=substr($left, -$leftlen);
              }
            if(($len=length($right)) > $rightlen)    # Truncate right
              { $right=substr($right, 0, $rightlen);
	    };
            print $output "$id" if($id_width);
            print $output "$left$right\n";
            $left .= $&;
          }
      };
    if($do_sort)
      { close($output)
      }
  }
else
  { print "Usage:\n$help_text";
  }



More information about the Corpora mailing list