[Corpora-List] Summary - grep or perl concordancer?

Tony Berber Sardinha tony4 at uol.com.br
Mon Aug 12 14:32:53 UTC 2002


Dear list members

Thanks to everyone who responded to my query about perl /grep scripts for
generating concordances:

J.Joao Dias de Almeida
Dave Graff
Martin Kay
Petra Maier
Ronald P. Reck
Serge Sharoff
Danko Sipka
Jérome Vachey
Daniel Walker
Pete Whitelock

Daniel Walker reminded me that a similar question had already been asked on this
list: http://www.hit.uib.no/corpora/1999-3/0375.html
I apologize; I was unaware of that - a search in Google did not bring up any
pointers to this previous discussion.

Petra Maier and J.Joao Dias de Almeida suggested the Unix ptx command,
available in Linux and Windows Cygwin as well, which generates indexed
(concordances). (The cygwin version - for Windows - did not seem to work,
though.)

The other pointers and scripts appear below, separated by '=/=/=/=/=/'

Thanks again.

Tony.

=/=/=/=/=/

Danko Sipka

Take a look at:
http://main.amu.edu.pl/~sipkadan/lingo.htm
You can pick up the script from that page and see how it works.

=/=/=/=/=/

Ronald P. Reck
rreck at iama.rrecktek.com

Hi maybe this will help

see #4
http://iama.rrecktek.com/text/

http://iama.rrecktek.com/cgi-bin/cvsweb.cgi/rr-text/wordparse?rev=1.2&content-ty
pe=text/x-cvsweb-markup

=/=/=/=/=/

Serge Sharoff

there is a set of examples for using Perl in Do-it-yourself
corpus studies, including KWIC concordances at:
http://www.geocities.com/SoHo/Square/3472/program.html#scripts

=/=/=/=/=/

Dave Graff

You should check out this web site:

www.perlmonks.org

In fact, if you or any of your students/colleagues use perl a lot, you
should join "the monastery" (but you don't have to).

I tried putting "key word in context" into the "search" box at the top
of the perlmonks home page, and found a number of hits that provide code
to solve the problem.  (It turns out that someone had posed the KWIC
task as a "golfing" challenge, where perl hackers try to come in with
the "lowest score", based on the number of bytes of perl code needed to
solve the problem.)

=/=/=/=/=/

Pete Whitelock              \ Principal Research Scientist
 E-mail: pete at sharp.co.uk    \ Sharp Laboratories of Europe Ltd

Here's one that generates KWIC output for any pattern (Perl regular
expression, inc plain text)


#!/usr/local/bin/perl

# Author: Pete Whitelock
# Start Date: 15.5.94

# simple KWIC
$usage = q!

Usage: kwic [-ikb] [-c int|-l int -r int] pattern (filename)

prints pattern in $opt_c characters of context on either side
or L,R characters of context on left or right respectively (defaults to 50)

-i means case-insensitive
-k means print a tab-separated initial key in output as well
-b means print &nbsp instead of spaces for html to browser

set -c to 0 and pipe to sus (sort|uniq -c|sort -nr) to count instances of string
in file

!;

use Getopt::Std;
getopts('hikbc:l:r:');

die "$usage" if $opt_h;

$printing_key = $opt_k;
$case_insensitive = $opt_i;
$html = $opt_b;
$left_context = $opt_l || $opt_c || 50;
$right_context = $opt_r || $opt_c || 50;

$pattern = shift(@ARGV);

if ($pattern eq '') {print "$usage\n"; exit;}

# protect pattern
$pattern =~ s#/#\\/#g;

if (defined $ARGV[0]) {
  open(INPUT,$ARGV[0]) || die "Couldn't open file $ARGV[0]\n";
}
else {open (INPUT,"-");}

while(<>) {
  if ($printing_key) {
    ($key) = /^([^\t]*\t)/;
    $key =~ s#//P/Corpora and Linguistic Tools/BNCTagged/./../##;
    s/^[^\t]*\t//;
  }
  else {
    $key = '';
  }
  $key =~ s/\t/     /g;
  $key_length = length($key);

  # updated this 9.12.99 to display tabs correctly
  s/\t/     /g;
  $left = '';
  $match = '';
  if ($case_insensitive) {
    while (/($pattern)/ig) {
      &do;
    }
  }
  else {
    while (/($pattern)/g) {
      &do
    }
  }
}

sub do {

    ($left,$match,$right) = ($`,$&,$');
    $left =~ /.{0,$left_context}$/;
    $tleft = $&;
    $pad = ' ' x ($left_context - length($tleft));
    $right =~ /^.{0,$right_context}/;

    printf "%-20s", $key if $key;
    if ($html) {
      $pad =~ s/ / /g;
    }
    print $pad,$tleft,$match,$&,"\n";
  }

  exit;

=/=/=/=/=/
Martin Kay

You might try something like this.  No guarantees given or implied!

--------------------------------------
#!/usr/bin/perl
$help_text=<<EOF;
------------------------------------------------------------------------
kwic [switches] pattern files

Search for a pattern given by a Perl regular expression in a file and
show the matches in key-word-in-context (KWIC) format. Use a pattern
like "/s/S+/s" to get all words.  The program makes sure that there is
a space before the first word on each line to make this easy.

switches -l: length of left context.
         -r: length of right context.
         -i: length of id at the beginning of each line.  None if the
             switch is omitted.
         -s: sort by key and right context.  No sort if switch is
             omitted.

The value that goes with a switch follows it with no intervening space.

------------------------------------------------------------------------

EOF

$leftlen=30;
$rightlen=30;
$id_width=0;
$output=STDOUT;
while($_=@ARGV[0], /^-/)
  { shift;
    /^-l(.*)/ && ($leftlen=$1);
    /^-r(.*)/ && ($rightlen=$1);
    /^-i(.*)/ && ($id_width=$1);
    /^-s/ && ($do_sort=1)
  }

if(@ARGV>=1)
  { $pattern=shift @ARGV;
  if($do_sort)
    { $key_loc=$id_width+$leftlen;
      $output = open(OUT, "| sort -t '' +0.$key_loc");
      $output = OUT
    }

  while(<>)
    { chomp;
      s/\s+ / /g;                      # Remove multiple white space
        if($id_width)
  { $id = substr($_, 0, $id_width);
            $_ = " " . substr($_, $id_width)
  }
        $left="";
        while(/$pattern/g)
          { $left .= $`;
            $right=$&.$';
            if(($len = length($left))<$leftlen)
              { $left=" " x ($leftlen-$len).$left;   # Pad on left
              }
            elsif($len>$leftlen)                     # or truncate left
              { $left=substr($left, -$leftlen);
              }
            if(($len=length($right)) > $rightlen)    # Truncate right
              { $right=substr($right, 0, $rightlen);
    };
            print $output "$id" if($id_width);
            print $output "$left$right\n";
            $left .= $&;
          }
      };
    if($do_sort)
      { close($output)
      }
  }
else
  { print "Usage:\n$help_text";
  }

=/=/=/=/=/

jérome vachey
jvachey at free.fr

the joined perl script is a customized version of the program
provided by g.j.m. van noord on thu,  7 oct 1999 to corpora-list
in answer to a similar question.

the original script looked for one special word given as an argument.
this version produces a concordance for all words of the text.

to sort the words in the output apply a sort command with tab delimiter:

  perl couique2.pl mytext.txt | sort -t'    ' +1

------

#!/home/bin/perl -w
# ----------------------------------------------------------
# concordance (kwic) en perl trouvée sur internet
# et légèrement adaptée. voir les commentaires originaux.
# jevy, 2000-12-15
# ----------------------------------------------------------

# ----------------------------------------------------------
# jevy, 2000-12-18
# - on traite tous les mots au lieu de demander une entrée
# - tabulation devant le mot pour faciliter tris et recherches
# ----------------------------------------------------------
# couique2.pl [-f W -l W -n W -r W -s expr] [files]
# geeft _per _zin_ die met Word matcht_ de linker- en rechtercontext of Word
# - iedere match per zin gerapporteerd
# - alleen context binnen dezelfde zin
# - breedte van context wordt gegeven door $opt_l en $opt_r
# - geeft ook bestandnaam en regelnummer (breedte met $opt_n $opt_f)
# - $opt_s bepaalt hoe einde van de zin gedefinieerd is.
# ----------------------------------------------------------
# Corpora May 1999 to Jun 1999: Corpora: kwic concordances with Perl
# Corpora: kwic concordances with Perl
# Noord G.J.M. van (vannoord at let.rug.nl)
# Thu,  7 Oct 1999 17:04:47 +0200 (METDST)
#
# Christer Geisler writes:
#  > The Perl script below (adapted from Dan Malamed's 2kwic.pl) will produce
#  > kwic concordances on a match, but
#  > a) will not detect multiple occurrencences on a line,
#  > b) nor find complex patterns across several lines.
#  >
#  > Can someone suggest other ways of writing simple kwic programs in Perl?
#  > Should I split into an array, use Perl's format, etc?
#
# both a) and b) are treated by the script below. No warrenties!
# Some comments are in Dutch (which is useful for some..).
# It obtains b) by treating paragraphs at the time. It also does
# sentence splitting which might not be what you want (exercise left
# to the reader).
# ----------------------------------------------------------

use strict;
use vars qw($opt_f $opt_h $opt_l $opt_n $opt_r $opt_s);
use Getopt::Std;

# assign command line options:
getopts('f:hl:n:r:s:');

# assign default values to options
$opt_f = defined($opt_f) ? $opt_f : 0;
$opt_l = defined($opt_l) ? $opt_l : 30;
$opt_n = defined($opt_n) ? $opt_n : 0;
$opt_r = defined($opt_r) ? $opt_r : 30;
$opt_s ||= '[\.\?\!][\'\"]?\s';

# there must be at least one option remaining: the file

if ((@ARGV < 1 ) or (defined($opt_h))) { die
"
usage: $0 [-f w -l w -n w -r w -s expr] [files]
    -f argument determines width of file name (0 for full file name),
       default: $opt_f
       nb. file name is printed only if there is more than one input file.
    -h displays this help message
    -l argument determines width of left context,
       default: $opt_l
    -n argument determines width of line number field,
       default: $opt_n
    -r argument determines width of right context,
       default: $opt_r
    -s argument is a perl regular expression for end of sentence.
       default: $opt_s

";
           }

my $report_file_name=1;
$report_file_name=0 if @ARGV < 2;

# any remaining arguments are file names. If more than one file name,
# we report file name for each match.

$/="";  # reads a paragraph at a time. This gives unexpected results on
        # dos files (more like slurp them...

while(<>) {
  close ARGV if eof;               # for $. (current record nr of input)
  foreach $_ (split $opt_s) {
    tr/\n\t\r / /s;                # removes ^M, ^J, ^I
#    while (/$word/gio) {          # report each match
      while(/[^\s\(\["]+/gio) {    # tous les mots
      if ($report_file_name) {
 printf("%*s ",$opt_f,
        length($ARGV)>$opt_f ? substr($ARGV,-$opt_f) : $ARGV);
      }
      printf("%*s %*s\t%s %-*s\n",
      $opt_n,$.,
      $opt_l,$opt_l ? (length($`)>$opt_l ? substr($`,-$opt_l):$`): "",
      $&,
      $opt_r,substr($',0,$opt_r));
    }
  }
}

# ----------------------------------------------------------

=/=/=/=/=/

Dr Daniel Robertson
Centre for English Language Teaching
University of Stirling

I have a perl script which I call "kwic" which does the job.  It's
very slow, though, and I find it best to pre-process the corpus with a
grep type utility which I call "cgrep" (written by Richard Caley,
HCRC, University of Edinburgh).  I also use a script called "kwicsort"
which sorts the output of "kwic" by left or right context.  The coding
is not very elegant and I'm sure the scripts could be made more
efficient and elegant by a perl guru but they work for me.  I append
copies of both.

cgrep
=======================================================================
#!/usr/bin/perl
# usage: cgrep [-lines] pattern [files]

$usage_msg = "Usage: cgrep [-lines] pattern [files]\n";
$context = 1;

if ($ARGV[0] eq "-h") {
    die $usage_msg;
};

# A switch to set the number of lines of context before and after

if ($ARGV[0] =~ /^-(\d+)$/) {
    $context = $1;
    shift;
}

# Get the pattern and protect the delimiter.

$pat = shift;
$pat =~ s#/#\\/#g;

# First line of input will be middle of array.
# In the eval below, it will be $ary[$context].

$_ = <>;
push(@ary,$_);

# Add blank lines before, more input after first line.

for (1 .. $context) {
    unshift(@ary,'');
    $_ = <>;
    push(@ary,$_) if $_;
}

# Now use @ary as a silo, shifting and pushing.

eval <<LOOP_END;
    while (\$ary[$context]) {
if (\$ary[$context] =~ /$pat/) {
    print "------\n" if \$seq++;
    print \@ary,"\n";
}
\$_ = <> if \$_;
shift(\@ary);
push(\@ary,\$_);
    }
LOOP_END
==================================================================

kwic
==================================================================
#!/usr/bin/perl
# usage: kwic pattern infiles
# advisable to preprocess database with cgrep

$screen = 78;             # default
$delimiter = "\n\n";      # default

$usage_msg = "Usage: kwic [-options] pattern file[s]
     pattern is any string (not a regular expression)
     options:
       -f       print file name
       -l       print line number
       -h       help
       -lob     input files are in LOB format
       -Cn      print with effective width of context n columns
       -Dchar   delimiter is char
       -Ffile   patterns in file\n";

if ($ARGV[0] =~ /\-h/) { die $usage_msg };

# process command line switches
while ($ARGV[0] =~ /^-/) {
    $_ = shift;
    if (/\-f/) {
$printfilename = 1;
    } elsif (/\-lob/) {
$lob = 1;
    } elsif (/\-l/) {
$printlinenumber = 1;
    } elsif (/\-C(\d+)/) {
$screen = $1;     # width of context
    } elsif (/\-D(\S+)/) {
$delimiter = $1;  # pattern space delimiter
    } elsif (/\-F(\w+)/) {
$patterns_in_file;
$pattern_file_name = $1;
    } else {
die "Unrecognized switch\n";
    };
};

if ($patterns_in_file) {
    open(PATTERNS, $pattern_file) || die "Can't find pattern file\n";
    while(<PATTERNS>) {
chop;
push(@patterns,$_);
    };
};

$pattern = shift(@ARGV);
@P = split(//,$pattern);  # array of all characters in pattern

$startcol = int($screen/2) - int(length($pattern)/2) ;
                          # position of first letter of pattern on screen

$* = 1;                   # enable multi-line patterns
$/ = $delimiter;          # enable paragraph mode
# $/ = "------";            # delimiter introduced by cgrep

if (!-e $ARGV[0]) {       # if no input file specified on command line
    die "No input file\n";
};

while (@ARGV) {           # with remaining argument line files
    $file = shift @ARGV;                # take next command line argument
    open(IN, $file);
    while (<IN>) {
if ($lob) {
    s/\^//g;                    # remove hats
            s/\|//g;                    # remove pipe symbols
            s/\\0//g;
            s/\*0//g;
            s/\*"/"/g;
            s/\*\*"/"/g;
            s/\*<\*\d/</g;
            s/\*>/>/g;
            s/\*//g;
            if (/^(\w\d\d)\s+(\d+)/) {
$lob_id = $1; $lob_line = $2;
    };
            s/^(\w\d\d)\s+(\d+) (.*)$/$3/g;
};
s/\n/ /g;         # replace new line with space
s/\t/ /g;                       # replace tab with space
        s/------//g;                    # remove cgrep delimiter
tr/\ //s;                       # replace multiple spaces with space
@PAR1 = split(//);              # character array of current par
        @PAR2 = @PAR1;
        $i = 0; $p = 0; $hit = "";      # initialize counters
        while (@PAR2) {                 # search string non-empty
            $ch = shift @PAR2;          # take next character in para
            $p++;                       # augment current para position
            if ($ch ne $P[$i]) {        # no match
                $i = 0;                 # re-initialize pattern counter
                $hit = "";              # re-initialize hit
            } elsif ($ch eq $P[$i]) {   # if para pos matches pattern pos
                $hit .= $ch;            # add to hit
                $i++;                   # increment pattern pos
                if ($hit eq $pattern) {
    if ($printfilename) {
                        $file =~ s/^(.*)\.txt$/\1/;
        print $file, ":";
    };
                    if ($printlinenumber) {
        #printf("%3d:", ($lob_line + 2));
printf("%3d:", $.);
    };
            $bpos = $p - $i;
            if ($bpos <= $startcol) {
                $shr = $startcol - $bpos;    # shift right
                        for ($x=1; $x<=$shr; $x++) { # print leading blanks
                            printf("%s"," ");
                        };
                        $l = $screen - $shr;         # length of remainder
                        for ($x=0; $x<=$l; $x++) {   # print para from beg
                            printf("%s",$PAR1[$x]);  # to fill up line
                        };
                        printf("\n");
                    } elsif ($bpos > $startcol) {
                        $start = $bpos - $startcol;
                        for ($x=$start; $x<=($start+$screen); $x++) {
                            printf("%s",$PAR1[$x]);
                        };
                        printf("\n");
                    };
                };
            };
        };
    };
    close(IN);
};
 ===========================================================================

kwicsort
 ===========================================================================
#!/usr/bin/perl
# usage: kwicsort -[lcr] pattern kwic.file

$usage_msg = "Usage: kwicsort -[lcr] pattern kwic.file
    pattern is any perl regular expression
    options:
       -l : sort by left context
       -c : sort by pattern
       -r : sort by right context\n";

# process options
if ($ARGV[0] =~ /\-h/) {
    die $usage_msg;
} elsif ($ARGV[0] =~ /\-l/) {
    $lsort = 1;
} elsif ($ARGV[0] =~ /\-r/) {
    $rsort = 1;
} elsif ($ARGV[0] =~ /\-c/) {
    $csort = 1;
} else { die $usage_msg };

$pattern = $ARGV[1];
$l = length($pattern);

open(IN, $ARGV[2]) || open(IN, '-') || die "Can't find input file\n";
while (<IN>) {
    chop;                            # remove new line
    $is_kwic = 0;
    $midpoint = 36;                  # half length of line
    $line{$.} = $_;                  # preserve input
    $dollarstring = '$' x $l;
  # check that pattern is centred in line
    do {
        m/$pattern/g;
if (pos $_ >= $midpoint) {
    $is_kwic = 1;            # this instance of kwic is centred
    s/($pattern)/~\1~/g;     # insert tilde round kwic
} else {                     # replace temporarily with $'s
            s/$pattern/$dollarstring/;
};
    } until $is_kwic;

    s/$dollarstring/$pattern/g;      # restore kwic
    tr/A-Z/a-z/;                     # canonicalize to lower-case
    ($left, $target, $right) = split(/~/); # split line on tildes
    if ($lsort) {                    # reverse order of words in left context
        $key = join("", reverse split(/(\S+)/, $left));
    } elsif ($rsort) {
        $key = $right;
    } elsif ($csort) {
$key = $target . $right ;
    };
    $sortkey = join('@', $key, $.);
    push(@KEYS, $sortkey);
};
close(IN);
 ===============================================================================
@SORTED = sort @KEYS;

foreach $sortkey (@SORTED) {
    ($key, $l) = split(/@/, $sortkey);
    printf("%s\n", $line{$l});
};


=/=/=/=/=/

cheers
tony.
-------------------------------------
Dr Tony Berber Sardinha
LAEL, PUC/SP
(Catholic University of Sao Paulo, Brazil)
tony4 at uol.com.br
http://lael.pucsp.br/~tony
[New website]



More information about the Corpora mailing list