Corpora: code for random selection of concordance lines

Rosie Jones rosie+ at cs.cmu.edu
Thu Mar 21 19:56:10 UTC 2002


On Thu, 21 Mar 2002, Tony Berber Sardinha wrote:
> I wonder if anyone has a bit of perl or java code (or unix utilities)
> for drawing an x number of lines at random from a concordance?
[...]

I was going to post this as a private reply, then remembered that I began
programming in perl after someone sent a snippet of perl for word-counting
to the corpora list a number of years ago, and thought someone else might
benefit in the same way...

Assuming the concordance is small enough to fit in memory, the following
code should work (though admittedly not tested with DOS line-breaks):

--- begin perl code
#!/usr/bin/perl
$numlinestoselect = shift; # get the number of lines from the command line
$myfile = "myconcordance.txt"; # could also get this from the command line
open(CONCORDANCE, '<', $myfile) || die "Cannot open concordance file $myfile: $!\n";
@lines = <CONCORDANCE>; # read ALL lines into memory;
close(CONCORDANCE); # just to be tidy
shift @lines; # get rid of the first line
$totallines = scalar(@lines); # find out how many lines there are
if ($totallines < $numlinestoselect) {
  die "Can't select more lines than there are\n";
}
$linessampled = 0;
srand; # seed the random number generator
while ($linessampled < $numlinestoselect) {
  $rand = int(rand($totallines)); # pick a line index (0 .. $totallines-1) with uniform probability
  if (! $seen[$rand]) { # don't want to select the same line twice
    print $lines[$rand];
    $seen[$rand] = 1;
    $linessampled++; # one more line towards our goal
  }
}
# end of perl code
---
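
If the concordance is too large to read into memory, one alternative is
reservoir sampling, which reads the file once and keeps only the selected
lines. The following is a minimal sketch of that technique, not a tested
replacement for the code above. The subroutine name sample_lines and the
command-line layout (number of lines, then file name) are assumptions made
for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reservoir sampling: choose $k lines uniformly at random from a filehandle,
# reading each line once and holding at most $k lines in memory.
# sample_lines is a hypothetical helper name, not part of the code above.
sub sample_lines {
    my ($fh, $k) = @_;
    my @reservoir;
    my $seen = 0;                          # number of lines read so far
    while (my $line = <$fh>) {
        $seen++;
        if (@reservoir < $k) {
            push @reservoir, $line;        # fill the reservoir first
        } else {
            my $j = int(rand($seen));      # uniform integer in 0 .. $seen-1
            $reservoir[$j] = $line if $j < $k;  # replace with probability k/seen
        }
    }
    return @reservoir;
}

# Assumed usage: perl sample.pl 50 myconcordance.txt
if (@ARGV == 2) {
    my ($k, $file) = @ARGV;
    open(my $fh, '<', $file) or die "Cannot open concordance file $file: $!\n";
    print sample_lines($fh, $k);
    close($fh);
}
```

Unlike the code above, this sketch does not drop the first line, and if the
file has fewer lines than requested it returns all of them rather than
dying. The selected lines form a uniform random subset, but they are not
printed in random order.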

Rosie Jones
PhD student
School of Computer Science
Carnegie Mellon University
rosie at cs.cmu.edu http://www.cs.cmu.edu/~rosie/


