Corpora: code for random selection of concordance lines

Fri Mar 22 22:51:52 UTC 2002

Rosie Jones wrote:
 > Interestingly, in practice this method is not appreciably slower. For the cited example,
 > selecting 999 lines from a 1000 line concordance, (and doing it 100 times over) doing the
 > fisher yates method on my machine takes 1 second, while the sampling-and-checking method takes
 > 3 seconds. When selecting 999 lines from a 10,000 line concordance (100 times), the
 > sampling-and-checking method takes 1 second, while the fisher yates algorithm (which after all
 > has to shuffle 10,000 lines) takes 14 seconds. Samples of up to 75% of the total concordance
 > size are faster with the sampling-and-checking method.
 >
 > So for large concordance size, with sample size of 10%-75% of concordance size, the apparently
 > more naive algorithm appears to be more efficient. For small concordance size, you probably
 > won't notice the difference in time anyway. For large concordance size and large sample size,
 > the fisher yates method is probably the best way.
 >
 > Rosie Jones.
 >
 > "There's more than one way to do it"
 >
 > On Fri, 22 Mar 2002, Tolkin, Steve wrote:
 >
 >
 >> Do NOT use this approach!  It can be pathologiclaly slow. Consider what happens in the while
 >> loop if e.g. you ask for 999 samples from a 1000 line file. Getting the last few samples can
 >> take a very long time, as you repeatedly hit lines that have already used chosen.
 >>
 >> Instead use the Fisher-Yates algorithm, described in the recent post by Alexander Clark
 >> [asc at aclark.demon.co.uk] with this same subject.
 >>
 >
 > [...]

I've used the following method, which is very close to Rosie Jones', only that it avoids the little
problem that was mentioned: (I just adapted the previous Perl script for convenience)

#!/usr/bin/perl
$numlinestoselect = shift; # get the number of lines from the command line
$myfile = "myconcordance.txt"; # could also get this from the command line
open(CONCORDANCE, $myfile) || die "Cannot open concordance file
$myfile\n";
@lines = <CONCORDANCE>; # read ALL lines into memory;
close(CONCORDANCE); # just to be tidy
shift @lines; # get rid of the first line
$totallines = scalar(@lines); # find out how many lines there are
if ($totallines < $numlinestoselect) { die "Can't select more lines than
there are\n" };
$linessampled = 0;
srand; # seed the random number generator
while ($linessampled < $numlinestoselect) {
   $rand = rand($totallines); # pick a line with uniform probability
   # *** here's the difference
   # *** switch the selected position with the last one in the list
   print $lines[$rand];
   ($lines[$rand], $lines[$totallines]) = ($lines[$totallines], $lines[$rand]);
   $totallines--;
   $linesampled++;
#  if (! $seen[$rand]) { # don't want to select the same line twice
#    print $lines[$rand];
#    $seen[$rand] = 1;
#    $linessampled++; # one more line towards our goal
#  }
}
# end of perl code

--

    --Radu Florian