Corpora: Summary - code for random selection of concordance lines

Fri Mar 22 20:32:30 UTC 2002

Dear list members

Thanks to everyone who so kindly responded to my query:

Alexander Clark
David Graff
Rosie Jones
Adam Kilgarriff
Bruce Lambert
Steve Tolkin

Summary of replies follows:

================

Alexander Clark:

shuffle.pl < file | head -n N     (N = number of lines wanted)

#!/usr/bin/perl -w
# shuffle the lines at random
# Using Fisher-Yates algorithm

use strict;
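my @lines = <>;                        # read all input lines into memory
for (my $i = $#lines; $i > 0; $i--) {  # walk from the last index down to 1
    my $j = int rand($i + 1);          # random index in 0..$i
    @lines[$i, $j] = @lines[$j, $i];   # swap the two lines
}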
print @lines;

=========

David Graff

(number of lines set to 20 :)

    $ tail -n +1 conc | perl -pe '$r=rand(); s/^/$r /;' | sort -n | head -n 20 |
      cut -d' ' -f2-

===========

Rosie Jones

#!/usr/bin/perl
$numlinestoselect = shift; # get the number of lines from the command line
$myfile = "myconcordance.txt"; # could also get this from the command line
open(CONCORDANCE, $myfile) || die "Cannot open concordance file $myfile\n";
@lines = <CONCORDANCE>; # read ALL lines into memory;
close(CONCORDANCE); # just to be tidy
shift @lines; # get rid of the first line
$totallines = scalar(@lines); # find out how many lines there are
if ($totallines < $numlinestoselect) {
  die "Can't select more lines than there are\n";
}
$linessampled = 0;
srand; # seed the random number generator
while ($linessampled < $numlinestoselect) {
  $rand = int(rand($totallines)); # pick a line index with uniform probability
  if (! $seen[$rand]) { # don't want to select the same line twice
    print $lines[$rand];
    $seen[$rand] = 1;
    $linessampled++; # one more line towards our goal
  }
}
# end of perl code

=============

Adam Kilgarriff

(number of lines set to 100 :)

#!/usr/local/bin/perl

$numwanted=100;
@rand = sort map(rand(1)." $_", <>);  # prefix each line with a random key, then sort
for (@rand){
    $x++;
    s/^\S+ //;                        # strip the key (also handles forms like 6.1e-05)
    print;
    exit if $x==$numwanted;
}

============

Bruce Lambert

#!/bin/sh

IFILE="$1"
N="$2"

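# Prefix each line with a random key, sort on it, then strip only the key
# (sub() leaves the spacing inside the concordance line untouched).
gawk 'BEGIN {srand()} {print rand(),$0}' "$IFILE" | sort |
  gawk '{sub(/^[^ ]+ /, ""); print}' | head -n "$N"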

On a Unix system that has gawk: Copy this into a file called 'randomize'.
At the prompt (~>) type:

~> chmod +x randomize

then

~> randomize some_input_file N > some_output_file

N is the number of lines desired in the output. If your system does not
have gawk, you can download and install it or try awk (you'll need to
change gawk to awk in the script).

============

Steve Tolkin

(with reference to Rosie Jones's reply)

Do NOT use this approach!  It can be pathologically slow.
Consider what happens in the while loop
if e.g. you ask for 999 samples from a 1000 line file.
Getting the last few samples can take a very long time,
as you repeatedly hit lines that have already been chosen.
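
To put a rough number on this, the sketch below adds up the expected number
of random draws the while loop needs: once k lines have been printed, each
draw finds a fresh line with probability (total - k) / total, so that step
costs total / (total - k) draws on average. The function name expected_draws
is made up for illustration, and the figure is an average, not a bound: the
loop itself has no upper limit on how long it can run.

```shell
#!/bin/sh
# expected_draws TOTAL WANT  (hypothetical helper name)
# Average number of rand() calls the retry loop makes when selecting
# WANT distinct lines out of TOTAL by redrawing on every repeat.
expected_draws() {
    awk -v total="$1" -v want="$2" 'BEGIN {
        for (k = 0; k < want; k++)      # k lines already chosen
            sum += total / (total - k)  # mean draws until a fresh line turns up
        printf "%.0f\n", sum
    }'
}

expected_draws 1000 999   # prints 6485
```

For the 999-from-1000 case that is about 6.5 draws per selected line on
average, and the final line alone accounts for about 500 of them.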

Instead use the Fisher-Yates algorithm, described in the
recent post by Alexander Clark [asc at aclark.demon.co.uk]
with this same subject.
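
When only N lines are wanted, the shuffle can also stop after N swaps rather
than reordering the whole file. The following is a minimal sketch of that
idea, not code from any of the replies: the function name sample_lines is
invented here, it assumes a POSIX awk (substitute gawk if that is what you
have), and it still reads every line into memory. Note that srand() seeds
from the clock, so two runs within the same second may return the same
sample.

```shell
#!/bin/sh
# sample_lines N [FILE...]  (hypothetical helper name)
# Print N distinct lines chosen at random, using a partial Fisher-Yates
# shuffle: each step fixes one position, so it never has to retry.
sample_lines() {
    n="$1"; shift
    awk -v n="$n" '
        BEGIN { srand() }                    # seed from the time of day
        { line[NR] = $0 }                    # keep every line in memory
        END {
            if (n > NR) n = NR               # cannot return more lines than exist
            for (i = NR; i > NR - n; i--) {  # fill only the last n positions
                j = int(rand() * i) + 1      # uniform index in 1..i
                tmp = line[i]; line[i] = line[j]; line[j] = tmp
                print line[i]
            }
        }' "$@"
}

# Small demonstration on four made-up lines.
printf 'one\ntwo\nthree\nfour\n' | sample_lines 2
```

With a concordance file this would be called as: sample_lines 20 conc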

Note that Fisher-Yates is not "complete", in that there
are many possible shuffles that are never returned.
It is "fair", in that all the results have an equal
probability of being chosen.
Search for "fisher-yates perl abigail"
for more details, and/or see
http://www.bumppo.net/lists/fun-with-perl/2000/07/msg00016.html
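
One common explanation for the "not complete" remark, offered here as an
assumption rather than something stated in the reply, is that the random
number generator can only start from a limited number of seeds, and each
seed yields exactly one shuffle. The sketch below finds the smallest number
of lines whose count of possible orderings (n factorial) exceeds the number
of seeds. The 32-bit seed size and the function name are illustrative
assumptions; your Perl or awk may differ.

```shell
#!/bin/sh
# first_n_exceeding_seed_space BITS  (hypothetical helper name)
# Smallest n for which n! is larger than 2^BITS, i.e. the point where a
# generator with BITS-bit seeds can no longer reach every ordering.
first_n_exceeding_seed_space() {
    awk -v bits="$1" 'BEGIN {
        states = 2 ^ bits                   # number of distinct seeds (assumed)
        fact = 1
        for (n = 1; fact <= states; n++)    # grow n! until it passes the seed count
            fact *= n
        print n - 1                         # last n multiplied in
    }'
}

first_n_exceeding_seed_space 32   # prints 13
```

Under that assumption, any file of 13 or more lines already has more
possible orderings than a 32-bit seed can select.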

================

Thanks again to all who took the time to reply.

cheers
tony.
-------------------------------------
Dr Tony Berber Sardinha
LAEL, PUC/SP
(Catholic University of Sao Paulo, Brazil)
tony4 at uol.com.br
http://lael.pucsp.br/~tony
[New website]