Corpora: code for random selection of concordance lines
Radu Florian
rflorian at cs.jhu.edu
Fri Mar 22 22:51:52 UTC 2002
Rosie Jones wrote:
> Interestingly, in practice this method is not appreciably slower. For the cited example,
> selecting 999 lines from a 1000 line concordance, (and doing it 100 times over) doing the
> fisher yates method on my machine takes 1 second, while the sampling-and-checking method takes
> 3 seconds. When selecting 999 lines from a 10,000 line concordance (100 times), the
> sampling-and-checking method takes 1 second, while the fisher yates algorithm (which after all
> has to shuffle 10,000 lines) takes 14 seconds. Samples of up to 75% of the total concordance
> size are faster with the sampling-and-checking method.
>
> So for large concordance size, with sample size of 10%-75% of concordance size, the apparently
> more naive algorithm appears to be more efficient. For small concordance size, you probably
> won't notice the difference in time anyway. For large concordance size and large sample size,
> the fisher yates method is probably the best way.
>
> Rosie Jones.
>
> "There's more than one way to do it"
>
> On Fri, 22 Mar 2002, Tolkin, Steve wrote:
>
>
>> Do NOT use this approach! It can be pathologiclaly slow. Consider what happens in the while
>> loop if e.g. you ask for 999 samples from a 1000 line file. Getting the last few samples can
>> take a very long time, as you repeatedly hit lines that have already used chosen.
>>
>> Instead use the Fisher-Yates algorithm, described in the recent post by Alexander Clark
>> [asc at aclark.demon.co.uk] with this same subject.
>>
>
> [...]
I've used the following method, which is very close to Rosie Jones', only that it avoids the little
problem that was mentioned: (I just adapted the previous Perl script for convenience)
#!/usr/bin/perl
$numlinestoselect = shift; # get the number of lines from the command line
$myfile = "myconcordance.txt"; # could also get this from the command line
open(CONCORDANCE, $myfile) || die "Cannot open concordance file
$myfile\n";
@lines = <CONCORDANCE>; # read ALL lines into memory;
close(CONCORDANCE); # just to be tidy
shift @lines; # get rid of the first line
$totallines = scalar(@lines); # find out how many lines there are
if ($totallines < $numlinestoselect) { die "Can't select more lines than
there are\n" };
$linessampled = 0;
srand; # seed the random number generator
while ($linessampled < $numlinestoselect) {
$rand = rand($totallines); # pick a line with uniform probability
# *** here's the difference
# *** switch the selected position with the last one in the list
print $lines[$rand];
($lines[$rand], $lines[$totallines]) = ($lines[$totallines], $lines[$rand]);
$totallines--;
$linesampled++;
# if (! $seen[$rand]) { # don't want to select the same line twice
# print $lines[$rand];
# $seen[$rand] = 1;
# $linessampled++; # one more line towards our goal
# }
}
# end of perl code
--
--Radu Florian
More information about the Corpora
mailing list