Corpora: code for random selection of concordance lines

Tolkin, Steve Steve.Tolkin at FMR.COM
Fri Mar 22 13:52:34 UTC 2002


Do NOT use this approach!  It can be pathologically slow.
Consider what happens in the while loop
if e.g. you ask for 999 samples from a 1000 line file.
Getting the last few samples can take a very long time,
as you repeatedly hit lines that have already been chosen.
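
A rough way to put numbers on this, assuming rand() is uniform and
each draw is independent. The helper name expected_draws is made up
for this sketch and is not part of the original code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Expected number of rand() calls the retry loop needs to collect
# $k distinct lines out of $n.  Once $i lines are taken, a draw hits
# a fresh line with probability ($n - $i) / $n, so that step costs
# $n / ($n - $i) draws on average.
sub expected_draws {
    my ($n, $k) = @_;
    my $total = 0;
    $total += $n / ($n - $_) for 0 .. $k - 1;
    return $total;
}

printf "%4d of 1000: about %.0f draws\n", $_, expected_draws(1000, $_)
    for (10, 500, 999, 1000);
```

In the 999-of-1000 case the final sample alone costs 500 draws on
average, because only two unseen lines remain at that point.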

Instead use the Fisher-Yates algorithm, described in the
recent post by Alexander Clark [asc at aclark.demon.co.uk]
with this same subject.
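
For reference, a minimal sketch of sampling with a partial
Fisher-Yates shuffle. This is my own illustration, not the code from
that post; sample_lines and the stand-in @lines data are assumptions.
It makes exactly one rand() call per selected line, however many
lines are requested:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pick $k distinct elements of @$items uniformly at random using a
# partial Fisher-Yates shuffle: exactly $k calls to rand().
sub sample_lines {
    my ($items, $k) = @_;
    my @pool = @$items;                   # work on a copy
    my $n = @pool;
    die "Can't select more lines than there are\n" if $k > $n;
    for my $i (0 .. $k - 1) {
        my $j = $i + int rand($n - $i);   # uniform index in $i .. $n-1
        @pool[$i, $j] = @pool[$j, $i];    # swap the chosen item into slot $i
    }
    return @pool[0 .. $k - 1];
}

my @lines = map { "line $_\n" } 1 .. 1000;   # stand-in for the concordance
print sample_lines(\@lines, 5);
```

The selected lines come back in random order, not file order.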

Note that in practice Fisher-Yates is not "complete": a seeded
pseudo-random generator can produce only as many shuffles as it
has seeds, so many possible shuffles are never returned.
It is "fair", in that all the
results have an equal probability of being chosen.
Search for "fisher-yates perl abigail"
for more details, and/or see
http://www.bumppo.net/lists/fun-with-perl/2000/07/msg00016.html
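
One way to read the "not complete" remark is as a counting argument:
a generator seeded with b bits can produce at most 2**b different
shuffles, while a list of n items has n! orderings. The sketch below
ASSUMES a 32-bit seed purely for illustration (the real seed size
depends on your perl build), and first_incomplete_length is a made-up
helper name:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ASSUMPTION: the generator is seeded with a 32-bit value, so at most
# 2**32 different random streams (and hence shuffles) are possible.
my $seed_bits = 32;
my $streams   = 2 ** $seed_bits;

# Smallest list length n whose number of orderings (n!) exceeds $limit.
sub first_incomplete_length {
    my ($limit) = @_;
    my ($n, $factorial) = (1, 1);
    $factorial *= ++$n while $factorial <= $limit;
    return $n;
}

printf "with %d seed bits, some orderings of %d or more items can never appear\n",
    $seed_bits, first_incomplete_length($streams);
```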

Hopefully helpfully yours,
Steve
--
Steven Tolkin          steve.tolkin at fmr.com      617-563-0516
Fidelity Investments   82 Devonshire St. V8D     Boston MA 02109
There is nothing so practical as a good theory.  Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.

> -----Original Message-----
> From: Rosie Jones [mailto:rosie+ at cs.cmu.edu]
> Sent: Thursday, March 21, 2002 2:56 PM
> To: Tony Berber Sardinha
> Cc: corpora list - messages to list
> Subject: Re: Corpora: code for random selection of concordance lines
>
>
> On Thu, 21 Mar 2002, Tony Berber Sardinha wrote:
> > I wonder if anyone has a bit of perl or java code (or unix utilities)
> > for drawing an x number of lines at random from a concordance?
> [...]
>
> I was going to post this as a private reply, then remembered that I
> began programming in perl after someone sent a snippet of perl for
> word-counting to the corpora list a number of years ago, and thought
> someone else might benefit in the same way...
>
> Assuming the concordance is small enough to fit in memory, the
> following code should work (though admittedly not tested with DOS
> line-breaks):
>
> --- begin perl code
> #!/usr/bin/perl
> $numlinestoselect = shift; # get the number of lines from the command line
> $myfile = "myconcordance.txt"; # could also get this from the command line
> open(CONCORDANCE, $myfile) || die "Cannot open concordance file $myfile\n";
> @lines = <CONCORDANCE>; # read ALL lines into memory;
> close(CONCORDANCE); # just to be tidy
> shift @lines; # get rid of the first line
> $totallines = scalar(@lines); # find out how many lines there are
> if ($totallines < $numlinestoselect) { die "Can't select more lines than there are\n" };
> $linessampled = 0;
> srand; # seed the random number generator
> while ($linessampled < $numlinestoselect) {
>   $rand = rand($totallines); # pick a line with uniform probability
>   if (! $seen[$rand]) { # don't want to select the same line twice
>     print $lines[$rand];
>     $seen[$rand] = 1;
>     $linessampled++; # one more line towards our goal
>   }
> }
> # end of perl code
> ---
>
> Rosie Jones
> PhD student
> School of Computer Science
> Carnegie Mellon University
> rosie at cs.cmu.edu http://www.cs.cmu.edu/~rosie/
>
>
>


