Corpora: code for random selection of concordance lines

Rosie Jones rosie+ at cs.cmu.edu
Fri Mar 22 22:00:19 UTC 2002


Interestingly, in practice this method is not appreciably slower. For the
cited example, selecting 999 lines from a 1000 line concordance, (and
doing it 100 times over) doing the fisher yates method on my machine takes
1 second, while the sampling-and-checking method takes 3 seconds. When
selecting 999 lines from a 10,000 line concordance (100 times), the
sampling-and-checking method takes 1 second, while the fisher yates
algorithm (which after all has to shuffle 10,000 lines) takes 14 seconds.
Samples of up to 75% of the total concordance size are faster with the
sampling-and-checking method.

So for large concordance size, with sample size of 10%-75% of concordance
size, the apparently more naive algorithm appears to be more efficient.
For small concordance size, you probably won't notice the difference in
time anyway. For large concordance size and large sample size, the fisher
yates method is probably the best way.

Rosie Jones.

"There's more than one way to do it"

 On Fri, 22 Mar 2002, Tolkin, Steve wrote:

> Do NOT use this approach!  It can be pathologiclaly slow.
> Consider what happens in the while loop
> if e.g. you ask for 999 samples from a 1000 line file.
> Getting the last few samples can take a very long time,
> as you repeatedly hit lines that have already used chosen.
>
> Instead use the Fisher-Yates algorithm, described in the
> recent post by Alexander Clark [asc at aclark.demon.co.uk]
> with this same subject.

[...]



More information about the Corpora mailing list