I learned a nice algorithm for choosing a random, evenly distributed sample along a stream of unknown size, or one that you don't have random access to. It originally showed up as a question from Amazon on careercup.com.
The näive algorithm (and the only solution i could come up with on my own) requires two passes, one to get the number of elements n and a second to pluck the k random elements from the stream.
The best explanation I could find was from this blog: http://gregable.com/2007/10/reservoir-sampling.html. Begin by filling the output buffer with the first k items of the stream. Then iterate through the entire stream, and replace a random element in the output buffer with the current element at a probability of k/s where s is the index of the current element beginning from 1.
Here's the implementation in python:
Leave a comment -