reservoir sampling

I learned a nice algorithm for choosing a random, evenly distributed sample along a stream of unknown size, or one that you don't have random access to. It originally showed up as a question from Amazon on careercup.com.

The näive algorithm (and the only solution i could come up with on my own) requires two passes, one to get the number of elements n and a second to pluck the k random elements from the stream.

The best explanation I could find was from this blog: http://gregable.com/2007/10/reservoir-sampling.html. Begin by filling the output buffer with the first k items of the stream. Then iterate through the entire stream, and replace a random element in the output buffer with the current element at a probability of k/s where s is the index of the current element beginning from 1.

Here's the implementation in python:


	        

Leave a comment -

Elliot

https://www.emikek.com

I'm a .NET, PHP, Python, and Javascript developer. Also Project Manager, Scrum Master, and Juilliard trained violist. You cannot connect with me on Facebook, Twitter, or LinkedIn so don't ask.