
Bootstrapping at scale in Snowflake


Sometimes when statisticians are sampling data, they like to put each one back before grabbing the next. This is called random sampling with replacement, as opposed to random sampling without replacement, which is the more common thing.

An example of when this is done in the wild is during the training of a Random Forest, and this was the trigger for me writing this post.

All of the Snowflake built-in sampling functions revolve around sampling without replacement. This means you only have to make a decision about each row once — it’s either in or out according to chance.
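For reference, the built-in (without replacement) flavour on the table we'll use later looks something like this:

select *
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sample bernoulli (60); -- keep each row with 60% probability, at most once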

The difficulty with bootstrapping is that it’s a bit like taking up cycling as a hobby; to do it correctly is rather expensive.

A naive, but totally accurate, way to bootstrap 10,000 rows would be:
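(Sketched roughly, since the idea is just to grab one random row at a time and append the draws:)

select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sample (1 rows)
union all
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sample (1 rows)
union all
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sample (1 rows)
-- ... and so on, until there are 10,000 single-row samples in total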

This gives all 150,000 customers in the source table an equal chance of being picked each time, no matter how many times they’ve been selected already.

There must be a better way! The answer, of course, is that we need a “good enough” alternative. We’re sampling after all, so the level of accuracy is usually flexible if it means performance gains.

So let’s see if we can come up with a different approach.

The goal will be to bootstrap 60% of the 150,000 records in the "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" table.

We’d like to do a single pass over the table (in the algorithmic sense), making a decision about each row just once, in isolation, if possible. That means we need a way for a row to be selected potentially multiple times while still staying within the parameters of the sample.

A simple modification: if we decided arbitrarily that ten sweeps of the table was good enough, we could just pick 6% each time, like this:
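(Again sketched roughly: ten independent 6% samples of the same table, appended together.)

select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sample bernoulli (6)
union all
select * from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" sample bernoulli (6)
-- ... eight more copies of the same select, for ten sweeps in total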

This isn’t very good at all. For starters, we dealt with every row ten times, and we haven’t really quantified the inaccuracy; we just made up a number.

So let’s have a look at what happens for each row each time we do an iteration, and we’ll consider the two extremes: always getting selected, and never getting selected.

[Chart: for each iteration count, the probability of always being selected and the probability of never being selected]

After a certain number of iterations, there’s inevitably a point where (as we’d say in Australia) each record has had a “fair go”. So we need a way of drawing that line. For the purpose of this exercise, let’s allow the user to specify two things:

  1. The desired size of the sample, as a percentage of the original table
  2. A percentage value, where we stop iterating when the probability of never being selected dips below it. Imagine this as a horizontal line somewhere on the chart above.

Equation time!

As a reminder, the goal is a 60% sample.

Forgetting the “with replacement” part for a second, the probability of each row getting selected every time:

P(selected every time over n iterations) = 0.6^n

Or of not getting selected, just flip it around:

P(never selected over n iterations) = (1 - 0.6)^n = 0.4^n < threshold

The value on the right-hand side is what we want the user to specify. As you can see, the left-hand side always decreases as iterations are added, so we keep adding iterations until it dips below that threshold.

So we can express this generally, then solve it to figure out how many iterations we should do:

(1 - s)^n < t,   so   n = ceiling( ln(t) / ln(1 - s) )
(where s is the desired sample fraction and t is the user-specified threshold)

Great, now we have a formula to use.
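To make that concrete with the inputs used below: for a 60% sample and a threshold of 0.00001, n = ceiling( ln(0.00001) / ln(0.4) ) = ceiling(12.57) = 13 iterations.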

Now, let’s get back to Snowflake. Remember, we don’t actually want to do multiple passes over the table. But the tricky part is that there isn’t any way to make a row appear more than once in a result set… or is there?!

Enter UDTFs

I really like UDTFs; they have a way of enabling elegant solutions instead of awful boilerplate.

Check this one out:
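(The definition below is a sketch: the function name, signature and output column are taken from how it’s used further down, while the argument names and the JavaScript body are a reconstruction of the behaviour described next.)

create or replace function SAMPLE_PROBABILITY(SAMPLE_FRACTION double, NEVER_SELECTED_THRESHOLD double)
returns table (SAMPLE_PROBABILITY double)
language javascript
as
$$
{
    processRow: function (row, rowWriter, context) {
        // Smallest n such that (1 - SAMPLE_FRACTION)^n dips below the threshold
        var iterations = Math.ceil(
            Math.log(row.NEVER_SELECTED_THRESHOLD) / Math.log(1 - row.SAMPLE_FRACTION)
        );
        // Spread the overall sample fraction evenly across those iterations...
        var perIteration = row.SAMPLE_FRACTION / iterations;
        // ...and emit one row per iteration, carrying the per-iteration probability
        for (var i = 0; i < iterations; i++) {
            rowWriter.writeRow({SAMPLE_PROBABILITY: perIteration});
        }
    }
}
$$;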

We can use this to kill two birds with one stone. First of all, it will apply that formula to work out a good number of iterations (i.e. number of chances to be selected) based on that threshold we invented.

Then, it is courteous enough to divide the sample fraction by that number, and write out that number of rows.

All we need to do is join to this UDTF like so:
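(This is the full query; its pieces are walked through below.)

SELECT customer.*
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" customer,
     table(SAMPLE_PROBABILITY(0.6::double, 0.00001::double))
WHERE uniform(0::float, 1::float, random()) < SAMPLE_PROBABILITY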

To work through this example, first look at the UDTF output on its own:

select * from table(SAMPLE_PROBABILITY(0.6::double,0.00001::double))

[Result: 13 rows, each with SAMPLE_PROBABILITY = 0.04615384615]

The formula says we should do 13 iterations, and 0.6/13 = 0.04615384615

In other words, we will give each row 13 chances at 0.04615384615 probability per try.

So the raw result of:

SELECT customer.*
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" customer,
     table(SAMPLE_PROBABILITY(0.6::double, 0.00001::double))

is actually 150,000 * 13 = 1,950,000 rows, but then we add the WHERE clause:

where uniform(0::float, 1::float, random()) < SAMPLE_PROBABILITY

This generates a random number between 0.0 and 1.0 for each row, and keeps only the rows where that number falls below the per-iteration probability, discarding the rest.

A 100% correct implementation should give us 90,000 rows (150k * 0.6). This one (with the inputs I chose) seems to yield somewhere between 89,000 and 91,000 each time.
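(If you want to verify the size, just count the same query:)

SELECT count(*)
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" customer,
     table(SAMPLE_PROBABILITY(0.6::double, 0.00001::double))
WHERE uniform(0::float, 1::float, random()) < SAMPLE_PROBABILITY;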

And of course, just as we wanted, there is a chance of duplicates. You can even see one in the first page of results.
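If you’d rather not rely on spotting them by eye, grouping the sample on the customer key surfaces the duplicates:

WITH bootstrap_sample AS (
    SELECT customer.*
    FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" customer,
         table(SAMPLE_PROBABILITY(0.6::double, 0.00001::double))
    WHERE uniform(0::float, 1::float, random()) < SAMPLE_PROBABILITY
)
SELECT c_custkey, count(*) AS times_sampled
FROM bootstrap_sample
GROUP BY c_custkey
HAVING count(*) > 1
ORDER BY times_sampled DESC;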

