While analyzing an experiment data, I encountered an interesting brain teaser. I wanted to use the bootstrap method and for that needed to sample my data with replacement. After some iterations, I got the perfect way to do it. The trick is using UNNEST with sequence operation to duplicate the data. Below is an example.
In the below example, I assume we have two different treatment groups (A and B), and we need 3 samples of data. In each sample, we need 40% of users i.e. (2 of 5 users from each treatment group).
The most important step is
UNNEST("sequence"(0, 3, 1)) T1(x) which duplicates each user 3 times. Then we use NTILE operator to split the data into 5 groups and thereafter select two buckets for each group. Since there are only 5 users per treatment group, each tile will only have one user. If you have more than 100 users, then set the number of tiles to 100 and replace tile < 3 with tile < 40 to select 40% of users. You can read more about how to use NTILE operator for sampling in my earlier blog over here
WITH dataset AS ( SELECT * FROM ( VALUES (1, 'A'), (2, 'A'), (3, 'A'), (4, 'A'), (5, 'A'), (6, 'B'), (7, 'B'), (8, 'B'), (9, 'B'), (10, 'B') ) AS t(user_uuid, treatment_group) ) SELECT user_uuid , treatment_group , iterNo FROM ( SELECT user_uuid , treatment_group , iterNo , NTILE(5) OVER (PARTITION BY treatment_group, iterNo ORDER BY rnd) as tile FROM ( SELECT user_uuid , treatment_group , x as iterNo , RAND() as rnd FROM dataset CROSS JOIN UNNEST("sequence"(0, 3, 1)) T1(x) ) A ) B WHERE tile < 3