Sampling With Replacement In Presto

While analyzing an experiment data, I encountered an interesting brain teaser. I wanted to use the bootstrap method and for that needed to sample my data with replacement. After some iterations, I got the perfect way to do it. The trick is using UNNEST with sequence operation to duplicate the data. Below is an example.

In the below example, I assume we have two different treatment groups (A and B), and we need 3 samples of data. In each sample, we need 40% of users i.e. (2 of 5 users from each treatment group).

The most important step is UNNEST("sequence"(0, 3, 1)) T1(x) which duplicates each user 3 times. Then we use NTILE operator to split the data into 5 groups and thereafter select two buckets for each group. Since there are only 5 users per treatment group, each tile will only have one user. If you have more than 100 users, then set the number of tiles to 100 and replace tile < 3 with tile < 40 to select 40% of users. You can read more about how to use NTILE operator for sampling in my earlier blog over here

WITH dataset AS (
    SELECT *
    FROM (
        VALUES
            (1, 'A'), 
            (2, 'A'), 
            (3, 'A'),
            (4, 'A'),
            (5, 'A'),            
            (6, 'B'),
            (7, 'B'),
            (8, 'B'),            
            (9, 'B'),            
            (10, 'B')           
    ) AS t(user_uuid, treatment_group)
)

SELECT
    user_uuid
    , treatment_group 
    , iterNo    
FROM 
(
    SELECT 
        user_uuid
        , treatment_group 
        , iterNo
        , NTILE(5) OVER (PARTITION BY treatment_group, iterNo ORDER BY rnd) as tile
    FROM 
    (
        SELECT 
            user_uuid
            , treatment_group
            , x as iterNo
            , RAND() as rnd 
        FROM dataset
        CROSS JOIN UNNEST("sequence"(0, 3, 1)) T1(x)
    ) A 
) B 
WHERE tile < 3