Apache Pig: Macro for Splitting Data Into Training and Testing Datasets

Introduction: Apache Pig (> 0.7.0) comes with a handy operator, split, that separates a relation into two or more relations. For instance, say we have a website's “users” data and, depending on the age of a user, we want to create three different datasets: kids, adults, and seniors. This can be achieved with a single command, in a single map/reduce job, using the split operator as shown below.

split users into kids if age < 18,
                 adults if age >= 18 and age < 65,
                 seniors otherwise;

Problem: However, if you are trying to randomly split the data into training and testing datasets, you can’t use the split operator directly, because it cannot handle nondeterministic functions (such as RANDOM). Thus the command below won’t work and will raise an error:

split data into testing if RANDOM() <= 0.10,
                training otherwise;

Solution: Below is a simple macro that still uses the split operator but sidesteps the nondeterministic function issue by first assigning a random value to each tuple and then splitting on that stored value. To make sure that the returned datasets have exactly the same schema as the input dataset, I am using a small trick: I add the random value as the first column, so that the final foreach can skip it with a positional column reference ($1.. selects every column from the second one onward).

-- Macro: split_into_training_testing
-- @param inputData(relation): Input dataset that needs to be
--                                  split into training and testing
-- @param split_percentage(double): Fraction of the original dataset
--                                  to place in the testing dataset.
--                                  split_percentage should be between 0 and 1
-- Returns two relations. The first (training) receives roughly a
-- (1 - split_percentage) fraction of the tuples; the second (testing)
-- receives roughly a split_percentage fraction.
DEFINE split_into_training_testing(inputData, split_percentage)
RETURNS training, testing {
    -- Prepend a random value in [0, 1) to every tuple.
    data = foreach $inputData generate RANDOM() as random_assignment, *;
    split data into testing_data if random_assignment <= $split_percentage, training_data otherwise;
    -- Drop the random column ($0) so the output schema matches the input.
    $training = foreach training_data generate $1..;
    $testing = foreach testing_data generate $1..;
};

-- Sample Usage
inData = load 'some_files.txt'...;
training, testing = split_into_training_testing(inData, 0.1);
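
Because each tuple is assigned independently at random, the two relations only approximate the requested proportions. A minimal sketch of how the result could be persisted and sanity-checked is shown below. It assumes the script runs in batch mode, and the output paths and the *_grouped and *_count aliases are hypothetical names chosen for illustration.

```
-- Sketch only: output paths and helper aliases are hypothetical.
-- Count the tuples in each split to compare against the requested fraction.
training_grouped = group training all;
training_count = foreach training_grouped generate COUNT_STAR(training);
testing_grouped = group testing all;
testing_count = foreach testing_grouped generate COUNT_STAR(testing);

-- Persist the splits and their counts from the same script run.
store training into 'output/training';
store testing into 'output/testing';
store training_count into 'output/training_count';
store testing_count into 'output/testing_count';
```

As far as I understand, using store for all outputs in one script is safer here than issuing several dump commands, since each dump may re-run the pipeline with fresh random values and report counts that do not add up to the input size.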

About Ritesh Agrawal

I am an applied researcher who enjoys anything related to statistics, large data analysis, data mining, machine learning and data visualization.

3 Responses to Apache Pig: Macro for Splitting Data Into Training and Testing dataset

  1. saeed says:

    Thank you very much.
    Is there any UDF for data mining tasks such as classification or regression, like Mahout?

  2. Artem Egorov says:

    Why don't you just use the SAMPLE function?
