In random sampling, the probability of selecting any given row is the same; in other words, all rows are equally weighted. As shown below, random sampling can be easily achieved in Presto using the TABLESAMPLE operator along with the BERNOULLI sampling method.
WITH dataset AS (
  SELECT * FROM (
    VALUES
      (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'),
      (5, 'A'), (6, 'B'), (7, 'C'),
      (9, 'A'), (10, 'B'), (11, 'C'),
      (13, 'A'), (14, 'B'), (15, 'C'),
      (17, 'A'), (18, 'B'), (19, 'C'),
      (21, 'A'), (22, 'B'), (23, 'C'),
      (25, 'A'), (26, 'B')
  ) AS t
)
SELECT *
-- assuming we want to sample 25% of records
FROM dataset
TABLESAMPLE BERNOULLI(25)
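The Bernoulli idea itself is easy to sketch outside of Presto: each row is kept independently with probability p. Below is a minimal pandas/NumPy sketch; the dataframe and seed are illustrative assumptions, not part of the Presto example.

```python
import numpy as np
import pandas as pd

# Bernoulli sampling: keep each row independently with probability p
df = pd.DataFrame({"number": range(1, 100001)})
p = 0.25
rng = np.random.default_rng(42)

# draw one uniform random number per row; keep the row if it falls below p
sample = df[rng.random(len(df)) < p]
```

Note that because each row is kept independently, the sample size is itself random; it is only close to p times the table size for large inputs.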
Since all rows are equally weighted, one of the problems with random sampling is that we might not see rare events in our sample data. For instance, above there is only one record related to letter 'D', and most likely it won't appear in our sampled data. This is where stratified sampling comes in handy.
The idea of stratified sampling is to partition the data into different groups and then select records from each of these groups. There can be many different strategies for deciding how many records to select from each group. Below are two different strategies:
Assume the same dataset as above (containing letters and numbers); we want to select 3 rows from each letter group. The sampled dataset should thus contain 3 random records from groups A, B, and C, as well as the only record in group D.
WITH dataset AS (
  SELECT * FROM (
    VALUES
      (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'),
      (5, 'A'), (6, 'B'), (7, 'C'),
      (9, 'A'), (10, 'B'), (11, 'C'),
      (13, 'A'), (14, 'B'), (15, 'C'),
      (17, 'A'), (18, 'B'), (19, 'C'),
      (21, 'A'), (22, 'B'), (23, 'C'),
      (25, 'A'), (26, 'B')
  ) AS t(number, letter)
)
SELECT letter, number
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY letter ORDER BY rnd) AS rnk
  FROM (
    SELECT letter, number, RANDOM() AS rnd
    FROM dataset
  ) bucketed
) sampled
-- assuming we want 3 records from each group
WHERE rnk <= 3
The idea above is to assign a random value to each record and then select top N records in each group based on the random value.
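The same random-value-plus-rank idea can be sketched in pandas; the dataframe below is a made-up stand-in for the SQL dataset.

```python
import numpy as np
import pandas as pd

# assign a random value to each record, then keep the top N per group
df = pd.DataFrame({
    "number": [1, 2, 3, 4, 5, 6, 7, 9, 10, 11],
    "letter": ["A", "B", "C", "D", "A", "B", "C", "A", "B", "C"],
})

rng = np.random.default_rng(0)
df["rnd"] = rng.random(len(df))
# rank records within each letter group by their random value
df["rnk"] = df.groupby("letter")["rnd"].rank(method="first")
# keep the top 3 records of each group
sampled = df[df["rnk"] <= 3].drop(columns=["rnd", "rnk"])
```

Groups with fewer than N rows (like 'D' here) are returned in full, which is exactly the behavior we wanted for rare groups.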
The problem becomes a little tricky if we intend to select a fixed proportion of records from each group. One naive solution is to use two queries: first compute the number of records in each group, and then use this information to select the top N records from each group (just like in 2.1), where N changes for each group. But this requires two passes through the data, which can be computationally expensive.
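For concreteness, the naive two-pass approach might look like this in pandas; the 25% proportion and the dataframe are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "number": list(range(1, 13)),
    "letter": ["A"] * 8 + ["B"] * 4,
})

# pass 1: count records per group
sizes = df.groupby("letter").size()
# desired sample size per group: 25% of each group, at least 1 record
n_per_group = (sizes * 0.25).round().astype(int).clip(lower=1)

# pass 2: draw that many records from each group
sampled = pd.concat([
    group.sample(n=n_per_group[name], random_state=0)
    for name, group in df.groupby("letter")
])
```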
However, if we assume/confirm that there are more than 100 records in each group, then we can select X% of records from each group in a single pass using the NTILE function, as shown below.
WITH dataset AS (
  SELECT * FROM (
    VALUES
      (1,'A'), (2,'A'), ..., (100,'A'),
      (1,'B'), (2,'B'), ..., (100,'B'),
      (1,'C'), (2,'C'), ..., (100,'C'),
      (1,'D'), (2,'D'), ..., (100,'D')
  ) AS t(number, letter)
)
SELECT letter, number
FROM (
  SELECT *, NTILE(100) OVER (PARTITION BY letter ORDER BY rnd) AS tile
  FROM (
    SELECT d.letter, number, RANDOM() AS rnd
    FROM dataset d
  ) bucketed
) sampled
-- assuming we want 10% of records from each letter group
WHERE tile <= 10
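An equivalent single-pass sketch in pandas, emulating NTILE(100) by spreading the within-group rank of a random value across 100 buckets; the dataset and 10% threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# 100 records in each of the four letter groups
df = pd.DataFrame({
    "letter": np.repeat(list("ABCD"), 100),
    "number": np.tile(np.arange(1, 101), 4),
})

rng = np.random.default_rng(0)
df["rnd"] = rng.random(len(df))

# NTILE(100) equivalent: rank rows within each group by the random value,
# then map ranks 1..n onto tiles 1..100
rank = df.groupby("letter")["rnd"].rank(method="first").astype(int)
size = df.groupby("letter")["letter"].transform("size")
df["tile"] = (rank - 1) * 100 // size + 1

# keep tiles 1..10, i.e. roughly 10% of each group
sampled = df[df["tile"] <= 10]
```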
Below is an example of how to use the "phone_notification" magic function. Once the cell execution completes, it sends an iMessage to your iPhone indicating whether the job completed successfully or failed. If the job failed, the iMessage includes the error message. You can find a sample notebook showing how to use this magic function within the IPython Notebook over here.
%%phone_notification -p 1XXXXXXXXXX -m "Test Job Done"
import time
time.sleep(1)
print "done"
Note that the magic function will only work if you are using a MacBook and have an iPhone. However, it should be pretty easy to adapt for the Android ecosystem. I hope you will enjoy the phone notification. Let me know what you think about it.
Both of the above requests can easily be satisfied using functional programming ideas. Below is an example of a UDF that converts scores (between 0 and 100) into ordinal categories. It takes one parameter: an array of tuples defining the boundary conditions for the different categories.
Below we define a score_to_category function that accepts the boundary conditions as one of its input parameters. The function itself doesn't do anything but return another function that takes a particular score value and returns the appropriate category.
def score_to_category(boundaries):
    """
    Converts a numeric score into an ordinal category.

    :param boundaries: list of tuples specifying the lower limit and category
        name, e.g. [(0, 'D'), (30, 'C'), (50, 'B'), (80, 'A')]
    :return: a function that accepts a score as its argument
    """
    sorted_boundaries = sorted(boundaries, key=lambda x: x[0], reverse=True)

    def _score_to_category(score):
        """
        Converts a score to an ordinal category.

        :param score: numeric score between 0 and 100
        :return: category name
        """
        assert 0 <= score <= 100
        for (boundary, category) in sorted_boundaries:
            if score >= boundary:
                return category

    return _score_to_category


# Test function. Demonstrates using the function in normal Python code
boundaries = [(0, 'F'), (50, 'D'), (60, 'C'), (75, 'B'), (90, 'A')]
converter = score_to_category(boundaries)
assert converter(10) == 'F'
assert converter(50) == 'D'
assert converter(51) == 'D'
assert converter(100) == 'A'
Also, notice there is nothing Spark-specific about the above function, so we can easily use it in any Python script; the assertions at the bottom show example usage.
To demonstrate that we can pass different boundary conditions and get different results, below I have defined two different UDFs. The first UDF produces finer-grained categories; the second converts scores into a Pass/Fail category.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# First UDF: finer-grained categories
boundaries = [(0, 'F'), (50, 'D'), (60, 'C'), (75, 'B'), (90, 'A')]
udfScoreToFineCategories = udf(score_to_category(boundaries), StringType())

# Second UDF: broad Pass/Fail categories
boundaries = [(0, 'Fail'), (50, 'Pass')]
udfScoreToBroadCategories = udf(score_to_category(boundaries), StringType())
Now let's test the UDFs on some dummy data.
# Generate random data and convert it into a Spark dataframe

# Generate data
import itertools
import random

students = ['John', 'Mike', 'Matt']
subjects = ['Math', 'Sci', 'Geography', 'History']
random.seed(1)
data = []
for (student, subject) in itertools.product(students, subjects):
    data.append((student, subject, random.randint(0, 100)))

# Create schema object
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("student", StringType(), nullable=False),
    StructField("subject", StringType(), nullable=False),
    StructField("score", IntegerType(), nullable=False)
])

# Create DataFrame
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)

# Apply UDFs
(df
 .withColumn("fine_category", udfScoreToFineCategories("score"))
 .withColumn("broad_category", udfScoreToBroadCategories("score"))
).show(10)
Running the above code gives the following output:

| student | subject | score | fine_category | broad_category |
|---|---|---|---|---|
| John | Math | 13 | F | Fail |
| John | Sci | 85 | B | Pass |
| John | Geography | 77 | B | Pass |
| John | History | 25 | F | Fail |
| Mike | Math | 50 | F | Fail |
| Mike | Sci | 45 | F | Fail |
| Mike | Geography | 65 | C | Pass |
| Mike | History | 79 | B | Pass |
| Matt | Math | 9 | F | Fail |
| Matt | Sci | 2 | F | Fail |
Below are a few tips on making HiveQL DRY.
Macros allow us to assign an alias to reusable processing logic that can be expressed in SQL. In simple terms, it's like defining a function purely in SQL (although it doesn't operate that way; Hive performs an inline expansion, but we don't have to worry about that for now).
For instance, in the table below we have two duration fields where the duration value is expressed in different units (milliseconds, seconds, minutes, etc.).
| UUID | duration1 | duration2 |
|---|---|---|
| 1 | 10ms | 20us |
| 2 | 16s | 20ms |
| 3 | 5m | 2us |
Below is a typical way of writing the HiveQL for this. It's bad because we have duplicated (once for each field) the logic of converting a duration expressed as a string into a duration in seconds. Any time we change that logic, we have to make sure to update it everywhere in the code.
SELECT
  UUID,
  CASE
    WHEN duration1 like '%us' THEN CAST(REPLACE(duration1, 'us', '') AS DOUBLE) / 1.0E6
    WHEN duration1 like '%ms' THEN CAST(REPLACE(duration1, 'ms', '') AS DOUBLE) / 1000.0
    WHEN duration1 like '%s' THEN CAST(REPLACE(duration1, 's', '') AS DOUBLE)
    WHEN duration1 like '%m' THEN CAST(REPLACE(duration1, 'm', '') AS DOUBLE) * 60
    ELSE NULL
  END as duration1_seconds,
  CASE
    WHEN duration2 like '%us' THEN CAST(REPLACE(duration2, 'us', '') AS DOUBLE) / 1.0E6
    WHEN duration2 like '%ms' THEN CAST(REPLACE(duration2, 'ms', '') AS DOUBLE) / 1000.0
    WHEN duration2 like '%s' THEN CAST(REPLACE(duration2, 's', '') AS DOUBLE)
    WHEN duration2 like '%m' THEN CAST(REPLACE(duration2, 'm', '') AS DOUBLE) * 60
    ELSE NULL
  END as duration2_seconds
FROM (
  SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
  UNION ALL
  SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
  UNION ALL
  SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
) A
A DRY way to rewrite the above query is to utilize a macro. We first define a macro, DURATION_IN_SECONDS, and then use it to convert all the duration fields, as shown below.
-- define macro to convert duration string to duration in seconds
CREATE TEMPORARY MACRO DURATION_IN_SECONDS (t string)
  CASE
    WHEN t like '%us' THEN CAST(REPLACE(t, 'us', '') AS DOUBLE) / 1.0E6
    WHEN t like '%ms' THEN CAST(REPLACE(t, 'ms', '') AS DOUBLE) / 1000.0
    WHEN t like '%s' THEN CAST(REPLACE(t, 's', '') AS DOUBLE)
    WHEN t like '%m' THEN CAST(REPLACE(t, 'm', '') AS DOUBLE) * 60
    ELSE NULL
  END;

SELECT
  UUID,
  -- use macro to convert first duration field
  DURATION_IN_SECONDS(duration1) duration1_seconds,
  -- use macro to convert second duration field
  DURATION_IN_SECONDS(duration2) duration2_seconds
FROM (
  SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2
  UNION ALL
  SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2
  UNION ALL
  SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2
) A
Below is another typical query. In the SQL below, we use TableC to filter TableA and TableB and then join the two together. The logic for filtering TableC itself has been duplicated.
SELECT *
FROM (
  SELECT TableA.*
  FROM TableA
  JOIN TableC ON (TableA.id = TableC.id)
  WHERE TableA.datestr >= '2017-01-01'
    -- filters on table C
    AND TableC.datestr >= '2017-01-01'
    AND TableC.status != 0
) A
JOIN (
  SELECT TableB.*
  FROM TableB
  JOIN TableC ON (TableB.id = TableC.id)
  WHERE TableB.datestr >= '2017-01-01'
    -- filters on table C
    AND TableC.datestr >= '2017-01-01'
    AND TableC.status != 0
) B
ON (A.id = B.id)
Here, using a "WITH" clause can help us make this query DRY. We first express the logic of filtering TableC and assign it an alias. Next, we join TableA and TableB to this alias.
-- express logic to filter table C over here
WITH FilteredTableC AS (
  SELECT *
  FROM TableC
  WHERE datestr >= '2017-01-01'
    AND status != 0
)
SELECT *
FROM (
  SELECT TableA.*
  FROM TableA
  JOIN FilteredTableC ON (TableA.id = FilteredTableC.id)
  WHERE TableA.datestr >= '2017-01-01'
) A
JOIN (
  SELECT TableB.*
  FROM TableB
  JOIN FilteredTableC ON (TableB.id = FilteredTableC.id)
  WHERE TableB.datestr >= '2017-01-01'
) B
ON (A.id = B.id)
The "WITH" clause not only helps make SQL DRY, but is also very useful for breaking a big query involving many joins into smaller, self-describing chunks. For instance, below is a query that joins three tables together. Even in this simple query it becomes difficult to understand the goal because of the list of filters applied to different tables.
SELECT drivers.*, riders.*
FROM trips
JOIN drivers ON drivers.driver_id = trips.driver_id
JOIN riders ON riders.rider_id = trips.rider_id
WHERE trips.datestr >= '2017-01-01'
  AND trips.status = 0
  AND trips.city = 'SF'
  AND drivers.joined >= '2017-01-01'
  AND drivers.status = 'active'
  AND riders.joined >= '2017-01-01'
  AND riders.name like 'XYZ%'
Using a "WITH" clause allows us to rewrite the above query in a much more legible way. Each table is separately filtered and assigned a readable alias, which is then used in the main query.
WITH SuccessfulTrips AS (
  SELECT * FROM trips
  WHERE trips.status = 0
    AND trips.datestr >= '2017-01-01'
),
ActiveDrivers AS (
  SELECT * FROM drivers
  WHERE drivers.status = 'active'
    AND drivers.joined >= '2017-01-01'
),
XYZRiders AS (
  SELECT * FROM riders
  WHERE riders.name like 'XYZ%'
    AND riders.joined >= '2017-01-01'
)
SELECT ActiveDrivers.*, XYZRiders.*
FROM SuccessfulTrips
JOIN ActiveDrivers ON (SuccessfulTrips.driver_id = ActiveDrivers.driver_id)
JOIN XYZRiders ON (SuccessfulTrips.rider_id = XYZRiders.rider_id)
Often we use the same constant values in multiple places. Instead of copying these constants all over the place, we can define a variable and use it instead.
SET start_date = '2017-01-01';
SET end_date = '2017-05-01';

SELECT A.*, B.*
FROM A
JOIN B ON (A.id = B.id)
WHERE A.datestr >= ${hiveconf:start_date}
  AND A.datestr <= ${hiveconf:end_date}
  AND B.datestr >= ${hiveconf:start_date}
  AND B.datestr <= ${hiveconf:end_date}
There are a few different options for setting variables in Hive. Make sure to read the comments on this Stack Overflow post.
Let's assume you have a model that can predict house prices. Naturally, you won't trust it unless you evaluate it and establish some confidence in its expected error. So, to start with, you feed in features (such as number of rooms, lot size, etc.) for a certain house and compare the predicted price (say 130K) to its actual price (say 120K). In this particular case we can say the model overestimated the price by 10K. But a single point is not sufficient to make a general claim about the accuracy or expected error of the model. So we feed in features for another 1000 houses and for each of them compute the error, i.e., the difference between the predicted and actual price.
From descriptive statistics we know there are different ways to summarize these 1000 error points. For instance, we can summarize the general tendency of the dataset with the mean or median, or even draw a boxplot to understand the distribution of the error.
Since we are interested in a numerical measure (rather than a visualization), using the mean to summarize all the observed errors makes sense.
However, there is a problem. What if the error is -10K (i.e., the model underestimates) for one house and +10K (it overestimates) for another? The mean error will then be 0, which intuitively doesn't make sense. It makes more sense to say that the expected error is 10K, i.e., to operate on the absolute error rather than on the signed (under/over estimate) error. Thus we have all the components of our first metric, Mean Absolute Error, so called because we take the mean of the absolute errors.
Now, we know that the mean is sensitive to outliers, so sometimes we use the median instead, and the metric is known as Median Absolute Error. The advantage of Mean/Median Absolute Error is that it's easy to make sense of the number. For instance, if the mean absolute error of a model is 20K and the predicted price is 200K, then the actual price is most likely between 180K and 220K.
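A quick numeric sketch of both metrics; the prices here are made up for illustration.

```python
import numpy as np

actual = np.array([120000.0, 200000.0, 150000.0, 310000.0])
predicted = np.array([130000.0, 190000.0, 155000.0, 300000.0])

# work with absolute (unsigned) errors so that over- and
# under-estimates do not cancel each other out
abs_errors = np.abs(predicted - actual)

mean_absolute_error = np.mean(abs_errors)      # sensitive to outliers
median_absolute_error = np.median(abs_errors)  # robust to outliers
```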
Data scientists are not only concerned with quantifying the error but are also interested in determining if the model can be improved. To answer this question let’s first establish the best and the worst models.
Best Model
Theoretically, the best model is one for which the absolute error is zero for all test cases. As shown in the graph below, if we plot absolute error on the x-axis and the cumulative percentage of houses on the y-axis, then a point such as (50K, 0.6) indicates that for 60% of houses the absolute error is less than or equal to 50K.
So given this graph, what will the best model look like?
Since its absolute error is always zero, its curve is simply a vertical line at 0 on the x-axis, extending to 100% on the y-axis.
Worst Model
Don't confuse the word "worst" with the word "dumb". Typically, for building a regression model we have a target variable (house price) and certain features or predictor variables, such as number of rooms, lot size, etc. But what if no features are available? For instance, suppose the only information provided is the house prices of 10K randomly selected houses. We can still build a model from this limited information: compute the mean house price over the 10K training samples and have the model simply return this mean value. Say the mean value is 215K. If we ask this model for the price of a house with a 5000 sq ft lot, it will simply return 215K. Let's call this the mean model.
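A minimal sketch of this mean model; the training prices below are made up.

```python
import numpy as np

# with no features available, the "mean model" always predicts the
# mean price seen in training
train_prices = np.array([200000.0, 230000.0, 215000.0])

def mean_model(features=None):
    # the input features (lot size, number of rooms, ...) are ignored entirely
    return train_prices.mean()
```

Whatever features you pass in, the prediction is the same: `mean_model({"lot_size": 5000})` returns the same value as `mean_model()`.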
Theoretically, it can be shown that when no other information is available, the mean model minimizes the squared error. Intuitively this makes sense, as we often fall back on the mean value when we have no other information. The graph below indicates what the curve for the mean model looks like.
Determining scope for improvement
From the above graph we can observe a few things. First, as our model becomes better, it moves towards the best model, so the area between the best model and our model decreases. On the other hand, the area between the worst (mean) model and our model increases. However, the total area, i.e., the area between the best and the worst model, remains the same. Let's call this area the improvement opportunity. As our model gets better, it covers more of this improvement opportunity area. This is exactly what the R2 metric captures: it indicates what portion of the total improvement opportunity our model covers.
Once we understand the above intuition, it's also easy to understand why there is often confusion about whether R2 ranges from 0 to 1 (as mentioned on Wikipedia) or can be negative (as in the sklearn library). If we interpret R2 as the fraction of the improvement-opportunity area our model covers, it will always be between 0 and 1. However, this doesn't tell us where our model stands in comparison to the mean model; it implicitly assumes our model is always better than the mean model and hence sits between the mean model and the best model.
But in practice it's possible that our model is worse than the mean model and falls on the right side of it. In that case, the area between our model and the best model is bigger than the total improvement opportunity, and hence R2 is negative.
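This intuition maps onto the usual R2 formula: 1 minus the ratio of our model's squared error to the mean model's squared error. The small sketch below (numbers are made up) also shows how a model worse than the mean model yields a negative R2.

```python
import numpy as np

def r2(actual, predicted):
    ss_res = np.sum((actual - predicted) ** 2)       # our model's error
    ss_mean = np.sum((actual - actual.mean()) ** 2)  # mean model's error
    # fraction of the improvement opportunity our model covers
    return 1.0 - ss_res / ss_mean

actual = np.array([100.0, 200.0, 300.0])

perfect = r2(actual, actual)                        # best model
baseline = r2(actual, np.full(3, actual.mean()))    # mean model
bad = r2(actual, np.array([300.0, 100.0, 200.0]))   # worse than mean model
```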
I hope now we can appreciate the beauty of R2 and understand the intuition behind it.
Luckily, using the IPython Notebook you can have the best of both worlds. The IPython notebook (specifically the rpy2 package) allows you to seamlessly transfer objects between the Python and R environments. Below is a brief explanation and code snippet showing how data generated/processed in Python can be visualized using R's ggplot. [In a hurry? Sample notebook over here.]
Step 1: Load R Kernel within IPython using the rpy2 package
Previously, communication between R and the IPython notebook was handled by the rmagic extension. Now most of this logic has been abstracted into its own Python package known as rpy2. You can install rpy2 using the following command: pip install rpy2 --upgrade. Once rpy2 is installed, you can initialize the R kernel within the IPython Notebook using the rpy2.ipython extension, as shown below.
%load_ext rpy2.ipython
Step 2: Convert Data To Pandas Dataframe
If you already have some data available as a pandas dataframe, feel free to use it in the next step (and skip this step). If not, let's randomly draw 5000 points from a normal distribution using numpy and convert them into a pandas dataframe. In the next step we will pass this dataframe to R's ggplot library and plot the density curve.
import pandas as pd
import numpy as np

data = np.random.randn(5000, 1)
df = pd.DataFrame(data, columns=["value"])
Step 3: Using the %%R cell magic function
Finally, use the %%R cell magic function and pass df (the Python object pointing to the pandas dataframe) using the -i parameter. The rpy2 package will make it available within R's environment by applying the necessary transformations. Now we can do anything with this data, including visualizing it using R's ggplot library.
%%R -i df -w 800 -h 480 -u px
library(ggplot2)
ggplot(df) + geom_density(aes(x=value))
Some of the important parameters that can be passed to the %%R magic function are -i (input object), -w and -h (figure width and height), and -u (units, as used in the snippet above).
Full documentation on parameters can be found over here.
Reference:
1. Revolution Analytics’ Blog On Using R With Jupyter Notebook
2. Stack Overflow
%run -i: Running another notebook in the context of the current Python kernel

There are always a few classes/functions that you want to use across different notebooks. You can keep these common functions in a notebook (say common.ipynb) and run it in the context of the current notebook by invoking the following command in your existing notebook.
%run -i common.ipynb
Ever wondered how much longer your iterator will take to complete? There are multiple ways to easily add a progress bar to your iterators.
Option 1: Using inbuilt IPython utilities
from time import sleep
from ipywidgets import FloatProgress
from IPython.display import display

MAX_VALUE = 100
f = FloatProgress(min=0, max=MAX_VALUE)
display(f)  # render the progress bar in the notebook
for i in xrange(MAX_VALUE):
    sleep(0.1)
    f.value = i  # increment value of the progress bar
Option 2: Using tqdm library
I prefer tqdm as it makes adding a progress bar a breeze.
from tqdm import tqdm
from time import sleep

for i in tqdm(range(100)):
    sleep(0.1)
Below is a simple example showing how to write unit tests in the IPython notebook. The main lines are the two at the bottom, where we load the unit test suite and run it.
import unittest

# Define Person class
class Person(object):
    def __init__(self, name, age):
        self.__name = name
        self.__age = age

    @property
    def name(self):
        return self.__name

    @property
    def age(self):
        return self.__age

    def __str__(self):
        return "{} ({})".format(self.name, self.age)

    def __eq__(self, other):
        return self.name == other.name and self.age == other.age

# Define unit test
class PersonTest(unittest.TestCase):
    def test_initialization(self):
        p1 = Person("xyz", 10)
        self.assertEqual("xyz", p1.name)
        self.assertEqual(10, p1.age)

    def test_equality(self):
        p1 = Person("xyz", 10)
        p2 = Person("xyz", 10)
        self.assertEqual(p1, p2)

# Run unit test
suite = unittest.TestLoader().loadTestsFromTestCase(PersonTest)
unittest.TextTestRunner().run(suite)
I love R's ggplot package for visualizing data. Luckily, using the IPython notebook I can do all the data processing in Python and visualize it using R's ggplot. Check out more over here.
Of all the text editors, I find Sublime to be the best. It offers many features, and the best one is multi-cursor editing. Interestingly, you can have the same editing capability within your IPython notebook. As mentioned over here, you need to add some code to your custom.js file. If you are missing the custom.js file, follow this link and complete the "hello world…" exercise to make sure that custom.js is being loaded properly.
Equation 1 is the multivariate Gaussian density

$$p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right) \quad (1)$$

and equation 2 is the product of $k$ independent univariate Gaussians

$$p(x) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\!\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right) \quad (2)$$

The covariance matrix of a dataset with independent features is a diagonal matrix, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_k^2)$. For a diagonal matrix we can easily show that

$$|\Sigma| = \prod_{i=1}^{k} \sigma_i^2 \quad (3) \qquad \Sigma^{-1} = \mathrm{diag}\!\left(\tfrac{1}{\sigma_1^2}, \ldots, \tfrac{1}{\sigma_k^2}\right) \quad (4)$$

Using the above two properties of the diagonal matrix, we can show that equation 1 is essentially the same as equation 2 when the features are independent. Let's first tackle the normalizing constant in equation 1. Since the determinant of a diagonal matrix is equal to the product of its diagonal elements (property 3), we can rewrite

$$\frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma_i} \quad (5)$$

Now let's focus on the exponential part in equation 1. Using property 4, we can show that the quadratic form reduces to a sum:

$$(x-\mu)^{T}\Sigma^{-1}(x-\mu) = \sum_{i=1}^{k} \frac{(x_i-\mu_i)^2}{\sigma_i^2} \quad (6)$$

Now the exponential of a sum can be written as a product of exponentials. Thus

$$\exp\!\left(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right) = \prod_{i=1}^{k} \exp\!\left(-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right) \quad (7)$$

Substituting 5 and 7 into equation 1, we get equation 2. Hence proved.
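A quick numeric sanity check of this factorization; the particular means, variances, and evaluation point are arbitrary.

```python
import numpy as np

# with a diagonal covariance matrix, the multivariate Gaussian density
# should equal the product of the univariate densities
mu = np.array([1.0, -2.0])
variances = np.array([0.5, 2.0])
Sigma = np.diag(variances)
x = np.array([0.3, -1.1])

k = len(x)
diff = x - mu

# full multivariate form
mvn = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
       / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma)))

# product of univariate Gaussians
uni = np.prod(np.exp(-diff ** 2 / (2 * variances))
              / np.sqrt(2 * np.pi * variances))
```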
In order to write a custom UDAF you need to extend the UserDefinedAggregateFunction class and define the following four methods:
- initialize — On a given node, this method is called once for each group.
- update — For a given group, Spark will call "update" for each input record of that group.
- merge — If the function supports partial aggregates, Spark might (as an optimization) compute partial results and combine them together.
- evaluate — Once all the entries for a group are exhausted, Spark will call evaluate to get the final result.

Depending on whether the function supports the combiner option or not, the order of execution can vary in the following two ways:
If the function supports partial aggregates, Spark may run initialize and update on each partition, merge the partial buffers together, and then call evaluate; otherwise it runs initialize, then update for every record of the group, and finally evaluate.
You can read more about the execution pattern in my earlier blog on custom UDAF in hive.
Apart from defining the above four methods, you also need to define the input, intermediate, and final data types. Below is an example showing how to write a custom function that computes the mean.
package com.myuadfs

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

/**
 * Created by ragrawal on 9/23/15.
 * Computes Mean
 */
// Extend UserDefinedAggregateFunction to write a custom aggregate function.
// You can also specify any constructor arguments. For instance you
// can have CustomMean(arg1: Int, arg2: String)
class CustomMean() extends UserDefinedAggregateFunction {

  // Input Data Type Schema
  def inputSchema: StructType = StructType(Array(StructField("item", DoubleType)))

  // Intermediate Schema
  def bufferSchema = StructType(Array(
    StructField("sum", DoubleType),
    StructField("cnt", LongType)
  ))

  // Returned Data Type
  def dataType: DataType = DoubleType

  // Self-explaining
  def deterministic = true

  // This function is called whenever key changes
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = 0.toDouble // set sum to zero
    buffer(1) = 0L         // set number of items to 0
  }

  // Iterate over each entry of a group
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    buffer(0) = buffer.getDouble(0) + input.getDouble(0)
    buffer(1) = buffer.getLong(1) + 1
  }

  // Merge two partial aggregates
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  // Called after all the entries are exhausted
  def evaluate(buffer: Row) = {
    buffer.getDouble(0) / buffer.getLong(1).toDouble
  }
}
Below is the code that shows how to use the UDAF with a dataframe.
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
import com.myudafs.CustomMean

// define UDAF
val customMean = new CustomMean()

// create test dataset
val data = (1 to 1000).map{x: Int => x match {
  case t if t <= 500 => Row("A", t.toDouble)
  case t => Row("B", t.toDouble)
}}

// create schema of the test dataset
val schema = StructType(Array(
  StructField("key", StringType),
  StructField("value", DoubleType)
))

// construct data frame
val rdd = sc.parallelize(data)
val df = sqlContext.createDataFrame(rdd, schema)

// Calculate average value for each group
df.groupBy("key").agg(
  customMean(df.col("value")).as("custom_mean"),
  avg("value").as("avg")
).show()
Output should be:

| key | custom_mean | avg |
|---|---|---|
| A | 250.5 | 250.5 |
| B | 750.5 | 750.5 |
There are a few shortcomings of the UserDefinedAggregateFunction class.
As a motivating example, assume we are given some student data containing the student's name, subject, and score, and we want to convert the numerical score into ordinal categories based on the following logic: a score of 80 or above maps to 'A', 60-79 to 'B', 35-59 to 'C', and anything below 35 to 'D'.
Below is the relevant python code if you are using pyspark.
# Generate Random Data
import itertools
import random

students = ['John', 'Mike', 'Matt']
subjects = ['Math', 'Sci', 'Geography', 'History']
random.seed(1)
data = []
for (student, subject) in itertools.product(students, subjects):
    data.append((student, subject, random.randint(0, 100)))

# Create Schema Object
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
schema = StructType([
    StructField("student", StringType(), nullable=False),
    StructField("subject", StringType(), nullable=False),
    StructField("score", IntegerType(), nullable=False)
])

# Create DataFrame
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)

# Define udf
from pyspark.sql.functions import udf

def scoreToCategory(score):
    if score >= 80:
        return 'A'
    elif score >= 60:
        return 'B'
    elif score >= 35:
        return 'C'
    else:
        return 'D'

udfScoreToCategory = udf(scoreToCategory, StringType())
df.withColumn("category", udfScoreToCategory("score")).show(10)
The first part of the code is basic Python: we generate a random dataset that looks something like this:
| student | subject | score |
|---|---|---|
| John | Math | 13 |
| … | … | … |
| Mike | Sci | 45 |
| Mike | Geography | 65 |
| … | … | … |
The next block deals with constructing the dataframe from the generated data. The main part of the code is the UDF: we first define scoreToCategory in a normal Python way, then wrap it with udf and apply it to the score column.
Below is a Scala example of the same:
// Construct Dummy Data
import util.Random
import org.apache.spark.sql.Row

implicit class Crossable[X](xs: Traversable[X]) {
  def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x, y)
}

val students = Seq("John", "Mike", "Matt")
val subjects = Seq("Math", "Sci", "Geography", "History")
val random = new Random(1)
val data = (students cross subjects).map{x => Row(x._1, x._2, random.nextInt(100))}.toSeq

// Create Schema Object
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
val schema = StructType(Array(
  StructField("student", StringType, nullable = false),
  StructField("subject", StringType, nullable = false),
  StructField("score", IntegerType, nullable = false)
))

// Create DataFrame
import org.apache.spark.sql.hive.HiveContext
val rdd = sc.parallelize(data)
val df = sqlContext.createDataFrame(rdd, schema)

// Define udf
import org.apache.spark.sql.functions.udf
def udfScoreToCategory = udf((score: Int) => {
  score match {
    case t if t >= 80 => "A"
    case t if t >= 60 => "B"
    case t if t >= 35 => "C"
    case _ => "D"
  }
})

df.withColumn("category", udfScoreToCategory(df("score"))).show(10)