PySpark takeOrdered on Multiple Fields

If you want to extract the top N records of an RDD ordered by multiple fields, you can still use PySpark's takeOrdered function. It wasn't obvious to me at first, until I realized that the comparison operators (">", "<", etc.) are overloaded in Python and compare lists and tuples element by element.

print(3 > 2)            # True
print([3] > [2])        # True
print([2, 1] > [2])     # True  (equal prefix, the longer sequence is greater)
print((2, 1) > (2,))    # True  (same rule applies to tuples)
print((2, 2) > (2, 2))  # False (the tuples are equal)

Below is an example of how to use this behavior to sort an RDD on multiple fields and extract the top N records. The trick is to return a tuple as the key.

# load dataset
data = sc.parallelize(...)

# order by col 1 descending, then col 0 ascending
topN = data.takeOrdered(10, key=lambda x: (-1 * x[1], x[0]))
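To make this concrete, here is a self-contained sketch with hypothetical sample data (the (name, score) pairs and the local SparkContext setup are my own illustration, not part of the original example):

from pyspark import SparkContext

sc = SparkContext("local[*]", "takeOrdered-demo")

# hypothetical dataset of (name, score) records
data = sc.parallelize([("b", 5), ("a", 5), ("c", 9), ("d", 1)])

# score (col 1) descending, then name (col 0) ascending
top2 = data.takeOrdered(2, key=lambda x: (-x[1], x[0]))
print(top2)  # [('c', 9), ('a', 5)]

Note that negating the value (-x[1]) is what flips the sort direction, so this trick only works for numeric fields.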

Code References:
1. takeOrdered: note that it uses MaxHeapQ to collect elements and order them.
2. MaxHeapQ: uses Python's basic comparison operators to organize the heap, which is why tuple keys work. A minimal sketch of the idea follows below.
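To see why plain comparison operators are all you need, here is a minimal, self-contained sketch of bounded top-N selection (an illustration of the idea, not PySpark's actual implementation):

import heapq

def take_ordered(items, num, key=lambda x: x):
    # heapq compares keys with the standard "<" operator, so a
    # tuple-valued key orders records field by field for free
    return heapq.nsmallest(num, items, key=key)

records = [(1, 7), (2, 7), (0, 3), (5, 9)]
print(take_ordered(records, 2, key=lambda x: (-x[1], x[0])))
# [(5, 9), (1, 7)]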
