If you want to extract the top N records of an RDD ordered by multiple fields, you can still use the takeOrdered function in PySpark. It wasn't obvious to me at first, until I realized that comparison operators such as > and < are overloaded in Python and work element by element on lists and tuples.
print(3 > 2)            # True
print((3, 1) > (2, 2))  # True
print([2, 1] > [2])     # True
print((2, 1) > (2,))    # True
print((2, 2) > (2, 2))  # False
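This is exactly what makes multi-field sorting work in plain Python too: a key function that returns a tuple is compared element by element. A small illustration (the sample list is made up for this sketch):

```python
words = ["pear", "fig", "apple", "kiwi"]

# Sort by length ascending, then alphabetically; the tuple key
# is compared element by element, just like the prints above.
result = sorted(words, key=lambda w: (len(w), w))
print(result)  # ['fig', 'kiwi', 'pear', 'apple']
```

The same tuple-key idea carries over directly to Spark's takeOrdered below.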
Below is an example of using this to sort an RDD on multiple fields and extract the top N records. Basically, we return a tuple as the key.
# load dataset
data = sc.parallelize(...)

# Order by col 1 in descending order, then col 0 in ascending order
topN = data.takeOrdered(10, key=lambda x: (-1 * x[1], x[0]))
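Since the tuple comparison itself is plain Python, the ordering logic can be checked locally without a Spark cluster. Here is a sketch using heapq.nsmallest, which selects the N smallest elements by key much like takeOrdered does (the records and column layout are made-up assumptions; negating x[1] only works for numeric columns):

```python
import heapq

# Made-up (col0, col1) records for illustration.
records = [("a", 5), ("b", 9), ("c", 5), ("d", 9), ("e", 1)]

# Same key as the Spark example: col 1 descending, then col 0 ascending.
top3 = heapq.nsmallest(3, records, key=lambda x: (-1 * x[1], x[0]))
print(top3)  # [('b', 9), ('d', 9), ('a', 5)]
```

Note the trick only inverts numeric fields; for descending order on strings you would need a different approach, such as two sorts or a comparator.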