Detecting Gender Bias Per Movie Genre Using Hive

Recently I have been playing with the MovieLens dataset and encountered an interesting challenge. MovieLens data contains movie rating information. Additionally for each movie we know the genre it belongs to and for each rater we have gender information. Based on the given data, I got intrigued to see if there are any gender bias per movie genre. For instance, are male raters more likely to rate “action” movie than female raters ? or are female raters more likely to rate “romance” movie than the male counterparts ? As it turns out, I wasn’t able to use Hive out of the box and had to extend its capability in two different ways (custom mapper and UDF).

Before we dive down in all the complexity of extending Hive, let’s write a simple query to determine gender bias for a single movie category. Below is the HQL for creating required tables and the simple query to determine proportion of male and female raters for the “action” category.

-- define rating table
    user_id int,
    movie_id int,
    rating int,
    timestamp bigint
LOCATION "/user/hue/data/movies/rating";

-- define movie table
    movie_id int,
    title string,
    release_date string,
    video_release_date string,
    imdb_url string,
    unknown tinyint, action tinyint, adventure  tinyint, animation  tinyint,
    children tinyint, comedy  tinyint, crime  tinyint, documentary  tinyint,
    drama  tinyint, fantasy  tinyint, film_noir  tinyint, horror  tinyint,
    musical  tinyint, mystery  tinyint, romance  tinyint, scifi  tinyint,
    thriller  tinyint, war  tinyint, western  tinyint
LOCATION "/user/hue/data/movies/item";

-- define user table
    user_id int,
    job string,
    gender string,
    age smallint
LOCATION "/user/hue/data/movies/user";

-- determine distribution for "action" category
-- If you want to try this for some other category, replace C.action with the other genre column in the movie table
    C.action as genre,
    SUM(IF(B.gender = 'M', 1, 0))/count(*) as male_raters,
    SUM(IF(B.gender = 'F', 1, 0))/count(*) AS female_raters,
    count(*) as all_raters
FROM rating A
JOIN user B ON (A.user_id = B.user_id)
JOIN movie C ON (A.movie_id = C.movie_id)
WHERE c.action = 1
GROUP BY C.action;

However the above solution has two limitations. First, it only calculates the distribution of male and female raters for a single genre. There are 18 genres and if we intent on finding gender bias for all the 18 genres then we will need to run the above query 18 times. Second, the above query conducts no significance test to determine if the gender distribution is significantly different from the expected distribution. In order to do that we will need to download the dataset and use some other tool such as R, Octave, etc to run the significance test. The good thing is we can easily extend Hive to address both the limitations.

Limitation 1: Gender Distribution for All Genres in a Single Query
As can be seen from the movie table schema, genre information for a given movie is encoded as binary value in 18 different columns. Value of 1 in any of genre column (such as action, romance, etc) indicates that the given movie belongs to that genre. Now let’s say a male rater rated a movie that belongs to both “animation” and “romance” category, we need to increment number of male raters for both the genre by 1. However in order to do that in SQL, we will need to convert column wise genre information to row wise information. Luckily hive allows to inject custom mapper and reducer scripts using “transform” command. Thus we can easily write a simple mapper code in python (or any other language of your choice) to transform columnar movie-genre data to row manner data. The output of the python code will contain two columns: movie_id and genre name. Below is a simple python code for this.

#script name:
import sys
import os

#since the mapper doesn't have access to hive schema,
#we need to tell the python code what are different genre columns are
col_to_genre = {6:"action", 7:"adventure", 8:"animation",
9:"children", 10:"comedy", 11:"crime", 12:"documentary",
13:"drama", 14:"fantasy", 15:"film_noir", 16:"horror",
17:"musical", 18:"mystery", 19:"romance", 20:"scifi",
21:"thriller", 22:"war", 23:"western"}

if __name__ == "__main__":
    for line in sys.stdin:
        #B default Hive will use tab as a delimiter for columns
        tokens = line.split("\t")
        for column in range(6,24):
            if 1 == int(tokens[column]):
                print "\t".join([tokens[0], col_to_genre[column]])

Now, its easy to call the above python code from hive as shown below. Note that instead of writing a python mapper you can also write a UDTF. An advantage of UDTF is that we have access to input schema and use that to make our code even more generic. However writing UDTF is lot more harder than writing a simple python mapper.

-- add python script so that its available to mapper and reducers
add file;

-- convert columnar movie-genre data to row format
-- where each row contains movie_id and genre name.
SELECT TRANSFORM(*) using "python" as (movie_id int, genre string)
FROM movie;

Now as a proof of concept, below is a single SQL query that computes gender distribution for all the genre.

-- add python script so that its available to mapper and reducers
add file;

-- determine gender distribution for all the genres in the movielens dataset
    C.genre as genre,
    ROUND(SUM(IF(B.gender = 'M', 1, 0))/count(*), 2) as male_raters,
    ROUND(SUM(IF(B.gender = 'F', 1, 0))/count(*), 2) as female_raters,
    COUNT(*) AS all_raters
FROM rating A
JOIN user B ON (A.user_id = B.user_id)
	SELECT TRANSFORM(*) using "python" as (movie_id int, genre string)
	FROM movie
) C ON (A.movie_id = C.movie_id)
GROUP BY C.genre;

Running the above query returned following result:

genre male_raters female_raters all_raters
action 0.79 0.21 25589
adventure 0.77 0.23 13753
animation 0.72 0.28 3606
children 0.69 0.31 7183
comedy 0.73 0.27 29833
crime 0.78 0.22 8055
documentary 0.75 0.25 758
drama 0.72 0.28 39895
fantasy 0.73 0.27 1352
film_noir 0.78 0.22 1733
horror 0.77 0.23 5317
musical 0.71 0.29 4954
mystery 0.75 0.25 5245
romance 0.7 0.3 19461
scifi 0.79 0.21 12730
thriller 0.77 0.23 21872
war 0.77 0.23 9398
western 0.8 0.2 1854

Limitation 2: Significance Test
Now that we have gender distribution for all the genres, the second challenge is to determine whether the distribution is significantly different from the expected distribution of 72% males and 28% females. We can use ChiSquare test to compare observed and expected distributions. The problem is that there is no built-in function for chiSquare testing. We can easily write a UDF (user defined function) that does that. Below is a java version of a UDF that conducts chi-square test and returns p-value. It takes four arguments:

  1. Expected Percentage of Male Raters
  2. Expected Percentage of Female Raters
  3. Observed Number of Male Raters
  4. Observed Number of Female Raters
import org.apache.commons.math.stat.inference.ChiSquareTestImpl;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

	name = "ChiSquare Test",
	value = "_FUNC_(str) - Conducts chi square test for two category and returns significance level (pvalue)",
	extended = "Example:\n" +
	"  > SELECT ChiSquare(expected1, expected2, observed1, observed2 ) FROM test a;\n";
public class ChiSquare extends UDF {

	public double evaluate(double e1, double e2, long o1, long o2) throws Exception {
		//sanity check
		if ((e1 + e2) != 1.0){
			throw new Exception("Expected values should sum exactly to 1.0");

		long[] observed = new long[]{o1, o2};

		//calculate total size
		long total_observed = o1 + o2;

		//calculate expected size
		double[] expected = new double[] { e1 * total_observed, e2 * total_observed };

		//conduct ChiSquare test
		ChiSquareTestImpl csi = new ChiSquareTestImpl();
		double pvalue = csi.chiSquareTest(expected, observed);

		//return pvalue upto 3 decimal places
		return (double)Math.round(pvalue * 1000)/1000;

Once the above code is compiled and packaged into jar, it’s easy to use the above UDF to determine which of the category has significant gender bias. Below is the modified query that not only determines proportion of male and female raters per genre but also conducts chi-square test.

-- define ChiSquare function
ADD JAR udf.jar;
CREATE TEMPORARY FUNCTION chiSquareTest AS 'com.meetup.udf.ChiSquare';

-- import file to convert genre col to row
add file;

-- run query
SELECT genre,
       ROUND(male_raters_cnt/all_raters_cnt, 2) as male_raters_prop,
       ROUND(female_raters_cnt/all_raters_cnt, 2) as female_raters_prop,
       chiSquare(0.72, 0.28, male_raters_cnt, female_raters_cnt) as pvalue
        C.genre as genre,
        SUM(IF(B.gender = 'M', 1, 0)) as male_raters_cnt,
        SUM(IF(B.gender = 'F', 1, 0)) as female_raters_cnt,
        COUNT(*) AS all_raters_cnt
    FROM rating A
    JOIN user B ON (A.user_id = B.user_id)
        SELECT TRANSFORM(*) using "python" as (movie_id int, genre string)
        FROM movie
    ) C ON (A.movie_id = C.movie_id)
    GROUP BY C.genre
) gender_distribution
ORDER BY pvalue, male_raters_prop desc;


Below is the obtained result for the above query. From the table below, we can claim following things:

  1. 14 out of 18 genres shows statistically significant gender bias at 5% significance level.
  2. Highlighted in red color, 12 out of the 14 genres (from western to comedy) shows significant male bias. Western genre has the most male bias with 80% of the raters being male. Other genres exhibiting significant male bias are: action, scifi, crime, war, thriller, etc.
  3. Highlighted in green color, 2 out of 14 genres shows significant female bias. These two genres are romance and children.
  4. Drama, musical, fantasy and animation don’t show any statistically significant gender bias.
Genre % Male Raters % Female Raters ChiSquare PValue
western 0.8 0.2 0.0
action 0.79 0.21 0.0
scifi 0.79 0.21 0.0
crime 0.78 0.22 0.0
film_noir 0.78 0.22 0.0
war 0.77 0.23 0.0
thriller 0.77 0.23 0.0
adventure 0.77 0.23 0.0
horror 0.77 0.23 0.0
mystery 0.75 0.25 0.0
comedy 0.73 0.27 0.0
romance 0.7 0.3 0.0
children 0.69 0.31 0.0
documentary 0.75 0.25 0.041
drama 0.72 0.28 0.07
musical 0.71 0.29 0.082
fantasy 0.73 0.27 0.346
animation 0.72 0.28 0.586

One thought on “Detecting Gender Bias Per Movie Genre Using Hive

  1. Great read!! One question – did you assume that the likelihood of rating a genre is same as likelihood of watching a movie from that genre? For example, are women less likely to watch action movie as well? or less likely to rate an action movie once they have watched it? How does these two data (likelihood of watching a genre vs (likelihood of rating | user has watched a movie in that genre) correlated?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s