Writing Hive Custom Aggregate Functions (UDAF): Part I – Setting Up Eclipse

Writing your first user-defined aggregate function (UDAF) for Hive can be a daunting task. In particular, I ran into these three challenges while working on my first UDAF:

  1. No instructions on how to set up Eclipse for UDAF development.
  2. Often complicated instructions on how to write your first UDAF.
  3. No clear instructions on how to debug the UDAF.

In this and the following posts, I will address each of the above challenges and walk through my solution. In this post, I will show how to set up Eclipse.

Step 1: Install Eclipse
This is straightforward. Simply download and install the Eclipse IDE for Java Developers. The instructions here are based on the Juno release of the Eclipse IDE, but they shouldn’t change much if you are using another version.

Step 2: Install Maven Plugin for Eclipse
Apache Maven is a build automation tool that manages Java dependencies and also compiles and packages jars. We will use Maven to manage our dependencies on other libraries. To install the Maven plugin for Eclipse, follow these steps:

  1. Start Eclipse.
  2. Click on Help > Install New Software.
  3. In the popup box, set the “Work with” text box to http://download.eclipse.org/technology/m2e/releases and press Enter.
  4. Select “Maven Integration for Eclipse” and click “Next”.
  5. Follow the prompts and finally click “Finish”.

Step 3: Create New Maven Project

  1. From the top menu, select File > New > Other > Maven > Maven Project.
  2. Go with the default settings and enter a Group ID and an Artifact ID. The artifact ID is generally the project name, and the group ID is generally a namespace such as “org.apache.hive”.
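
For orientation, the wizard should leave you with a pom.xml roughly like the sketch below. The group ID, artifact ID, and version shown here are hypothetical placeholders rather than values from this walkthrough, so substitute whatever you entered in the dialog.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <!-- Hypothetical coordinates: replace with the Group ID and Artifact ID you entered -->
  <groupId>com.example.hive</groupId>
  <artifactId>my-first-udaf</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
</project>
```

The dependencies added in the next step go inside a “dependencies” element within this “project” element.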

Step 4: Install Dependencies
To compile and build our UDAF, we need the hive-exec jar.

  1. Open your Java project and click on pom.xml.
  2. At the bottom of the editor you should see several tabs such as “Dependencies”, “Dependency Hierarchy”, etc. If you don’t see these tabs, right-click on pom.xml > Open With > Maven POM Editor.
  3. Click on the “Dependencies” tab.
  4. In the search box, enter “hive-exec” and wait for Maven to query the repositories. Select the “org.apache.hive” entry and choose the “0.11” version. Change the scope from “compile” to “provided”: the Hive jar is needed only for development and testing, and it will already be available when the UDF or UDAF runs through Hive. Finally, click OK.
  5. Save the pom.xml file. At this point you might see an error message saying that jdo2-api 2.3-ec is not available. To fix this error, download jdo2-api 2.3-ec from http://www.datanucleus.org/downloads/maven2/javax/jdo/jdo2-api/2.3-ec/. Then open the pom.xml file in a text editor and add the following lines inside the “dependencies” tag:

<dependency>
   <groupId>javax.jdo</groupId>
   <artifactId>jdo2-api</artifactId>
   <version>2.3-ec</version>
   <scope>system</scope>
   <systemPath>[Complete path to the jar]/jdo2-api-2.3-ec.jar</systemPath>
</dependency>

Save the pom.xml file. Now right-click on the project name > Maven > Update Dependencies. The error should be gone.
Alternatively, you can install jdo2-api into your local Maven repository by running the following command from your terminal:

mvn install:install-file \
-Dfile=[REPLACE THIS WITH COMPLETE PATH]/jdo2-api-2.3-ec.jar \
-DgroupId=javax.jdo -DartifactId=jdo2-api \
-Dversion=2.3-ec -Dpackaging=jar
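
If you prefer to edit pom.xml by hand instead of using the “Dependencies” tab, the hive-exec entry from step 4 would look roughly like the sketch below. The exact version string (0.11.0 here) is an assumption on my part; use whichever 0.11 version the Dependencies dialog lists for you.

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <!-- Assumed version string; match what the Dependencies dialog shows -->
    <version>0.11.0</version>
    <!-- "provided": needed to compile and test, but supplied by Hive at runtime -->
    <scope>provided</scope>
  </dependency>
</dependencies>
```

With the “provided” scope, Maven puts the jar on the compile and test classpaths but does not bundle it into your own jar, which matches the reasoning in step 4.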

Great. Check out my next post on how to write your first user-defined aggregate function.