DRY (don’t repeat yourself) is one of the fundamental principles of software engineering. The main idea is to avoid duplicating business/processing logic throughout the code. However, I rarely see it being applied when writing SQL queries; making it difficult to understand and maintain them.
Below are few tips on making HiveQL DRY.
- Use “Macro” to express transformation logic that can be easily expressed in SQL.
- Consider using “With Clause” to express subqueries.
- Use variables to define constants.
Macros
Macros allows to assign an alias to a reusable processing logic that can be expressed in SQL. In simple terms, its like defining a function purely in SQL (although it doesn’t operate that way. It will do an inline expansion but we don’t have to worry about it for now).
For instance, in the below table we have two duration fields where the duration value is expressed in different units (such as milliseconds, seconds, minutes, etc).
UUID | duration1 | duration2 |
---|---|---|
1 | 10ms | 20us |
2 | 16s | 20ms |
3 | 5m | 2us |
Below is a typical way of writing the HiveQL for this. Its bad because we have duplicated (once for each field) the logic of converting the duration expressed as string to duration in seconds. Anytime we make changes to it we will have to make sure to update the logic everywhere in the code.
SELECT UUID, CASE WHEN duration1 like '%us' THEN CAST(REPLACE(duration1, 'us', '') AS DOUBLE) / 1.0E6 WHEN duration1 like '%ms' THEN CAST(REPLACE(duration1, 'ms', '') AS DOUBLE) / 1000.0 WHEN duration1 like '%s' THEN CAST(REPLACE(duration1, 's', '') AS DOUBLE) WHEN duration1 like '%m' THEN CAST(REPLACE(duration1, 'm', '') AS DOUBLE) * 60 ELSE NULL END as duration1_seconds, CASE WHEN duration2 like '%us' THEN CAST(REPLACE(duration2, 'us', '') AS DOUBLE) / 1.0E6 WHEN duration2 like '%ms' THEN CAST(REPLACE(duration2, 'ms', '') AS DOUBLE) / 1000.0 WHEN duration2 like '%s' THEN CAST(REPLACE(duration2, 's', '') AS DOUBLE) WHEN duration2 like '%m' THEN CAST(REPLACE(duration2, 'm', '') AS DOUBLE) * 60 ELSE NULL END as duration2_seconds FROM ( SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2 UNION ALL SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2 UNION ALL SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2 ) A
DRY way to rewrite the above query is to utilize “macro”. We first define a macro “DURATION_IN_SECONDS” and use it convert all the duration fields as shown below.
-- define macro to convert duration string to duration in seconds CREATE TEMPORARY MACRO DURATION_IN_SECONDS (t string) CASE WHEN t like '%us' THEN CAST(REPLACE(t, 'us', '') AS DOUBLE) / 1.0E6 WHEN t like '%ms' THEN CAST(REPLACE(t, 'ms', '') AS DOUBLE) / 1000.0 WHEN t like '%s' THEN CAST(REPLACE(t, 's', '') AS DOUBLE) WHEN t like '%m' THEN CAST(REPLACE(t, 'm', '') AS DOUBLE) * 60 ELSE NULL END; SELECT UUID, -- use macro to convert first duration field DURATION_IN_SECONDS(duration1) duration1_seconds, -- use macro to convert second duration field DURATION_IN_SECONDS(duration2) duration2_seconds FROM ( SELECT 1 AS UUID, '10ms' as duration1, '20us' as duration2 UNION ALL SELECT 2 AS UUID, '16s' as duration1, '20ms' as duration2 UNION ALL SELECT 3 AS UUID, '5m' as duration1, '2us' as duration2 ) A
With Clause
Below is an example of another typical query. In the SQL below, we use tableC to filter tableA and tableB and then join the two together. The logic on how to filter tableC itself has been duplicated.
SELECT * FROM ( SELECT TableA.* FROM TableA JOIN TableC ON (TableA.id = TableC.id) WHERE TableA.datestr >= '2017-01-01' -- filters on table C AND TableC.datestr >= '2017-01-01' AND TableC.status != 0 ) A JOIN ( SELECT TableB.* FROM TableB JOIN TableC ON (TableB.id = TableC.id) WHERE TableB.datestr >= '2017-01-01' -- filters on table C AND TableC.datestr >= '2017-01-01' AND TableC.status != 0 ) B On (A.id = B.id)
Here, using “With” clause can help us make this query DRY. We first express the logic of filtering table C and assign it an alias. Next we join tableA and tableB to this alias.
-- express logic to filter table C over here WITH FilteredTableC AS ( SELECT * FROM TableC WHERE datestr >= '2017-01-01' AND status != 0 ) SELECT * FROM ( SELECT TableA.* FROM TableA JOIN FilteredTableC ON (TableA.id = FilteredTableC.id) WHERE TableA.datestr >= '2017-01-01' ) A JOIN ( SELECT TableB.* FROM TableB JOIN FilteredTableC ON (TableB.id = FilteredTableC.id) WHERE TableB.datestr >= '2017-01-01' ) B On (A.id = B.id)
“With” Clause not only helps with making a SQL DRY, but is also very useful in breaking a big sql involving many joins into smaller easy self summarizing chunks. For instance below is an example of a query that joins three tables together. Even in this simple query it becomes difficult to understand the goal as there is a list of filters that we are applying to different tables.
SELECT drivers.*, riders.* FROM trips JOIN drivers ON drivers.driver_id = trips.driver_id JOIN riders ON riders.rider_id = trips.rider_id WHERE trips.datestr >= '2017-01-01' AND trips.status = 0 AND trips.city = 'SF' AND drivers.joined >= '2017-01-01' AND drivers.status = 'active' AND riders.joined >= '2017-01-01' AND riders.name like 'XYZ%'
Using “With Clause” allows to rewrite the above query in much more legible way. Each table is separately filtered and assigned a readable alias which is then used in the main query.
WITH SuccessfulTrips as ( SELECT * FROM trips WHERE trips.status = 0 AND trips.datestr >= '2017-01-01' ), ActiveDrivers as ( SELECT * FROM drivers WHERE drivers.status = 'active' AND drivers.joined >= '2017-01-01' ), XYZRiders as ( SELECT * FROM riders WHERE riders.name like 'XYZ%' AND riders.joined >= '2017-01-01' ) SELECT ActiveDrivers.*, XYZRiders.* FROM SuccessfulTrips JOIN ActiveDrivers ON (SuccessfulTrips.driver_id = ActiveDrivers.driver_id) JOIN XYZRiders ON (SuccessfulTrips.rider_id = XYZRiders.rider_id)
Variables
Often we use same constant values in multiple places. Instead of copying these constant values all over the place we can easily define a variable and use the variable.
SET start_date = '2017-01-01'; SET end_date = '2017-05-01'; SELECT A.*, B.* FROM A.* JOIN B.* ON (A.id = B.id) WHERE A.datestr >= ${hiveconf:start_date} AND A.datestr <= ${hiveconf:end_date} AND B.datestr >= ${hiveconf:start_date} AND B.datestr <= ${hiveconf:end_date}
There are few different options for setting variables in hive. Make sure to read comments on this stackoverflow post