Unnesting JSON objects in Redshift
Using UNPIVOT to iterate over objects
Posted by Owen Stephens on November 12, 2021
Redshift and ELT
AWS Redshift is Amazon's managed data warehouse service, which we make heavy use of at Bamboo. A common process when using a data warehouse is Extract, Transform, Load (ETL). In short, ETL is the process of extracting data from a source system/database, transforming it into a representation suitable for use in a (relational) data warehouse and then loading the transformed data into the warehouse.
A common approach with Redshift is to flip the order of the Transform and Load steps: raw data extracted from a source system is loaded directly into Redshift, and Redshift's compute power is then used to perform any transformations. This is known as ELT. ELT is beneficial because it often removes the need for a separate transformation tool, reducing the effort and cost of making data available in Redshift.
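As a sketch of what this can look like in practice, raw JSON documents can be copied from S3 straight into a SUPER column and transformed later with SQL. The table name, S3 path and IAM role below are hypothetical:
> CREATE TABLE raw_inventory (payload SUPER)

> COPY raw_inventory
  FROM 's3://example-bucket/inventory/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
  -- 'noshred' keeps each JSON document whole in the single SUPER column
  FORMAT JSON 'noshred'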
An example of Redshift's support for ELT is the SUPER column type, which allows the storage of structured (JSON) data directly in Redshift relations. Recently, AWS have improved their support for transforming such structured data with the new UNPIVOT keyword to destructure JSON objects.
In this post we'll demonstrate UNPIVOT and how it enhances Redshift's ELT abilities.
Example: inventory tracking
Consider an imaginary inventory tracking system that tracks the inventory of several shops, where each shop stocks an arbitrary set of items; assume that the shops' source systems store each inventory as a JSON object. We can store this structured data by parsing the JSON into the SUPER column type using json_parse. (For this post we will use a temporary table, but the queries would also work with a non-temporary table.)
> CREATE TEMPORARY TABLE example_data AS (
SELECT 10 AS shop_id,
json_parse('{"apple_count": 2, "orange_count": 6}') AS inventory
UNION ALL
SELECT 20, json_parse('{"pear_count": 10, "other_data": 42}')
UNION ALL
SELECT 30, json_parse('{"apple_count": 3, "lemon_count": 5}')
)
> SELECT * FROM example_data
+-----------+------------------------------------+
| shop_id | inventory |
|-----------+------------------------------------|
| 10 | {"apple_count":2,"orange_count":6} |
| 20 | {"pear_count":10,"other_data":42} |
| 30 | {"apple_count":3,"lemon_count":5} |
+-----------+------------------------------------+
We can extract a particular item's count from each inventory:
> SELECT shop_id, 'apple' AS item_name, inventory.apple_count AS count
FROM example_data
WHERE count IS NOT NULL
+-----------+-------------+---------+
| shop_id | item_name | count |
|-----------+-------------+---------|
| 10 | apple | 2 |
| 30 | apple | 3 |
+-----------+-------------+---------+
and could sum that item's count across all shops:
> SELECT 'apple' AS item_name, sum(inventory.apple_count) AS total
FROM example_data
+-------------+---------+
| item_name | total |
|-------------+---------|
| apple | 5 |
+-------------+---------+
However, it would be inconvenient to do the same for every item type, especially if, when writing the query, we don't know all the types that exist. We could use something like:
> SELECT *
FROM (
SELECT shop_id, 'apple' AS item_name, inventory.apple_count AS count
FROM example_data
WHERE count IS NOT NULL
UNION ALL
SELECT shop_id, 'orange' AS item_name, inventory.orange_count AS count
FROM example_data
WHERE count IS NOT NULL
UNION ALL
SELECT shop_id, 'pear' AS item_name, inventory.pear_count AS count
FROM example_data
WHERE count IS NOT NULL
UNION ALL
SELECT shop_id, 'lemon' AS item_name, inventory.lemon_count AS count
FROM example_data
WHERE count IS NOT NULL
) d
ORDER BY shop_id, item_name
+-----------+-------------+---------+
| shop_id | item_name | count |
|-----------+-------------+---------|
| 10 | apple | 2 |
| 10 | orange | 6 |
| 20 | pear | 10 |
| 30 | apple | 3 |
| 30 | lemon | 5 |
+-----------+-------------+---------+
but that is not easy to read or maintain, due to the amount of duplication.
Luckily, using UNPIVOT we do not need to hard-code item names, and can extract each key-value mapping from the inventory object as a separate row:
> SELECT shop_id, key, value
FROM example_data ed, UNPIVOT ed.inventory AS value AT key
ORDER BY shop_id, key
+-----------+--------------+---------+
| shop_id | key | value |
|-----------+--------------+---------|
| 10 | apple_count | 2 |
| 10 | orange_count | 6 |
| 20 | other_data | 42 |
| 20 | pear_count | 10 |
| 30 | apple_count | 3 |
| 30 | lemon_count | 5 |
+-----------+--------------+---------+
N.b. the comma in FROM example_data ed, UNPIVOT is intentional, and represents a CROSS JOIN in SQL-89 syntax. We could equivalently write FROM example_data ed CROSS JOIN UNPIVOT ed.inventory AS value AT key, but we use the comma syntax since it matches the Redshift documentation.
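Written out with an explicit CROSS JOIN, the query above becomes the following, which should return the same rows:
> SELECT shop_id, key, value
  FROM example_data ed CROSS JOIN UNPIVOT ed.inventory AS value AT key
  ORDER BY shop_id, key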
Now we can filter out the non-count rows based on their key, remove the '_count' suffix, and rename the columns to obtain the same result as our verbose query above:
> SELECT shop_id, regexp_replace(key, '_count$', '') AS item_name, value AS count
FROM example_data ed, UNPIVOT ed.inventory AS value AT key
WHERE key LIKE '%_count'
ORDER BY shop_id, item_name
+-----------+-------------+---------+
| shop_id | item_name | count |
|-----------+-------------+---------|
| 10 | apple | 2 |
| 10 | orange | 6 |
| 20 | pear | 10 |
| 30 | apple | 3 |
| 30 | lemon | 5 |
+-----------+-------------+---------+
The result is exactly the same as for our verbose query, but we haven't had to write a duplication-heavy query, or explicitly name each item that we are interested in; this is a great improvement!
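Because the item names are now data rather than hard-coded column references, aggregating across every item type is equally easy. A minimal sketch, reusing the filtering above, that totals each item across all shops:
> SELECT regexp_replace(key, '_count$', '') AS item_name, sum(value) AS total
  FROM example_data ed, UNPIVOT ed.inventory AS value AT key
  WHERE key LIKE '%_count'
  GROUP BY item_name
  ORDER BY item_name
For the example data this should report totals of 5 apples, 5 lemons, 6 oranges and 10 pears.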
Combining UNPIVOT and unnesting
We can take our example one step further, and nest the shop_id and associated inventory into a single object, and then store those objects in an array. That is, as JSON:
[
{"shop_id": 10, "inventory": {"apple_count": 2, "orange_count": 6}},
{"shop_id": 20, "inventory": {"pear_count": 10, "other_data": 42}},
{"shop_id": 30, "inventory": {"apple_count": 3, "lemon_count": 5}}
]
We construct another temporary table for our example:
> CREATE TEMPORARY TABLE nested_example_data AS (
SELECT json_parse('
[
{"shop_id": 10, "inventory": {"apple_count": 2, "orange_count": 6}},
{"shop_id": 20, "inventory": {"pear_count": 10, "other_data": 42}},
{"shop_id": 30, "inventory": {"apple_count": 3, "lemon_count": 5}}
]
') AS shop_inventories
)
and can then combine the existing support for unnesting JSON arrays with the new support for unnesting objects:
> SELECT shop_index, shop.shop_id, key, value
FROM nested_example_data AS ned,
ned.shop_inventories AS shop AT shop_index,
UNPIVOT shop.inventory AS value AT key
+--------------+-----------+--------------+---------+
| shop_index | shop_id | key | value |
|--------------+-----------+--------------+---------|
| 0 | 10 | apple_count | 2 |
| 0 | 10 | orange_count | 6 |
| 1 | 20 | pear_count | 10 |
| 1 | 20 | other_data | 42 |
| 2 | 30 | apple_count | 3 |
| 2 | 30 | lemon_count | 5 |
+--------------+-----------+--------------+---------+
Notice that we unnest the array, obtaining each object as shop (and capturing the 0-based array index in shop_index), and then unpivot shop.inventory as before, using UNPIVOT.
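As a final sketch, the same filtering and renaming we applied earlier should also work on the nested form, recovering the per-shop item counts directly from the array:
> SELECT shop.shop_id, regexp_replace(key, '_count$', '') AS item_name, value AS count
  FROM nested_example_data AS ned,
       ned.shop_inventories AS shop,
       UNPIVOT shop.inventory AS value AT key
  WHERE key LIKE '%_count'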
Conclusion
In this post we have shown an example where the new UNPIVOT keyword helps us write powerful transformation queries directly in Redshift, greatly enhancing our ability to write ELT-style transformations.