[Data] Let’s Create the Sankey Chart

Joshua Kim
9 min readMay 21, 2023

--

Data Pipeline from the website all the way up to Redash

In this article, I’m going to tell you how you can create the Sankey Chart from GA4, BigQuery, and up to Redash.

https://www.originlab.com/doc/Origin-Help/Sankey-Diagram

Sankey Chart, oftentimes also called Sankey Diagram, is a type of flow visualization in which the width of the arrows is proportional to the flow rate of the depicted extensive property. [Wikipedia]

In the Data Analytics world though, we can figure out how our visitors make their journey throughout our website pages or app screens.

With the Sankey Chart, we can apply our strategies as shown below:

  • “Many purchasing visitors mainly make a journey starting from page A, page B, page C, and finally make their purchase. Therefore, we might be able to create a CTA button in order to drive the other visitors to this journey, so that we could make our revenue much higher.”
  • “Page A and Page B are the points where our visitors bounce off the most. Bet those two pages have some problems in terms of psychology or technology. Let’s now dive into specific user interviews or research to deal with this bounce-off rate.”

GA4’s Path Analysis

  • Although GA4 supports some custom visualizations and you can explore the path analysis to figure out how your visitors take their journey throughout the websites, it’s literally awful and much of a hassle if you’re eager to get some insights regarding that. (See below how awful it is.)

Now that hopefully you got deeply understood what is the Sankey Chart and why it’s important in product analytics, it’s time to show you how to create the Sankey Chart from A to Z.

1. Flatten the Table

Although it’s really handy to connect the GA4 events to BigQuery, there’s a complicated issue hard to handle; some of the datatypes imported to BigQuery are STRUCT type. It’s regarded as an array type, or you can imagine this as a STRUCT type column that allows each row to have a multiple-dimensional value in it.

You can simply use UNNEST to flatten each STRUCT type column in advance of your main query.

WITH
CTE_flatten1 AS (
SELECT
event_date, event_timestamp,

user_pseudo_id, ga_session_id.value.int_value AS session_id, ga_session_number.value.int_value AS session_number,
event_name, page_location.value.string_value AS page_location, ecommerce.purchase_revenue_in_usd AS revenue_usd,
geo.country, device.category AS device,

utm_campaign.value.string_value AS utm_campaign,
utm_medium.value.string_value AS utm_medium,
utm_source.value.string_value AS utm_source,
page_referrer.value.string_value AS page_referrer
FROM `your_table.events_*`
LEFT JOIN UNNEST (event_params) AS ga_session_id ON ga_session_id.key = 'ga_session_id'
LEFT JOIN UNNEST (event_params) AS ga_session_number ON ga_session_number.key = 'ga_session_number'
LEFT JOIN UNNEST (event_params) AS page_location ON page_location.key = 'page_location'
LEFT JOIN UNNEST (event_params) AS utm_campaign ON utm_campaign.key = 'campaign'
LEFT JOIN UNNEST (event_params) AS utm_medium ON utm_medium.key = 'medium'
LEFT JOIN UNNEST (event_params) AS utm_source ON utm_source.key = 'source'
LEFT JOIN UNNEST (event_params) AS page_referrer ON page_referrer.key = 'page_referrer'
WHERE
_table_suffix BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)) AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
AND ga_session_number.value.int_value = 1
AND user_pseudo_id IS NOT NULL
AND ga_session_id IS NOT NULL
),

LEFT JOIN UNNEST (event_params) AS ga_session_id ON ga_session_id.key = ‘ga_session_id’

  • Each event may have ga_session_id in its parameters, and we have to grab it in order to follow through each user session’s journey based on the session ID.

LEFT JOIN UNNEST (event_params) AS ga_session_number ON ga_session_number.key = ‘ga_session_number’

  • Each event may have ga_session_number in its parameters, and we have to look through it in order to know if each session is the first visiting user’s or returning user’s session.

LEFT JOIN UNNEST (event_params) AS page_location ON page_location.key = ‘page_location’

  • Most of each event has page_location in its parameters, and of course, we have to know page_location in order to follow through with their journey based on the page URL.

LEFT JOIN UNNEST (event_params) AS utm_campaign ON utm_campaign.key = ‘campaign’

  • Many times, each session has a totally different shape of the journey depending on what kind of campaign channel the session has started from. Therefore, let’s also query utm_campaign just in case.

LEFT JOIN UNNEST (event_params) AS utm_medium ON utm_medium.key = ‘medium’

  • Many times, each session has a totally different shape of the journey depending on what kind of medium channel the session has started from. Therefore, let’s also query utm_medium just in case.

LEFT JOIN UNNEST (event_params) AS utm_source ON utm_source.key = ‘source’

  • Many times, each session has a totally different shape of the journey depending on what kind of source channel the session has started from. Therefore, let’s also query utm_source just in case.

LEFT JOIN UNNEST (event_params) AS page_referrer ON page_referrer.key = ‘page_referrer’

  • Many times, each session has a totally different shape of the journey depending on exactly what previous page URL the session has started from. Therefore, let’s also query page_referrer just in case.

AND ga_session_number.value.int_value = 1
AND user_pseudo_id IS NOT NULL
AND ga_session_id IS NOT NULL

  • Here, I wanted to know specifically about the first visiting users’ journey, and that’s why I try to query with ga_session_number only as 1.
  • We basically have to track the user_pseudo_id and ga_session_id in order to each user journey has been there from the same user and same session. That’s why I dropped all the missing user_pseudo_id and ga_session_id instances.

2. Make Sure each user and session has the same properties(or parameters).

Sometimes, even though it’s rooted in the same session ID, some events have their parameters fully, but others don’t. That’s why we make duplicates for the parameters throughout the same user ID and session ID.

This is a real BigQuery table appended from GA4, and look only one event has the medium value.

You can use FIRST_VALUE and LAST_VALUE functions to make the duplicates.

CTE_flatten2 AS (
SELECT
event_date, event_timestamp,
user_pseudo_id, session_id, session_number,
event_name, REGEXP_REPLACE(REGEXP_REPLACE(LOWER(page_location), r'(\?.*)$', ''), r'/$', '') AS page_location, revenue_usd,
FIRST_VALUE(country) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS country,
FIRST_VALUE(device) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS device,
LAST_VALUE(utm_campaign) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS utm_campaign,
LAST_VALUE(utm_medium) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS utm_medium,
LAST_VALUE(utm_source) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS utm_source,
LAST_VALUE(REGEXP_REPLACE(REGEXP_REPLACE(LOWER(page_referrer), r'(\?.*)$', ''), r'/$', '')) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS page_referrer
FROM CTE_flatten1
),

REGEXP_REPLACE(REGEXP_REPLACE(LOWER(page_location), r’(\?.*)$’, ‘’), r’/$’, ‘’) AS page_location

FIRST_VALUE(country) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS country

  • Because of some technical problems, one user and session might have multiple countries. Regarding this issue, I would make a decision to get the very first country value to know where the user and session have started.
  • Let’s assume a visitor is aboard a plane, and when he/she starts the session the plane is located in the US, but when visiting the following page, the plane has just entered the Canada area. In this extreme case, we’d rather think of this visitor as starting from the US.

3. Delete the Consecutive Same Page URL Duplicates in the Same Session ID

In most cases, if a visitor lands on page A, multiple events can happen.

  • Say, session_start, page_view, user_engagement, scroll, and whatnot

We need to delete all the same Page URLs triggered consecutively in one session, but only with the unique page URL.

CTE_flatten3 AS (
SELECT
event_date, event_timestamp,
user_pseudo_id, session_id, session_number,
event_name,
page_location,
LAG(page_location) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS previous_page_location,
LEAD(page_location) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS next_page_location,
revenue_usd,
country, device,
utm_campaign, utm_medium, utm_source, page_referrer
FROM (
SELECT
*,
CASE
WHEN -- Previous URL IS NULL
previous_page_location IS NULL THEN 'remain'
WHEN -- Previous URL <> Current URL
previous_page_location <> page_location THEN 'remain'
WHEN -- Previous URL = Current URL
previous_page_location = page_location AND page_location <> next_page_location THEN 'del'
END AS remain_or_del
FROM (
SELECT
event_date, event_timestamp,
user_pseudo_id, session_id, session_number,
event_name,
page_location,
LAG(page_location) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS previous_page_location,
LEAD(page_location) OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS next_page_location,
revenue_usd,
country, device,
utm_campaign, utm_medium, utm_source, page_referrer
FROM CTE_flatten2
)
)
WHERE remain_or_del = 'remain'
ORDER BY user_pseudo_id, session_id, event_timestamp
),

Here in the query above, I used LAG and LEAD window functions to know whether to delete duplicates based on the previous page URL and the next page URL.

4. Label Each Page’s Visiting Order Number for Each Session

We can Label visit_order for each page URL based on the event_timestamp in ascending order.

  • Here, I used the ROW_NUMBER window function to do it.
CTE_user_session_visit_pages AS (
SELECT
event_date, event_timestamp,
user_pseudo_id, session_id,
page_location, ROW_NUMBER() OVER(PARTITION BY user_pseudo_id, session_id ORDER BY event_timestamp) AS visit_order,
country, device,
utm_campaign, utm_medium, utm_source, page_referrer
FROM CTE_flatten3
ORDER BY user_pseudo_id, session_id, visit_order
),

5. Categorize All the Pages

The page URL itself looks dirty since it’s not that human-readable, so here I wanted to categorize all the pages existing in the whole domain to make them literally human-readable.

CTE_user_session_visit_pagegroups AS (
SELECT
event_date, event_timestamp,
user_pseudo_id, session_id,
page_location,
CASE
WHEN page_location = 'https://google.com' THEN 'Google Home'
WHEN page_location = 'https://google.com/joshua' THEN 'Joshua Intro'
WHEN page_location = 'https://google.com/andrew' THEN 'Andrew Intro'
WHEN page_location = 'https://google.com/amber' THEN 'Amber Intro'
WHEN CONTAINS_SUBSTR(page_location, 'https://google.com/policy') = True THEN 'Policy Pages'
WHEN CONTAINS_SUBSTR(page_location, 'https://google.com/events') = True THEN 'Event Pages'
ELSE 'ETC'
END AS page_group,
visit_order,
country, device,
utm_campaign, utm_medium, utm_source, page_referrer
FROM CTE_user_session_visit_pages
ORDER BY user_pseudo_id, session_id, visit_order
),

6. Pivot Each Session’s Journey up to STEP 10

CTE_sankey AS (
SELECT
user_pseudo_id, session_id, country, device, utm_campaign, utm_medium, utm_source, page_referrer,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 1)) AS step01,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 2)) AS step02,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 3)) AS step03,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 4)) AS step04,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 5)) AS step05,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 6)) AS step06,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 7)) AS step07,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 8)) AS step08,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 9)) AS step09,
MAX((SELECT page_group FROM CTE_user_session_visit_pagegroups SUB
WHERE MAIN.user_pseudo_id = SUB.user_pseudo_id AND MAIN.session_id = SUB.session_id AND visit_order = 10)) AS step10
FROM CTE_user_session_visit_pagegroups MAIN
GROUP BY
user_pseudo_id, session_id, country, device, utm_campaign, utm_medium, utm_source, page_referrer
ORDER BY
step01, step02, step03, step04, step05, step06, step07, step08, step09, step10
)

7. Make the Query Result Compatible with Redash

Since Redash Sankey Chart has its expected framework and naming rules of the query result, we need to make the final query result in order to fit the rule.

Redash Sankey Visualization’s Constraints
SELECT
step01 AS stage1, step02 AS stage2, step03 AS stage3, step04 AS stage4, step05 AS stage5,
COUNT(*) AS value
FROM CTE_sankey
GROUP BY step01, step02, step03, step04, step05
HAVING step01 = 'Google Home' AND value >= 100
ORDER BY value DESC;

HAVING step01 = ‘Google Home’ AND value >= 100

  • I just wanted to see how the users make their journey who is starting from Google Home.
  • To make the visualization look simple and not distracting, I ordered the query to show me the types of user journeys of at least 100 users.

8. Final Result

Redash returns this kind of Sankey Chart.

9. Conclusion

  • Redash Sankey Chart is much more valuable specifically when you try to figure out how the users make their journey depending on their countries, devices, UTM parameters, referrer pages, and stuff like that.
  • It’s superbly complicated to query, but once you make the whole framework queries to return the Sankey table, I believe you can generate more value to do as many actions as possible which wouldn’t have been possible to do before.
  • Nevertheless, Redash’s Sankey Visualization only supports up to STEP 5, which by that means if your website or app has too many pages and screens, it wouldn’t be enough to look through only 5 steps.

Joshua Kim

--

--

Joshua Kim
Joshua Kim

Written by Joshua Kim

Analytics Engineer | 🇰🇷🇺🇸🇹🇼

No responses yet