[Data] It’s Harder than You Think to Extract DAU Separating New and Existing Users in BigQuery
In this article, I’m going to talk about how to extract DAU separating new users and existing users (or returning users) directly from BigQuery. I prepared an insightful example to share some intuition with you, so that you can better understand.
1️⃣ Full BigQuery Code
WITH
CTE_flatten0 AS (
-- Confidential
),
CTE_flatten1 AS (
-- Confidential
),
CTE_users_min_gsn AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
user_pseudo_id,
MIN(ga_session_number) AS min_gsn
FROM CTE_flatten1
WHERE event_name = 'session_start'
GROUP BY event_date, user_pseudo_id
)
SELECT
date,
COUNT(DISTINCT user_pseudo_id) AS dau_all,
COUNT(DISTINCT CASE WHEN min_gsn <> 1 THEN user_pseudo_id END) AS dau_old,
COUNT(DISTINCT CASE WHEN min_gsn = 1 THEN user_pseudo_id END) AS dau_new
FROM CTE_users_min_gsn
GROUP BY date
ORDER BY date
;
2️⃣ Step by Step
First off, extract the daily initial session sequence values of all the users triggering session_start
event in the last 365 days.
- session sequence values are named
ga_session_number
in BigQuery Export Schema of Google Analytics 4. When a user(user_pseudo_id
) starts his/her session for the first time, thega_session_number
in the event parameters equals to 1, and then it increases sequentially each time the user returns and starts the new session again.
CTE_users_min_gsn AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
user_pseudo_id,
MIN(ga_session_number) AS min_gsn
FROM CTE_flatten1
WHERE event_name = 'session_start'
GROUP BY event_date, user_pseudo_id
)
For the second, count as new or existing users based on whether the daily initial session sequence value is 1 or not on a daily basis.
- A user named Joshua, who first visited your website for example, can trigger multiple
session_start
events in a single day. Even though he returns to the website after his first visit, he should be still regarded as one of the new users when calculating on a daily basis. That’s why I mentioned above that you need to extract the daily “initial” session sequence values of all the users.
SELECT
date,
COUNT(DISTINCT user_pseudo_id) AS dau_all,
COUNT(DISTINCT CASE WHEN min_gsn <> 1 THEN user_pseudo_id END) AS dau_old,
COUNT(DISTINCT CASE WHEN min_gsn = 1 THEN user_pseudo_id END) AS dau_new
FROM CTE_users_min_gsn
GROUP BY date
ORDER BY date
3️⃣ An Intuitive Example
On January 1, the DAU is like the following:
- 3 New Users (Joshua, Messi, and Jordan)
- 1 Existing User (James)
Keep in mind that Joshua and Jordan are regarded as the new users despite the fact that they made their second visit.
STEP 1. Identify the Session Sequence Values of All the Visitors.
STEP 2. Aggregate Each Visitor with Minimum Session Sequence Value
STEP 3. Classify Each Visitor as a New User or an Existing User based on whether or not the Minimum Session Sequence Value is 1.
4️⃣Review Full BigQuery Code
WITH
CTE_flatten0 AS (
-- Confidential
),
CTE_flatten1 AS (
-- Confidential
),
CTE_users_min_gsn AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
user_pseudo_id,
MIN(ga_session_number) AS min_gsn
FROM CTE_flatten1
WHERE event_name = 'session_start'
GROUP BY event_date, user_pseudo_id
)
SELECT
date,
COUNT(DISTINCT user_pseudo_id) AS dau_all,
COUNT(DISTINCT CASE WHEN min_gsn <> 1 THEN user_pseudo_id END) AS dau_old,
COUNT(DISTINCT CASE WHEN min_gsn = 1 THEN user_pseudo_id END) AS dau_new
FROM CTE_users_min_gsn
GROUP BY date
ORDER BY date
;
Joshua Kim
- Data Analyst
- Master in AI Big Data
- LinkedIn Bio | LinkTree
- joshuajkim413@gmail.com