[Data] It’s Harder than You Think to Extract DAU Separating New and Existing Users in BigQuery

Joshua Kim
3 min readJun 14, 2023

--

Data Pipeline from the website all the way up to Redash

In this article, I’m going to talk about how to extract DAU separating new users and existing users (or returning users) directly from BigQuery. I prepared an insightful example to share some intuition with you, so that you can better understand.

1️⃣ Full BigQuery Code

WITH
CTE_flatten0 AS (
-- Confidential
),

CTE_flatten1 AS (
-- Confidential
),

CTE_users_min_gsn AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
user_pseudo_id,
MIN(ga_session_number) AS min_gsn
FROM CTE_flatten1
WHERE event_name = 'session_start'
GROUP BY event_date, user_pseudo_id
)

SELECT
date,
COUNT(DISTINCT user_pseudo_id) AS dau_all,
COUNT(DISTINCT CASE WHEN min_gsn <> 1 THEN user_pseudo_id END) AS dau_old,
COUNT(DISTINCT CASE WHEN min_gsn = 1 THEN user_pseudo_id END) AS dau_new
FROM CTE_users_min_gsn
GROUP BY date
ORDER BY date
;

2️⃣ Step by Step

First off, extract the daily initial session sequence values of all the users triggering session_start event in the last 365 days.

  • session sequence values are named ga_session_number in BigQuery Export Schema of Google Analytics 4. When a user(user_pseudo_id) starts his/her session for the first time, the ga_session_number in the event parameters equals to 1, and then it increases sequentially each time the user returns and starts the new session again.
CTE_users_min_gsn AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
user_pseudo_id,
MIN(ga_session_number) AS min_gsn
FROM CTE_flatten1
WHERE event_name = 'session_start'
GROUP BY event_date, user_pseudo_id
)

For the second, count as new or existing users based on whether the daily initial session sequence value is 1 or not on a daily basis.

  • A user named Joshua, who first visited your website for example, can trigger multiple session_start events in a single day. Even though he returns to the website after his first visit, he should be still regarded as one of the new users when calculating on a daily basis. That’s why I mentioned above that you need to extract the daily “initial” session sequence values of all the users.
SELECT
date,
COUNT(DISTINCT user_pseudo_id) AS dau_all,
COUNT(DISTINCT CASE WHEN min_gsn <> 1 THEN user_pseudo_id END) AS dau_old,
COUNT(DISTINCT CASE WHEN min_gsn = 1 THEN user_pseudo_id END) AS dau_new
FROM CTE_users_min_gsn
GROUP BY date
ORDER BY date

3️⃣ An Intuitive Example

On January 1, the DAU is like the following:

  • 3 New Users (Joshua, Messi, and Jordan)
  • 1 Existing User (James)

Keep in mind that Joshua and Jordan are regarded as the new users despite the fact that they made their second visit.

STEP 1. Identify the Session Sequence Values of All the Visitors.

STEP 2. Aggregate Each Visitor with Minimum Session Sequence Value

STEP 3. Classify Each Visitor as a New User or an Existing User based on whether or not the Minimum Session Sequence Value is 1.

4️⃣Review Full BigQuery Code

WITH
CTE_flatten0 AS (
-- Confidential
),

CTE_flatten1 AS (
-- Confidential
),

CTE_users_min_gsn AS (
SELECT
PARSE_DATE('%Y%m%d', event_date) AS date,
user_pseudo_id,
MIN(ga_session_number) AS min_gsn
FROM CTE_flatten1
WHERE event_name = 'session_start'
GROUP BY event_date, user_pseudo_id
)

SELECT
date,
COUNT(DISTINCT user_pseudo_id) AS dau_all,
COUNT(DISTINCT CASE WHEN min_gsn <> 1 THEN user_pseudo_id END) AS dau_old,
COUNT(DISTINCT CASE WHEN min_gsn = 1 THEN user_pseudo_id END) AS dau_new
FROM CTE_users_min_gsn
GROUP BY date
ORDER BY date
;

Joshua Kim

Photo by Caspar Camille Rubin on Unsplash

--

--