DENG-8767: Create events first seen table #8046

kbammarito · 2025-09-03T17:35:40Z

Description

This PR adds an Events First Seen table to the Glean Usage SQL generator.

sql_generators/glean_usage/events_first_seen.py

scholtzan · 2025-09-03T21:20:07Z

I think CI might be failing because in the generated metadata.yaml file the task_group isn't correctly resolved:

scheduling:
  dag_name: bqetl_glean_usage
  task_group: { { app_name } }
  date_partition_parameter: null
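
One way to catch this kind of problem before CI is to scan the generated file for placeholders that survived rendering. The sketch below is only an illustration: the `unresolved_placeholders` helper and the `AIRFLOW_MACROS` allowlist are hypothetical names, not part of bigquery-etl, and it assumes that `{{ds}}` is the only placeholder meant to stay in the output.

```python
import re
from typing import FrozenSet, List

# Matches "{{ name }}" as well as the spaced-out "{ { name } }" form.
PLACEHOLDER = re.compile(r"\{\s*\{\s*([\w.]+)\s*\}\s*\}")

# Assumption: placeholders that are intentionally left for Airflow to resolve.
AIRFLOW_MACROS = frozenset({"ds"})


def unresolved_placeholders(
    rendered_text: str, allowed: FrozenSet[str] = AIRFLOW_MACROS
) -> List[str]:
    """Return the stripped lines that still contain a non-allowlisted placeholder."""
    flagged = []
    for line in rendered_text.splitlines():
        names = PLACEHOLDER.findall(line)
        if any(name not in allowed for name in names):
            flagged.append(line.strip())
    return flagged


generated = """\
scheduling:
  dag_name: bqetl_glean_usage
  task_group: { { app_name } }
  parameters:
  - submission_date:DATE:{{ds}}
"""

print(unresolved_placeholders(generated))
```

Only the `task_group` line is reported here; the `{{ds}}` parameter is skipped because it is on the allowlist.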

sql_generators/glean_usage/templates/events_first_seen_v1.metadata.yaml

…data.yaml Co-authored-by: Anna Scholtz <[email protected]>

…create-events-first-seen-table

…:mozilla/bigquery-etl into DENG-8767-create-events-first-seen-table

dataops-ci-bot · 2025-12-09T15:55:01Z

Integration report for "Merge branch 'main' into DENG-8767-create-events-first-seen-table"

`sql.diff`

diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_glean_usage.py /tmp/workspace/generated-sql/dags/bqetl_glean_usage.py
--- /tmp/workspace/main-generated-sql/dags/bqetl_glean_usage.py	2025-12-09 15:51:29.000000000 +0000
+++ /tmp/workspace/generated-sql/dags/bqetl_glean_usage.py	2025-12-09 15:53:25.000000000 +0000
@@ -3806,6 +3806,24 @@
             firefox_desktop_derived__clients_last_seen_joined__v1
         )
 
+    firefox_desktop_derived__events_first_seen__v1 = bigquery_etl_query(
+        task_id="firefox_desktop_derived__events_first_seen__v1",
+        destination_table="events_first_seen_v1",
+        dataset_id="firefox_desktop_derived",
+        project_id="moz-fx-data-shared-prod",
+        owner="[email protected]",
+        email=[
+            "[email protected]",
+            "[email protected]",
+            "[email protected]",
+            "[email protected]",
+        ],
+        date_partition_parameter=None,
+        depends_on_past=True,
+        parameters=["submission_date:DATE:{{ds}}"],
+        task_group=task_group_firefox_desktop,
+    )
+
     firefox_desktop_derived__events_stream__v1 = GKEPodOperator(
         task_id="firefox_desktop_derived__events_stream__v1",
         arguments=[
@@ -7614,6 +7632,10 @@
         bigeye__firefox_desktop_derived__metrics_clients_last_seen__v1
     )
 
+    firefox_desktop_derived__events_first_seen__v1.set_upstream(
+        firefox_desktop_derived__events_stream__v1
+    )
+
     firefox_desktop_derived__events_stream__v1.set_upstream(
         wait_for_copy_deduplicate_all
     )
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop: events_first_seen
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived: events_first_seen_v1
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml	2025-12-09 15:50:16.000000000 +0000
@@ -0,0 +1,18 @@
+friendly_name: Events First Seen
+description: |-
+  Captures the earliest date that we observe an event for a particular client in the events_stream_v1 table.
+owners:
+- [email protected]
+- [email protected]
+labels:
+  owner1: kbammarito
+  owner2: vsabino
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  view.sql:
+  - moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1
+require_column_descriptions: false
+level: null
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql	2025-12-09 15:47:26.000000000 +0000
@@ -0,0 +1,8 @@
+-- Generated via bigquery_etl.glean_usage
+CREATE OR REPLACE VIEW
+  `moz-fx-data-shared-prod.firefox_desktop.events_first_seen`
+AS
+SELECT
+  *
+FROM
+  `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml	2025-12-09 15:50:18.000000000 +0000
@@ -0,0 +1,42 @@
+friendly_name: Events First Seen
+description: |-
+  Captures the earliest date that we observe an event for a particular client in the events_stream_v1 table.
+owners:
+- [email protected]
+- [email protected]
+labels:
+  schedule: daily
+  table_type: client_level
+  dag: bqetl_glean_usage
+  owner1: kbammarito
+  owner2: vsabino
+scheduling:
+  dag_name: bqetl_glean_usage
+  task_group: firefox_desktop
+  parameters:
+  - submission_date:DATE:{{ds}}
+  depends_on_past: true
+  date_partition_parameter: null
+bigquery:
+  time_partitioning:
+    type: day
+    field: event_first_seen_date
+    require_partition_filter: false
+    expiration_days: null
+  range_partitioning: null
+  clustering:
+    fields:
+    - event_category
+    - normalized_channel
+    - sample_id
+    - normalized_country_code
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  query.sql:
+  - moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1
+  - moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1
+require_column_descriptions: false
+level: null
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql	2025-12-09 15:47:26.000000000 +0000
@@ -0,0 +1,598 @@
+-- Generated via bigquery_etl.glean_usage
+{% if is_init() %}
+  (
+    WITH eventsstream AS (
+      SELECT
+        DATE(MIN(submission_timestamp)) AS submission_date,
+        DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        CAST(NULL AS string) AS criteria,
+        min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+        min_by(sample_id, submission_timestamp) AS sample_id,
+        MIN(submission_timestamp) AS first_submission_timestamp,
+        MIN(event_timestamp) AS first_event_timestamp,
+        min_by(event_extra, submission_timestamp) AS event_extra,
+        min_by(app_version_major, submission_timestamp) AS app_version_major,
+        min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+        min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+        min_by(normalized_os, submission_timestamp) AS normalized_os,
+        min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+      FROM
+        `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+      WHERE
+        -- initialize by looking over all of history
+        DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+        AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+        AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+        AND (TRUE)
+      GROUP BY
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        criteria
+    )
+    SELECT
+      *
+    FROM
+      eventsstream
+  )
+  UNION ALL
+    (
+      WITH eventsstream AS (
+        SELECT
+          DATE(MIN(submission_timestamp)) AS submission_date,
+          DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'chatbot_usage' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+        -- initialize by looking over all of history
+          DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.chatbot'
+            AND event_name = 'sidebar_toggle'
+            AND JSON_VALUE(event_extra.provider) <> 'none'
+          )
+        GROUP BY
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      )
+      SELECT
+        *
+      FROM
+        eventsstream
+    )
+  UNION ALL
+    (
+      WITH eventsstream AS (
+        SELECT
+          DATE(MIN(submission_timestamp)) AS submission_date,
+          DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'smart_tabgroup_save' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+        -- initialize by looking over all of history
+          DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'tabgroup'
+            AND (
+              (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%')
+              OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')
+            )
+          )
+        GROUP BY
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      )
+      SELECT
+        *
+      FROM
+        eventsstream
+    )
+  UNION ALL
+    (
+      WITH eventsstream AS (
+        SELECT
+          DATE(MIN(submission_timestamp)) AS submission_date,
+          DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'linkpreview_ai_consent' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+        -- initialize by looking over all of history
+          DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.linkpreview'
+            AND event_name = 'card_ai_consent'
+            AND JSON_VALUE(event_extra.option) = 'continue'
+          )
+        GROUP BY
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      )
+      SELECT
+        *
+      FROM
+        eventsstream
+    )
+{% else %}
+  (
+    WITH _current AS (
+      SELECT
+        @submission_date AS submission_date,
+        @submission_date AS event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        CAST(NULL AS string) AS criteria,
+        min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+        min_by(sample_id, submission_timestamp) AS sample_id,
+        MIN(submission_timestamp) AS first_submission_timestamp,
+        MIN(event_timestamp) AS first_event_timestamp,
+        min_by(event_extra, submission_timestamp) AS event_extra,
+        min_by(app_version_major, submission_timestamp) AS app_version_major,
+        min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+        min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+        min_by(normalized_os, submission_timestamp) AS normalized_os,
+        min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+      FROM
+        `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+      WHERE
+        DATE(submission_timestamp) = @submission_date
+        AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+        AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+        AND (TRUE)
+      GROUP BY
+        submission_date,
+        event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        criteria
+    ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+    _previous AS (
+      SELECT
+        submission_date,
+        event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        CAST(NULL AS string) AS criteria,
+        profile_group_id,
+        sample_id,
+        first_submission_timestamp,
+        first_event_timestamp,
+        event_extra,
+        app_version_major,
+        normalized_channel,
+        normalized_country_code,
+        normalized_os,
+        normalized_os_version
+      FROM
+        `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+      WHERE
+        event_first_seen_date > '2023-01-01'
+        AND event_first_seen_date < @submission_date
+    ),
+    _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+      SELECT
+        IF(
+          _previous.client_id IS NULL
+          OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+          _current,
+          _previous
+        ).*
+      FROM
+        _current
+      FULL OUTER JOIN
+        _previous
+        ON _current.client_id = _previous.client_id
+        AND _current.event = _previous.event
+        AND (
+          _current.criteria = _previous.criteria
+          OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+        )
+    )
+    SELECT
+      *
+    FROM
+      _joined
+  )
+  UNION ALL
+    (
+      WITH _current AS (
+        SELECT
+          @submission_date AS submission_date,
+          @submission_date AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'chatbot_usage' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+          DATE(submission_timestamp) = @submission_date
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.chatbot'
+            AND event_name = 'sidebar_toggle'
+            AND JSON_VALUE(event_extra.provider) <> 'none'
+          )
+        GROUP BY
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+      _previous AS (
+        SELECT
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'chatbot_usage' AS criteria,
+          profile_group_id,
+          sample_id,
+          first_submission_timestamp,
+          first_event_timestamp,
+          event_extra,
+          app_version_major,
+          normalized_channel,
+          normalized_country_code,
+          normalized_os,
+          normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+        WHERE
+          event_first_seen_date > '2023-01-01'
+          AND event_first_seen_date < @submission_date
+      ),
+      _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+        SELECT
+          IF(
+            _previous.client_id IS NULL
+            OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+            _current,
+            _previous
+          ).*
+        FROM
+          _current
+        FULL OUTER JOIN
+          _previous
+          ON _current.client_id = _previous.client_id
+          AND _current.event = _previous.event
+          AND (
+            _current.criteria = _previous.criteria
+            OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+          )
+      )
+      SELECT
+        *
+      FROM
+        _joined
+    )
+  UNION ALL
+    (
+      WITH _current AS (
+        SELECT
+          @submission_date AS submission_date,
+          @submission_date AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'smart_tabgroup_save' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+          DATE(submission_timestamp) = @submission_date
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'tabgroup'
+            AND (
+              (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%')
+              OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')
+            )
+          )
+        GROUP BY
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+      _previous AS (
+        SELECT
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'smart_tabgroup_save' AS criteria,
+          profile_group_id,
+          sample_id,
+          first_submission_timestamp,
+          first_event_timestamp,
+          event_extra,
+          app_version_major,
+          normalized_channel,
+          normalized_country_code,
+          normalized_os,
+          normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+        WHERE
+          event_first_seen_date > '2023-01-01'
+          AND event_first_seen_date < @submission_date
+      ),
+      _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+        SELECT
+          IF(
+            _previous.client_id IS NULL
+            OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+            _current,
+            _previous
+          ).*
+        FROM
+          _current
+        FULL OUTER JOIN
+          _previous
+          ON _current.client_id = _previous.client_id
+          AND _current.event = _previous.event
+          AND (
+            _current.criteria = _previous.criteria
+            OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+          )
+      )
+      SELECT
+        *
+      FROM
+        _joined
+    )
+  UNION ALL
+    (
+      WITH _current AS (
+        SELECT
+          @submission_date AS submission_date,
+          @submission_date AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'linkpreview_ai_consent' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+          DATE(submission_timestamp) = @submission_date
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.linkpreview'
+            AND event_name = 'card_ai_consent'
+            AND JSON_VALUE(event_extra.option) = 'continue'
+          )
+        GROUP BY
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+      _previous AS (
+        SELECT
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'linkpreview_ai_consent' AS criteria,
+          profile_group_id,
+          sample_id,
+          first_submission_timestamp,
+          first_event_timestamp,
+          event_extra,
+          app_version_major,
+          normalized_channel,
+          normalized_country_code,
+          normalized_os,
+          normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+        WHERE
+          event_first_seen_date > '2023-01-01'
+          AND event_first_seen_date < @submission_date
+      ),
+      _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+        SELECT
+          IF(
+            _previous.client_id IS NULL
+            OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+            _current,
+            _previous
+          ).*
+        FROM
+          _current
+        FULL OUTER JOIN
+          _previous
+          ON _current.client_id = _previous.client_id
+          AND _current.event = _previous.event
+          AND (
+            _current.criteria = _previous.criteria
+            OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+          )
+      )
+      SELECT
+        *
+      FROM
+        _joined
+    )
+{% endif %}
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml	2025-12-09 15:47:26.000000000 +0000
@@ -0,0 +1,52 @@
+fields:
+  - mode: NULLABLE
+    name: submission_date
+    type: DATE
+  - mode: NULLABLE
+    name: event_first_seen_date
+    type: DATE
+  - mode: NULLABLE
+    name: client_id
+    type: STRING
+  - mode: NULLABLE
+    name: event
+    type: STRING
+  - mode: NULLABLE
+    name: event_category
+    type: STRING
+  - mode: NULLABLE
+    name: event_name
+    type: STRING
+  - mode: NULLABLE
+    name: criteria
+    type: STRING
+  - mode: NULLABLE
+    name: profile_group_id
+    type: STRING
+  - mode: NULLABLE
+    name: sample_id
+    type: INTEGER
+  - mode: NULLABLE
+    name: first_submission_timestamp
+    type: TIMESTAMP
+  - mode: NULLABLE
+    name: first_event_timestamp
+    type: TIMESTAMP
+  - mode: NULLABLE
+    name: event_extra
+    type: JSON
+  - mode: NULLABLE
+    name: app_version_major
+    type: NUMERIC
+  - mode: NULLABLE
+    name: normalized_channel
+    type: STRING
+  - mode: NULLABLE
+    name: normalized_country_code
+    type: STRING
+  - mode: NULLABLE
+    name: normalized_os
+    type: STRING
+  - mode: NULLABLE
+    name: normalized_os_version
+    type: STRING
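
The incremental branch of the generated query.sql keeps, for each client, event and criteria combination, whichever of the previous and current rows was seen first. A minimal Python sketch of that intent follows, assuming rows reduced to four fields; `FirstSeenRow` and `merge_first_seen` are hypothetical names used only for illustration, and the sketch models the behaviour described in the query comments rather than the exact SQL.

```python
from datetime import date
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class FirstSeenRow(NamedTuple):
    client_id: str
    event: str
    criteria: Optional[str]  # None stands in for the NULL criteria
    event_first_seen_date: date


Key = Tuple[str, str, Optional[str]]


def merge_first_seen(
    previous: Iterable[FirstSeenRow], current: Iterable[FirstSeenRow]
) -> Dict[Key, FirstSeenRow]:
    """Full-outer-join style merge keyed on (client_id, event, criteria)."""
    merged: Dict[Key, FirstSeenRow] = {
        (r.client_id, r.event, r.criteria): r for r in previous
    }
    for row in current:
        key = (row.client_id, row.event, row.criteria)
        prior = merged.get(key)
        # Same condition as the IF(...) in _joined: take the current row when
        # there is no previous row or the previous date is not earlier.
        if prior is None or prior.event_first_seen_date >= row.event_first_seen_date:
            merged[key] = row
    return merged


previous_rows = [FirstSeenRow("c1", "genai.chatbot.sidebar_toggle", None, date(2025, 12, 1))]
current_rows = [
    FirstSeenRow("c1", "genai.chatbot.sidebar_toggle", None, date(2025, 12, 9)),
    FirstSeenRow("c2", "genai.chatbot.sidebar_toggle", None, date(2025, 12, 9)),
]
result = merge_first_seen(previous_rows, current_rows)
```

In this example `c1` keeps its 2025-12-01 row and `c2` is added with 2025-12-09. Tuple keys compare `None == None` as equal, which plays the role of the explicit `IS NULL AND IS NULL` clause in the join condition.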

sean-rose · 2025-12-09T20:00:00Z

sql_generators/glean_usage/templates/events_first_seen_templating.yaml

+apps:
+  firefox_desktop:
+    criteria:
+      - name: CAST(NULL as string)


suggestions (non-blocking):

I'd advise against using null as a special criteria name value here because nulls are harder to work with and easy to misinterpret. Instead I'd suggest using some more meaningful name like any_event (or even a less scrutable special value like * would be better than null IMO).

I'd also advise against using raw SQL syntax for the criteria names, as that's making it difficult to configure things correctly (as evidenced by the comment above about how to YAML-encode the SQL strings for criteria names, which IMO ideally shouldn't be necessary).

Even if you decide you want to stick with using null as a special criteria name value, you could encode that as a YAML null literal here and then do something like {{ ("'" ~ item["name"] ~ "'") if item["name"] else "CAST(NULL AS STRING)" }} AS criteria in the ETL query template.
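
For reference, the fallback in that template expression behaves like the following Python function. This is only a sketch of the logic; `criteria_name_sql` is a hypothetical name, and like the Jinja expression it does no escaping of quotes inside the name.

```python
from typing import Optional


def criteria_name_sql(name: Optional[str]) -> str:
    """Quote a criteria name as a SQL string literal, or emit a typed NULL."""
    if not name:
        # A YAML null (or empty) name falls back to the typed NULL literal.
        return "CAST(NULL AS STRING)"
    return "'" + name + "'"


print(criteria_name_sql("chatbot_usage"))
print(criteria_name_sql(None))
```

With this in the template, the YAML config could hold plain names such as `chatbot_usage` instead of pre-quoted SQL like `"'chatbot_usage'"`.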

And if you accept both of those suggestions you could also encode the criteria in a more natural way as a dictionary, like:

apps:
  <app name>:
    criteria:
      <criteria name>: |
        <criteria SQL>

(in which case you'd want to loop over the criteria in the ETL query template like {% for criteria_name, criteria_sql in criteria.items() %})
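As a rough illustration of the dictionary encoding, here is a minimal Python sketch of how plain criteria names could be turned into SQL string literals by the generator instead of being written as raw SQL in the YAML. The `TEMPLATING` dict, the `any_event` sentinel with its `TRUE` condition, and the helper names are all hypothetical; only the `chatbot_usage` condition is taken from this PR.

```python
# Hypothetical parsed form of events_first_seen_templating.yaml using the
# dictionary layout: app name -> criteria name -> criteria SQL.
TEMPLATING = {
    "apps": {
        "firefox_desktop": {
            "criteria": {
                # "any_event" is an assumed sentinel name replacing the NULL criteria.
                "any_event": "TRUE",
                "chatbot_usage": (
                    "event_category = 'genai.chatbot'\n"
                    "AND event_name = 'sidebar_toggle'\n"
                    "AND JSON_VALUE(event_extra.provider) <> 'none'"
                ),
            }
        }
    }
}


def sql_string_literal(value: str) -> str:
    """Quote a plain name as a SQL string literal, escaping backslashes and quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def criteria_select_exprs(app_name: str) -> list[str]:
    """Return one '<literal> AS criteria' expression per configured criteria name."""
    criteria = TEMPLATING["apps"][app_name]["criteria"]
    return [f"{sql_string_literal(name)} AS criteria" for name in criteria]


print(criteria_select_exprs("firefox_desktop"))
```

With this layout the YAML author only writes bare names, and quoting happens in one place.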

sean-rose · 2025-12-09T20:11:15Z

sql_generators/glean_usage/templates/events_first_seen_templating.yaml

+      - name: "'chatbot_usage'"
+        sql: event_category = 'genai.chatbot' AND event_name = 'sidebar_toggle' AND JSON_VALUE(event_extra.provider) <> 'none'
+      - name: "'smart_tabgroup_save'"
+        sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save'))


suggestion (non-blocking): IMO these criteria SQL values would be easier to read/edit if they were formatted on multiple lines in YAML block scalars, like:

Suggested change

-        sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save'))
+        sql: |
+          event_category = 'tabgroup'
+          AND (
+            (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%')
+            OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')
+          )

(if you accept this suggestion I'd recommend consistently using YAML block scalars for all criteria SQL values, not just the very long ones)

sean-rose · 2025-12-09T22:10:52Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+WITH eventsstream AS (
+  SELECT
+  DATE(MIN(submission_timestamp)) as submission_date,
+  DATE(MIN(submission_timestamp)) as event_first_seen_date,


questions (non-blocking):

Why have submission_date and event_first_seen_date columns with the exact same value?

Why have these date columns when you have the more precise first_submission_timestamp column?

sean-rose · 2025-12-09T22:13:56Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+  `event`,
+  event_category,
+  event_name,


question (blocking): Are you absolutely certain you want to group by the event/category/name?

Some of the configured criteria could match multiple types of events, so grouping by event/category/name means you could get multiple rows per client for those criteria (one for each distinct event type that matches the criteria's conditions).
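To make the concern concrete, a small Python sketch that emulates the GROUP BY on toy data. The rows and the `category.name` form of the `event` value are assumptions for illustration; the point is only that one client matching `smart_tabgroup_save` through two different event types yields two rows when `event` is part of the grouping key.

```python
from collections import defaultdict

# Toy rows for one client; both match the smart_tabgroup_save criteria
# quoted earlier in this review, but they are different event types.
rows = [
    {"client_id": "c1", "criteria": "smart_tabgroup_save",
     "event": "tabgroup.smart_tab_suggest", "submission_timestamp": 10},
    {"client_id": "c1", "criteria": "smart_tabgroup_save",
     "event": "tabgroup.smart_tab_topic", "submission_timestamp": 20},
]


def first_seen(rows, key_columns):
    """Emulate GROUP BY key_columns with MIN(submission_timestamp)."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        groups[key].append(row["submission_timestamp"])
    return {key: min(values) for key, values in groups.items()}


with_event = first_seen(rows, ["client_id", "criteria", "event"])
without_event = first_seen(rows, ["client_id", "criteria"])

# Number of "first seen" rows for this client under each grouping.
print(len(with_event), len(without_event))
```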

sean-rose · 2025-12-09T22:50:09Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+  min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+  min_by(sample_id, submission_timestamp) AS sample_id,
+  MIN(submission_timestamp) AS first_submission_timestamp,
+  MIN(event_timestamp) AS first_event_timestamp,
+  min_by(event_extra, submission_timestamp) AS event_extra,
+  min_by(app_version_major, submission_timestamp) AS app_version_major,
+  min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+  min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+  min_by(normalized_os, submission_timestamp) AS normalized_os,
+  min_by(normalized_os_version, submission_timestamp) AS normalized_os_version


issues (blocking):

Events submitted in the same events ping will all have the same submission_timestamp, so there could easily be ties based on submission_timestamp values and these MIN_BY() calls might not end up selecting the properties for the actual first such event.

Even if you also compare using event_timestamp to break submission_timestamp ties it's currently impossible to guarantee there won't be any ties, as events could also have the same event_timestamp (though my planned solution for DENG-9800 might eventually help with this). And if ties happen then the multiple MIN_BY() calls here aren't necessarily guaranteed to all end up getting their data from the same event record.

So I'd suggest using event_timestamp in the comparison and getting all the necessary data in a single aggregate call. Unfortunately, event_timestamp can have some wild values (e.g. far in the past or future) so a simple COALESCE(event_timestamp, submission_timestamp) probably isn't safe to use, and MIN_BY() doesn't allow comparing multiple values. However, you could do something like this:

ARRAY_AGG(
  STRUCT(
    profile_group_id,
    sample_id,
    submission_timestamp AS first_submission_timestamp,
    event_timestamp AS first_event_timestamp,
    event_extra,
    app_version_major,
    normalized_channel,
    normalized_country_code,
    normalized_os,
    normalized_os_version
  )
  ORDER BY
    submission_timestamp,
    event_timestamp NULLS LAST
  LIMIT 1
)[0].*
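The tie problem can be sketched in Python on toy data. The `Event` class and its values are hypothetical; Python's `min` stands in for `MIN_BY()` (it returns the first minimal element, so the result depends on input order, which is a stand-in for the arbitrary tie-breaking in SQL), and `first_event` stands in for the single ordered pick.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    submission_timestamp: int
    event_timestamp: Optional[int]
    normalized_os: str
    event_extra: str


def by_submission(events):
    """Pick the minimum keyed only on submission_timestamp (ties are order-dependent)."""
    return min(events, key=lambda e: e.submission_timestamp)


def first_event(events):
    """Pick one whole record, ordered like
    ORDER BY submission_timestamp, event_timestamp NULLS LAST LIMIT 1."""
    return min(
        events,
        key=lambda e: (
            e.submission_timestamp,
            e.event_timestamp is None,  # False sorts first, so NULLs go last
            e.event_timestamp if e.event_timestamp is not None else 0,
        ),
    )


# Two events from the same ping share a submission_timestamp.
later = Event(100, 7, "Windows", '{"action": "save"}')
earlier = Event(100, 3, "Windows", '{"action": "save_all"}')

# Keyed on submission_timestamp alone, the pick changes with input order.
print(by_submission([later, earlier]).event_extra)
print(by_submission([earlier, later]).event_extra)

# The single ordered pick returns the same record either way.
print(first_event([later, earlier]) == first_event([earlier, later]))
```

If both timestamps tie, the pick is still arbitrary, but every field then comes from the same record.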

sean-rose · 2025-12-09T23:11:32Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+        -- initialize by looking over all of history
+  DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size


issue (possibly blocking): As it stands the init query for firefox_desktop_derived.events_first_seen_v1 is going to be scanning petabytes of data, so I suspect having the init query run per sample ID may be necessary to avoid having that take an exceedingly long time (blocking the artifact deployment process, and possibly even hitting the 6-hour query runtime limit).

@BenWu what do you think? Any advice on how to estimate how long such a huge query is likely to take?
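For reference, a minimal Python sketch of how the init could be split into sample ID batches. It assumes `sample_id` ranges over 0 to 99 and that each batch would be passed as the `@sample_id` and `@sampling_batch_size` parameters from the commented-out filters; the `sample_id_batches` helper is hypothetical and not part of bqetl.

```python
def sample_id_batches(batch_size: int, total_sample_ids: int = 100):
    """Yield (sample_id, sampling_batch_size) pairs covering 0..total_sample_ids-1."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_sample_ids, batch_size):
        # The last batch is shortened so it never runs past the final sample ID.
        yield start, min(batch_size, total_sample_ids - start)


for sample_id, sampling_batch_size in sample_id_batches(25):
    # These map onto the commented-out @sample_id / @sampling_batch_size filters.
    print(f"sample_id >= {sample_id} AND sample_id < {sample_id} + {sampling_batch_size}")
```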

sean-rose · 2025-12-09T23:13:45Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+  DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+  AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')


suggestion (non-blocking): It'd be helpful to have a comment explaining why these particular event categories are being excluded.

sean-rose · 2025-12-09T23:24:17Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+  AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null


suggestions (non-blocking):

This comment explains what is being done, but we could already tell that by looking at the code in question. Better would be having a comment that explains why this is only including Firefox Desktop events that have profile_group_id values.

In any case, the comment should go within the {% if app_id == 'firefox_desktop' -%} block (otherwise the generated ETL queries for other apps will confusingly contain the comment but not the associated SQL code).

sean-rose · 2025-12-10T00:04:39Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+  min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+  min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+  min_by(normalized_os, submission_timestamp) AS normalized_os,
+  min_by(normalized_os_version, submission_timestamp) AS normalized_os_version


suggestion (non-blocking): Apparently normalized_os_version ends up being 10.0 for both Windows 10 and 11 (though IMO that seems like a bug we should fix), so I'd suggest also including client_info.windows_build_number to allow differentiating between Windows 10 and 11 (e.g. DENG-8570).
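A hedged sketch of how a consumer could use the build number, written in Python for illustration. The `windows_release` helper is hypothetical, and it assumes `normalized_os` is `"Windows"` for these clients; the 22000 threshold is the first Windows 11 build number.

```python
from typing import Optional

# First Windows 11 build; Windows 10 builds are below this.
WINDOWS_11_MIN_BUILD = 22000


def windows_release(
    normalized_os: str,
    normalized_os_version: str,
    windows_build_number: Optional[int],
) -> Optional[str]:
    """Split the ambiguous '10.0' OS version into Windows 10 vs Windows 11."""
    if normalized_os != "Windows" or normalized_os_version != "10.0":
        return None  # Not the ambiguous case this helper covers.
    if windows_build_number is None:
        return None  # Cannot tell without the build number.
    if windows_build_number >= WINDOWS_11_MIN_BUILD:
        return "Windows 11"
    return "Windows 10"


print(windows_release("Windows", "10.0", 19045))
print(windows_release("Windows", "10.0", 22631))
```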

sean-rose · 2025-12-10T00:07:12Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

+          AND _current.event = _previous.event
+          AND (_current.criteria = _previous.criteria
+              OR (_current.criteria IS NULL AND _previous.criteria IS NULL))


If you decide not to group by event/category/name then the event join condition here should be removed.

If you decide to avoid using null criteria name values then the criteria join condition here could be simplified.
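The reason the extra `IS NULL` branch exists, and why a non-null sentinel removes the need for it, can be sketched in Python. `None` stands in for SQL NULL, and the `any_event` name is the hypothetical sentinel from the earlier suggestion.

```python
def sql_equals(left, right):
    """Emulate SQL '=': NULL (None) on either side yields NULL, which never matches."""
    if left is None or right is None:
        return None
    return left == right


def current_join_matches(current, previous):
    """The join condition as written: equality OR both sides NULL."""
    return bool(sql_equals(current, previous)) or (current is None and previous is None)


def simplified_join_matches(current, previous):
    """With a non-null sentinel criteria name, plain equality is enough."""
    return bool(sql_equals(current, previous))


# NULL criteria need the extra branch to match each other.
print(current_join_matches(None, None), simplified_join_matches(None, None))
# A sentinel name matches under plain equality.
print(simplified_join_matches("any_event", "any_event"))
```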

sean-rose · 2025-12-10T00:28:29Z

sql_generators/glean_usage/templates/events_first_seen_v1.query.sql

suggestions (non-blocking):

Formatting this file in a more bqetl format-like way would help with readability.

There's a fair amount of duplicated logic between the init and non-init parts of the SQL which would be good to avoid if possible.

sean-rose · 2025-12-10T00:31:39Z

sql_generators/glean_usage/templates/events_first_seen_v1.schema.yaml

+  - mode: NULLABLE
+    name: submission_date
+    type: DATE


nitpick: IMO it's good to keep the schema field attributes consistently in the order name, type, mode, description, fields for readability, rather than having the attributes be sorted alphabetically like this.
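If it helps, the preferred ordering could be applied mechanically. A minimal Python sketch operating on an already-parsed schema field (the `reorder_field` helper is hypothetical and not an existing bqetl function):

```python
FIELD_ATTRIBUTE_ORDER = ["name", "type", "mode", "description", "fields"]


def reorder_field(field: dict) -> dict:
    """Return a copy of a schema field with attributes in the preferred order."""
    ordered = {key: field[key] for key in FIELD_ATTRIBUTE_ORDER if key in field}
    # Keep any attribute outside the preferred list, after the known ones.
    ordered.update({k: v for k, v in field.items() if k not in ordered})
    if "fields" in ordered:
        # Nested record fields get the same ordering.
        ordered["fields"] = [reorder_field(sub) for sub in ordered["fields"]]
    return ordered


field = {"mode": "NULLABLE", "name": "submission_date", "type": "DATE"}
print(list(reorder_field(field)))
```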

sean-rose · 2025-12-10T00:36:14Z

sql_generators/glean_usage/events_first_seen.py

+        self.per_app_id_enabled = True
+        self.across_apps_enabled = False
+        self.cross_channel_template = "cross_channel_events_first_seen.view.sql"
+        self.base_table_name = "events_v1"


suggestion (non-blocking): While this doesn't really matter since you're using an opt-in approach, it would be more accurate to define the base table as events_stream_v1.

Suggested change

-        self.base_table_name = "events_v1"
+        self.base_table_name = "events_stream_v1"

sean-rose · 2025-12-10T00:47:14Z

sql_generators/glean_usage/events_first_seen.py

+                parallelism=parallelism,
+                id_token=id_token,
+                custom_render_kwargs={
+                    "app_id": app_id_info["bq_dataset_family"],


suggestion (non-blocking): IMO calling this value app_id is a bit confusing/misleading, so I'd suggest changing this to something like app_id_dataset (and updating the associated Jinja).

kbammarito added 2 commits September 3, 2025 13:23

add events_first_seen to existing architecture

2616a34

add events_first_seen files

c9a3397

kbammarito requested review from alekhyamoz, bani and scholtzan September 3, 2025 17:35

scholtzan reviewed Sep 3, 2025

sql_generators/glean_usage/events_first_seen.py Outdated

scholtzan reviewed Sep 3, 2025

sql_generators/glean_usage/templates/events_first_seen_v1.metadata.yaml Outdated

Update sql_generators/glean_usage/templates/events_first_seen_v1.meta…

2bec78f

…data.yaml Co-authored-by: Anna Scholtz <[email protected]>