@kbammarito
Contributor

@kbammarito kbammarito commented Sep 3, 2025

Description

This PR adds an Events First Seen table to the Glean Usage SQL generator.

Related Tickets & Documents

Reviewer, please follow this checklist

@scholtzan
Collaborator

I think CI might be failing because in the generated metadata.yaml file the task_group isn't correctly resolved:

scheduling:
  dag_name: bqetl_glean_usage
  task_group: { { app_name } }
  date_partition_parameter: null
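
One way to catch this class of problem before CI does is to scan the generated metadata for template variables that were never rendered. The following is a minimal sketch, not part of bigquery-etl: the function name and the `AIRFLOW_MACROS` allowlist are hypothetical, and it assumes that Airflow macros such as `{{ds}}` are meant to survive generation while generator variables such as `app_name` are not.

```python
import re

# Matches "{{ name }}" as well as the split form "{ { name } }" quoted above.
PLACEHOLDER = re.compile(r"\{\s*\{\s*(\w+)\s*\}\s*\}")

# Hypothetical allowlist: macros that Airflow resolves at run time and that
# are therefore expected to remain in the generated metadata.
AIRFLOW_MACROS = {"ds"}


def find_unrendered_placeholders(text: str) -> list[tuple[int, str]]:
    """Return (line_number, variable_name) for each leftover generator variable."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            name = match.group(1)
            if name not in AIRFLOW_MACROS:
                hits.append((lineno, name))
    return hits


metadata = """\
scheduling:
  dag_name: bqetl_glean_usage
  task_group: { { app_name } }
  parameters:
  - submission_date:DATE:{{ds}}
"""

for lineno, name in find_unrendered_placeholders(metadata):
    print(f"line {lineno}: unrendered variable {name!r}")
```

The check only flags `task_group`, because `ds` is on the allowlist.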

@kbammarito kbammarito requested a review from scholtzan September 5, 2025 20:46
@dataops-ci-bot

Integration report for "Merge branch 'main' into DENG-8767-create-events-first-seen-table"

sql.diff

diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_glean_usage.py /tmp/workspace/generated-sql/dags/bqetl_glean_usage.py
--- /tmp/workspace/main-generated-sql/dags/bqetl_glean_usage.py	2025-12-09 15:51:29.000000000 +0000
+++ /tmp/workspace/generated-sql/dags/bqetl_glean_usage.py	2025-12-09 15:53:25.000000000 +0000
@@ -3806,6 +3806,24 @@
             firefox_desktop_derived__clients_last_seen_joined__v1
         )
 
+    firefox_desktop_derived__events_first_seen__v1 = bigquery_etl_query(
+        task_id="firefox_desktop_derived__events_first_seen__v1",
+        destination_table="events_first_seen_v1",
+        dataset_id="firefox_desktop_derived",
+        project_id="moz-fx-data-shared-prod",
+        owner="[email protected]",
+        email=[
+            "[email protected]",
+            "[email protected]",
+            "[email protected]",
+            "[email protected]",
+        ],
+        date_partition_parameter=None,
+        depends_on_past=True,
+        parameters=["submission_date:DATE:{{ds}}"],
+        task_group=task_group_firefox_desktop,
+    )
+
     firefox_desktop_derived__events_stream__v1 = GKEPodOperator(
         task_id="firefox_desktop_derived__events_stream__v1",
         arguments=[
@@ -7614,6 +7632,10 @@
         bigeye__firefox_desktop_derived__metrics_clients_last_seen__v1
     )
 
+    firefox_desktop_derived__events_first_seen__v1.set_upstream(
+        firefox_desktop_derived__events_stream__v1
+    )
+
     firefox_desktop_derived__events_stream__v1.set_upstream(
         wait_for_copy_deduplicate_all
     )
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop: events_first_seen
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived: events_first_seen_v1
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/metadata.yaml	2025-12-09 15:50:16.000000000 +0000
@@ -0,0 +1,18 @@
+friendly_name: Events First Seen
+description: |-
+  Captures the earliest date that we observe an event for a particular client in the events_stream_v1 table.
+owners:
+- [email protected]
+- [email protected]
+labels:
+  owner1: kbammarito
+  owner2: vsabino
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  view.sql:
+  - moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1
+require_column_descriptions: false
+level: null
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/events_first_seen/view.sql	2025-12-09 15:47:26.000000000 +0000
@@ -0,0 +1,8 @@
+-- Generated via bigquery_etl.glean_usage
+CREATE OR REPLACE VIEW
+  `moz-fx-data-shared-prod.firefox_desktop.events_first_seen`
+AS
+SELECT
+  *
+FROM
+  `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/metadata.yaml	2025-12-09 15:50:18.000000000 +0000
@@ -0,0 +1,42 @@
+friendly_name: Events First Seen
+description: |-
+  Captures the earliest date that we observe an event for a particular client in the events_stream_v1 table.
+owners:
+- [email protected]
+- [email protected]
+labels:
+  schedule: daily
+  table_type: client_level
+  dag: bqetl_glean_usage
+  owner1: kbammarito
+  owner2: vsabino
+scheduling:
+  dag_name: bqetl_glean_usage
+  task_group: firefox_desktop
+  parameters:
+  - submission_date:DATE:{{ds}}
+  depends_on_past: true
+  date_partition_parameter: null
+bigquery:
+  time_partitioning:
+    type: day
+    field: event_first_seen_date
+    require_partition_filter: false
+    expiration_days: null
+  range_partitioning: null
+  clustering:
+    fields:
+    - event_category
+    - normalized_channel
+    - sample_id
+    - normalized_country_code
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  query.sql:
+  - moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1
+  - moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1
+require_column_descriptions: false
+level: null
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/query.sql	2025-12-09 15:47:26.000000000 +0000
@@ -0,0 +1,598 @@
+-- Generated via bigquery_etl.glean_usage
+{% if is_init() %}
+  (
+    WITH eventsstream AS (
+      SELECT
+        DATE(MIN(submission_timestamp)) AS submission_date,
+        DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        CAST(NULL AS string) AS criteria,
+        min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+        min_by(sample_id, submission_timestamp) AS sample_id,
+        MIN(submission_timestamp) AS first_submission_timestamp,
+        MIN(event_timestamp) AS first_event_timestamp,
+        min_by(event_extra, submission_timestamp) AS event_extra,
+        min_by(app_version_major, submission_timestamp) AS app_version_major,
+        min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+        min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+        min_by(normalized_os, submission_timestamp) AS normalized_os,
+        min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+      FROM
+        `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+      WHERE
+        -- initialize by looking over all of history
+        DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+        AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+        AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+        AND (TRUE)
+      GROUP BY
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        criteria
+    )
+    SELECT
+      *
+    FROM
+      eventsstream
+  )
+  UNION ALL
+    (
+      WITH eventsstream AS (
+        SELECT
+          DATE(MIN(submission_timestamp)) AS submission_date,
+          DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'chatbot_usage' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+        -- initialize by looking over all of history
+          DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.chatbot'
+            AND event_name = 'sidebar_toggle'
+            AND JSON_VALUE(event_extra.provider) <> 'none'
+          )
+        GROUP BY
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      )
+      SELECT
+        *
+      FROM
+        eventsstream
+    )
+  UNION ALL
+    (
+      WITH eventsstream AS (
+        SELECT
+          DATE(MIN(submission_timestamp)) AS submission_date,
+          DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'smart_tabgroup_save' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+        -- initialize by looking over all of history
+          DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'tabgroup'
+            AND (
+              (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%')
+              OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')
+            )
+          )
+        GROUP BY
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      )
+      SELECT
+        *
+      FROM
+        eventsstream
+    )
+  UNION ALL
+    (
+      WITH eventsstream AS (
+        SELECT
+          DATE(MIN(submission_timestamp)) AS submission_date,
+          DATE(MIN(submission_timestamp)) AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'linkpreview_ai_consent' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+        -- initialize by looking over all of history
+          DATE(submission_timestamp) >= '2023-01-01'
+        -- AND sample_id >= @sample_id
+        -- AND sample_id < @sample_id + @sampling_batch_size
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.linkpreview'
+            AND event_name = 'card_ai_consent'
+            AND JSON_VALUE(event_extra.option) = 'continue'
+          )
+        GROUP BY
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      )
+      SELECT
+        *
+      FROM
+        eventsstream
+    )
+{% else %}
+  (
+    WITH _current AS (
+      SELECT
+        @submission_date AS submission_date,
+        @submission_date AS event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        CAST(NULL AS string) AS criteria,
+        min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+        min_by(sample_id, submission_timestamp) AS sample_id,
+        MIN(submission_timestamp) AS first_submission_timestamp,
+        MIN(event_timestamp) AS first_event_timestamp,
+        min_by(event_extra, submission_timestamp) AS event_extra,
+        min_by(app_version_major, submission_timestamp) AS app_version_major,
+        min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+        min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+        min_by(normalized_os, submission_timestamp) AS normalized_os,
+        min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+      FROM
+        `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+      WHERE
+        DATE(submission_timestamp) = @submission_date
+        AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+        AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+        AND (TRUE)
+      GROUP BY
+        submission_date,
+        event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        criteria
+    ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+    _previous AS (
+      SELECT
+        submission_date,
+        event_first_seen_date,
+        client_id,
+        `event`,
+        event_category,
+        event_name,
+        CAST(NULL AS string) AS criteria,
+        profile_group_id,
+        sample_id,
+        first_submission_timestamp,
+        first_event_timestamp,
+        event_extra,
+        app_version_major,
+        normalized_channel,
+        normalized_country_code,
+        normalized_os,
+        normalized_os_version
+      FROM
+        `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+      WHERE
+        event_first_seen_date > '2023-01-01'
+        AND event_first_seen_date < @submission_date
+    ),
+    _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+      SELECT
+        IF(
+          _previous.client_id IS NULL
+          OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+          _current,
+          _previous
+        ).*
+      FROM
+        _current
+      FULL OUTER JOIN
+        _previous
+        ON _current.client_id = _previous.client_id
+        AND _current.event = _previous.event
+        AND (
+          _current.criteria = _previous.criteria
+          OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+        )
+    )
+    SELECT
+      *
+    FROM
+      _joined
+  )
+  UNION ALL
+    (
+      WITH _current AS (
+        SELECT
+          @submission_date AS submission_date,
+          @submission_date AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'chatbot_usage' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+          DATE(submission_timestamp) = @submission_date
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.chatbot'
+            AND event_name = 'sidebar_toggle'
+            AND JSON_VALUE(event_extra.provider) <> 'none'
+          )
+        GROUP BY
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+      _previous AS (
+        SELECT
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'chatbot_usage' AS criteria,
+          profile_group_id,
+          sample_id,
+          first_submission_timestamp,
+          first_event_timestamp,
+          event_extra,
+          app_version_major,
+          normalized_channel,
+          normalized_country_code,
+          normalized_os,
+          normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+        WHERE
+          event_first_seen_date > '2023-01-01'
+          AND event_first_seen_date < @submission_date
+      ),
+      _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+        SELECT
+          IF(
+            _previous.client_id IS NULL
+            OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+            _current,
+            _previous
+          ).*
+        FROM
+          _current
+        FULL OUTER JOIN
+          _previous
+          ON _current.client_id = _previous.client_id
+          AND _current.event = _previous.event
+          AND (
+            _current.criteria = _previous.criteria
+            OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+          )
+      )
+      SELECT
+        *
+      FROM
+        _joined
+    )
+  UNION ALL
+    (
+      WITH _current AS (
+        SELECT
+          @submission_date AS submission_date,
+          @submission_date AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'smart_tabgroup_save' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+          DATE(submission_timestamp) = @submission_date
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'tabgroup'
+            AND (
+              (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%')
+              OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')
+            )
+          )
+        GROUP BY
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+      _previous AS (
+        SELECT
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'smart_tabgroup_save' AS criteria,
+          profile_group_id,
+          sample_id,
+          first_submission_timestamp,
+          first_event_timestamp,
+          event_extra,
+          app_version_major,
+          normalized_channel,
+          normalized_country_code,
+          normalized_os,
+          normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+        WHERE
+          event_first_seen_date > '2023-01-01'
+          AND event_first_seen_date < @submission_date
+      ),
+      _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+        SELECT
+          IF(
+            _previous.client_id IS NULL
+            OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+            _current,
+            _previous
+          ).*
+        FROM
+          _current
+        FULL OUTER JOIN
+          _previous
+          ON _current.client_id = _previous.client_id
+          AND _current.event = _previous.event
+          AND (
+            _current.criteria = _previous.criteria
+            OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+          )
+      )
+      SELECT
+        *
+      FROM
+        _joined
+    )
+  UNION ALL
+    (
+      WITH _current AS (
+        SELECT
+          @submission_date AS submission_date,
+          @submission_date AS event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'linkpreview_ai_consent' AS criteria,
+          min_by(profile_group_id, submission_timestamp) AS profile_group_id,
+          min_by(sample_id, submission_timestamp) AS sample_id,
+          MIN(submission_timestamp) AS first_submission_timestamp,
+          MIN(event_timestamp) AS first_event_timestamp,
+          min_by(event_extra, submission_timestamp) AS event_extra,
+          min_by(app_version_major, submission_timestamp) AS app_version_major,
+          min_by(normalized_channel, submission_timestamp) AS normalized_channel,
+          min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
+          min_by(normalized_os, submission_timestamp) AS normalized_os,
+          min_by(normalized_os_version, submission_timestamp) AS normalized_os_version,
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_stream_v1`
+        WHERE
+          DATE(submission_timestamp) = @submission_date
+          AND event_category NOT IN (
+            'media.playback',
+            'nimbus_events',
+            'uptake.remotecontent.result'
+          )
+        -- if app_id is firefox_desktop, filter for where profile_group_id is not null
+          AND profile_group_id IS NOT NULL
+        -- below is the templated criteria
+          AND (
+            event_category = 'genai.linkpreview'
+            AND event_name = 'card_ai_consent'
+            AND JSON_VALUE(event_extra.option) = 'continue'
+          )
+        GROUP BY
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          criteria
+      ),
+  -- query over all of history to see whether the client_id, event and criteria combination has shown up before
+      _previous AS (
+        SELECT
+          submission_date,
+          event_first_seen_date,
+          client_id,
+          `event`,
+          event_category,
+          event_name,
+          'linkpreview_ai_consent' AS criteria,
+          profile_group_id,
+          sample_id,
+          first_submission_timestamp,
+          first_event_timestamp,
+          event_extra,
+          app_version_major,
+          normalized_channel,
+          normalized_country_code,
+          normalized_os,
+          normalized_os_version
+        FROM
+          `moz-fx-data-shared-prod.firefox_desktop_derived.events_first_seen_v1`
+        WHERE
+          event_first_seen_date > '2023-01-01'
+          AND event_first_seen_date < @submission_date
+      ),
+      _joined AS (
+    --switch to using separate if statements instead of 1
+    --because dry run is struggling to validate the final struct
+        SELECT
+          IF(
+            _previous.client_id IS NULL
+            OR _previous.event_first_seen_date >= _current.event_first_seen_date,
+            _current,
+            _previous
+          ).*
+        FROM
+          _current
+        FULL OUTER JOIN
+          _previous
+          ON _current.client_id = _previous.client_id
+          AND _current.event = _previous.event
+          AND (
+            _current.criteria = _previous.criteria
+            OR (_current.criteria IS NULL AND _previous.criteria IS NULL)
+          )
+      )
+      SELECT
+        *
+      FROM
+        _joined
+    )
+{% endif %}
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/events_first_seen_v1/schema.yaml	2025-12-09 15:47:26.000000000 +0000
@@ -0,0 +1,52 @@
+fields:
+  - mode: NULLABLE
+    name: submission_date
+    type: DATE
+  - mode: NULLABLE
+    name: event_first_seen_date
+    type: DATE
+  - mode: NULLABLE
+    name: client_id
+    type: STRING
+  - mode: NULLABLE
+    name: event
+    type: STRING
+  - mode: NULLABLE
+    name: event_category
+    type: STRING
+  - mode: NULLABLE
+    name: event_name
+    type: STRING
+  - mode: NULLABLE
+    name: criteria
+    type: STRING
+  - mode: NULLABLE
+    name: profile_group_id
+    type: STRING
+  - mode: NULLABLE
+    name: sample_id
+    type: INTEGER
+  - mode: NULLABLE
+    name: first_submission_timestamp
+    type: TIMESTAMP
+  - mode: NULLABLE
+    name: first_event_timestamp
+    type: TIMESTAMP
+  - mode: NULLABLE
+    name: event_extra
+    type: JSON
+  - mode: NULLABLE
+    name: app_version_major
+    type: NUMERIC
+  - mode: NULLABLE
+    name: normalized_channel
+    type: STRING
+  - mode: NULLABLE
+    name: normalized_country_code
+    type: STRING
+  - mode: NULLABLE
+    name: normalized_os
+    type: STRING
+  - mode: NULLABLE
+    name: normalized_os_version
+    type: STRING

Link to full diff
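
For readers skimming the generated query, the incremental branch (`_current` FULL OUTER JOIN `_previous`, then `IF(...)`) can be sketched in plain Python. This is an illustration of the selection rule only, under the assumption that each row is identified by `(client_id, event, criteria)`; the function name and the dictionary-based row shape are hypothetical and do not appear in the PR.

```python
from datetime import date


def merge_first_seen(previous: dict, current: dict) -> dict:
    """Pick one row per (client_id, event, criteria) key.

    Mirrors the IF(...) in _joined: take the current row when there is no
    previous row, or when the previous first-seen date is not earlier than
    the current one; otherwise keep the previous row.
    """
    merged = {}
    for key in previous.keys() | current.keys():
        prev_row = previous.get(key)
        cur_row = current.get(key)
        if prev_row is None:
            # Key seen for the first time on this submission date.
            merged[key] = cur_row
        elif (
            cur_row is not None
            and prev_row["event_first_seen_date"] >= cur_row["event_first_seen_date"]
        ):
            merged[key] = cur_row
        else:
            # Covers keys absent today and keys already seen on an earlier date.
            merged[key] = prev_row
    return merged


# A criteria of None compares equal to None in a tuple key, which plays the
# role of the explicit "both IS NULL" clause in the SQL join condition.
previous = {
    ("c1", "genai.chatbot.sidebar_toggle", None): {"event_first_seen_date": date(2025, 1, 1)},
}
current = {
    ("c1", "genai.chatbot.sidebar_toggle", None): {"event_first_seen_date": date(2025, 1, 2)},
    ("c2", "genai.chatbot.sidebar_toggle", None): {"event_first_seen_date": date(2025, 1, 2)},
}
merged = merge_first_seen(previous, current)
```

Here `c1` keeps its earlier date from `previous`, while `c2` is new and takes the row from `current`.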

@kbammarito kbammarito requested a review from sean-rose December 9, 2025 16:09
apps:
  firefox_desktop:
    criteria:
      - name: CAST(NULL as string)
Contributor

suggestions (non-blocking):

  • I'd advise against using null as a special criteria name value here because nulls are harder to work with and easy to misinterpret. Instead I'd suggest using some more meaningful name like any_event (or even a less scrutable special value like * would be better than null IMO).
  • I'd also advise against using raw SQL syntax for the criteria names, as that's making it difficult to configure things correctly (as evidenced by the comment above about how to YAML-encode the SQL strings for criteria names, which IMO ideally shouldn't be necessary).
    • Even if you decide you want to stick with using null as a special criteria name value, you could encode that as a YAML null literal here and then do something like {{ ("'" ~ item["name"] ~ "'") if item["name"] else "CAST(NULL AS STRING)" }} AS criteria in the ETL query template.

And if you accept both of those suggestions you could also encode the criteria in a more natural way as a dictionary, like:

apps:
  <app name>:
    criteria:
      <criteria name>: |
        <criteria SQL>

(in which case you'd want to loop over the criteria in the ETL query template like {% for criteria_name, criteria_sql in criteria.items() %})
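
A minimal Python sketch of what the dictionary-style configuration could look like once the SQL quoting moves out of the YAML and into the generator. The helper names `sql_string_literal` and `render_criteria`, and the `any_event` criterion with a `TRUE` predicate, are hypothetical illustrations and not part of bigquery-etl; in the actual PR this logic would live in the Jinja template.

```python
# Hypothetical sketch: criteria configured as {plain name: SQL predicate}.
# The generator, not the YAML author, is responsible for SQL quoting.


def sql_string_literal(value):
    """Quote a Python string as a SQL string literal, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def render_criteria(criteria):
    """Return (select expression, predicate) pairs, one per configured criterion."""
    rendered = []
    for criteria_name, criteria_sql in criteria.items():
        select_expr = f"{sql_string_literal(criteria_name)} AS criteria"
        # strip() only removes the trailing newline a YAML block scalar would add
        rendered.append((select_expr, criteria_sql.strip()))
    return rendered


# Assumed example config; "any_event" replaces the NULL criteria name.
example_criteria = {
    "any_event": "TRUE",
    "chatbot_usage": "event_category = 'genai.chatbot'\nAND event_name = 'sidebar_toggle'\n",
}
rendered = render_criteria(example_criteria)
```

With plain names in the config, the `"'chatbot_usage'"` double quoting is no longer needed, and a non-null name such as `any_event` avoids the special NULL handling entirely.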

      - name: "'chatbot_usage'"
        sql: event_category = 'genai.chatbot' AND event_name = 'sidebar_toggle' AND JSON_VALUE(event_extra.provider) <> 'none'
      - name: "'smart_tabgroup_save'"
        sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save'))
Contributor

suggestion (non-blocking): IMO these criteria SQL values would be easier to read/edit if they were formatted on multiple lines in YAML block scalars, like:

Suggested change
-        sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save'))
+        sql: |
+          event_category = 'tabgroup'
+          AND (
+            (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%')
+            OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')
+          )

(if you accept this suggestion I'd recommend consistently using YAML block scalars for all criteria SQL values, not just the very long ones)

WITH eventsstream AS (
SELECT
DATE(MIN(submission_timestamp)) as submission_date,
DATE(MIN(submission_timestamp)) as event_first_seen_date,
Contributor

questions (non-blocking):

  • Why have submission_date and event_first_seen_date columns with the exact same value?
  • Why have these date columns when you have the more precise first_submission_timestamp column?

Comment on lines +45 to +47
`event`,
event_category,
event_name,
Contributor

question (blocking): Are you absolutely certain you want to group by the event/category/name?

Some of the configured criteria could match multiple types of events, so grouping by event/category/name means you could get multiple rows per client for those criteria (one for each distinct event type that matches the criteria's conditions).
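
To make the row-multiplication concern concrete, here is a small Python sketch with made-up rows (the client IDs and the `category.name` form of the `event` values are assumptions for illustration only). It counts the groups produced with and without `event` in the grouping key.

```python
# Made-up rows: (client_id, criteria, event).
# A single criterion such as smart_tabgroup_save can match several event types.
rows = [
    ("client-a", "smart_tabgroup_save", "tabgroup.smart_tab_suggest"),
    ("client-a", "smart_tabgroup_save", "tabgroup.smart_tab_topic"),
    ("client-b", "smart_tabgroup_save", "tabgroup.smart_tab_topic"),
]


def count_groups(rows, key):
    """Count the distinct groups produced by grouping rows on key(row)."""
    return len({key(row) for row in rows})


# Equivalent of GROUP BY client_id, criteria, event
with_event = count_groups(rows, lambda row: row)
# Equivalent of GROUP BY client_id, criteria
without_event = count_groups(rows, lambda row: row[:2])
```

Here `client-a` yields two rows for the same criterion when `event` is part of the key, but only one when grouping by client and criterion alone.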

Comment on lines +19 to +28
min_by(profile_group_id, submission_timestamp) AS profile_group_id,
min_by(sample_id, submission_timestamp) AS sample_id,
MIN(submission_timestamp) AS first_submission_timestamp,
MIN(event_timestamp) AS first_event_timestamp,
min_by(event_extra, submission_timestamp) AS event_extra,
min_by(app_version_major, submission_timestamp) AS app_version_major,
min_by(normalized_channel, submission_timestamp) AS normalized_channel,
min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
min_by(normalized_os, submission_timestamp) AS normalized_os,
min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
Contributor

issues (blocking):

  • Events submitted in the same events ping will all have the same submission_timestamp, so there could easily be ties based on submission_timestamp values and these MIN_BY() calls might not end up selecting the properties for the actual first such event.
  • Even if you also compare using event_timestamp to break submission_timestamp ties it's currently impossible to guarantee there won't be any ties, as events could also have the same event_timestamp (though my planned solution for DENG-9800 might eventually help with this). And if ties happen then the multiple MIN_BY() calls here aren't necessarily guaranteed to all end up getting their data from the same event record.

So I'd suggest using event_timestamp in the comparison and getting all the necessary data in a single aggregate call. Unfortunately, event_timestamp can have some wild values (e.g. far in the past or future) so a simple COALESCE(event_timestamp, submission_timestamp) probably isn't safe to use, and MIN_BY() doesn't allow comparing multiple values. However, you could do something like this:

  ARRAY_AGG(
    STRUCT(
      profile_group_id,
      sample_id,
      submission_timestamp AS first_submission_timestamp,
      event_timestamp AS first_event_timestamp,
      event_extra,
      app_version_major,
      normalized_channel,
      normalized_country_code,
      normalized_os,
      normalized_os_version
    )
    ORDER BY
      submission_timestamp,
      event_timestamp NULLS LAST
    LIMIT 1
  )[0].*
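
The same idea, sketched in Python under stated assumptions: pick one whole record with a single composite ordering, so every output column comes from the same event. The record layout and the `first_event` helper are hypothetical; the sort key mirrors `ORDER BY submission_timestamp, event_timestamp NULLS LAST LIMIT 1`.

```python
from datetime import datetime


def first_event(records):
    """Return the single earliest record, with NULL (None) event timestamps sorted last."""

    def sort_key(record):
        event_ts = record["event_timestamp"]
        # (event_ts is None) is False for real timestamps, so they sort before None
        return (record["submission_timestamp"], event_ts is None, event_ts or datetime.min)

    return min(records, key=sort_key)


# Three events from the same ping share one submission_timestamp.
ping_time = datetime(2025, 12, 1, 10, 0)
records = [
    {"submission_timestamp": ping_time, "event_timestamp": None, "event_extra": "c"},
    {"submission_timestamp": ping_time, "event_timestamp": datetime(2025, 12, 1, 9, 59), "event_extra": "b"},
    {"submission_timestamp": ping_time, "event_timestamp": datetime(2025, 12, 1, 9, 58), "event_extra": "a"},
]
earliest = first_event(records)
```

If two records tie on both timestamps, Python's `min` returns the first one in list order, whereas BigQuery may return either; in both cases all selected fields still come from one record, which separate `MIN_BY()` calls do not guarantee.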

Comment on lines +32 to +35
-- initialize by looking over all of history
DATE(submission_timestamp) >= '2023-01-01'
-- AND sample_id >= @sample_id
-- AND sample_id < @sample_id + @sampling_batch_size
Contributor

issue (possibly blocking): As it stands the init query for firefox_desktop_derived.events_first_seen_v1 is going to be scanning petabytes of data, so I suspect having the init query run per sample ID may be necessary to avoid having that take an exceedingly long time (blocking the artifact deployment process, and possibly even hitting the 6-hour query runtime limit).

@BenWu what do you think? Any advice on how to estimate how long such a huge query is likely to take?
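
For reference, a rough Python sketch of how the per-sample-ID batching could be parameterized, matching the commented-out `@sample_id` and `@sampling_batch_size` filters. The helper name `sample_id_batches` is hypothetical, and the sketch assumes `sample_id` ranges from 0 to 99; it says nothing about how long each batch would take to run.

```python
def sample_id_batches(batch_size, total=100):
    """Yield (sample_id, sampling_batch_size) pairs that cover sample IDs 0..total-1 once."""
    for start in range(0, total, batch_size):
        # The last batch is shortened so it never reaches past the final sample ID.
        yield start, min(batch_size, total - start)


# Each pair would be bound to @sample_id and @sampling_batch_size for one init query run.
batches = list(sample_id_batches(25))
```

Each init run would then scan only the clients in one batch, trading a single very large query for several smaller ones.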

DATE(submission_timestamp) >= '2023-01-01'
-- AND sample_id >= @sample_id
-- AND sample_id < @sample_id + @sampling_batch_size
AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
Contributor

suggestion (non-blocking): It'd be helpful to have a comment explaining why these particular event categories are being excluded.

-- AND sample_id >= @sample_id
-- AND sample_id < @sample_id + @sampling_batch_size
AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result')
-- if app_id is firefox_desktop, filter for where profile_group_id is not null
Contributor

suggestions (non-blocking):

  • This comment explains what is being done, but we could already tell that by looking at the code in question. Better would be having a comment that explains why this is only including Firefox Desktop events that have profile_group_id values.
  • In any case, the comment should go within the {% if app_id == 'firefox_desktop' -%} block (otherwise the generated ETL queries for other apps will confusingly contain the comment but not the associated SQL code).

min_by(normalized_channel, submission_timestamp) AS normalized_channel,
min_by(normalized_country_code, submission_timestamp) AS normalized_country_code,
min_by(normalized_os, submission_timestamp) AS normalized_os,
min_by(normalized_os_version, submission_timestamp) AS normalized_os_version
Contributor

suggestion (non-blocking): Apparently normalized_os_version ends up being 10.0 for both Windows 10 and 11 (though IMO that seems like a bug we should fix), so I'd suggest also including client_info.windows_build_number to allow differentiating between Windows 10 and 11 (e.g. DENG-8570).
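
A hedged Python sketch of how a consumer might use the build number once it is available. The `windows_release` helper is hypothetical, the `"Windows"` value for `normalized_os` is an assumption, and the 22000 threshold is based on the first public Windows 11 build; none of this is taken from the PR.

```python
# Assumed threshold: Windows 11 builds start at 22000; Windows 10 builds are lower.
WINDOWS_11_MIN_BUILD = 22000


def windows_release(normalized_os, normalized_os_version, windows_build_number):
    """Return 'Windows 10' or 'Windows 11' for OS version 10.0, or None when undeterminable."""
    if normalized_os != "Windows" or normalized_os_version != "10.0":
        return None
    if windows_build_number is None:
        return None
    if windows_build_number >= WINDOWS_11_MIN_BUILD:
        return "Windows 11"
    return "Windows 10"
```

Without the build number, both releases collapse into the same `normalized_os_version` value of `10.0`.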

Comment on lines +148 to +150
AND _current.event = _previous.event
AND (_current.criteria = _previous.criteria
OR (_current.criteria IS NULL AND _previous.criteria IS NULL))
Contributor

  • If you decide not to group by event/category/name then the event join condition here should be removed.
  • If you decide to avoid using null criteria name values then the criteria join condition here could be simplified.
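
A small Python model of why the extra `IS NULL` branch is needed today, using `None` to stand in for SQL NULL. The function names are hypothetical; `sql_equals` imitates SQL's three-valued `=`, and `null_safe_equals` imitates the join condition quoted above.

```python
def sql_equals(left, right):
    """Model SQL `=`: the result is NULL (None) when either operand is NULL."""
    if left is None or right is None:
        return None
    return left == right


def null_safe_equals(left, right):
    """Model `a = b OR (a IS NULL AND b IS NULL)`; a NULL result counts as no match."""
    return bool(sql_equals(left, right)) or (left is None and right is None)
```

With a non-null criteria name such as `any_event`, `sql_equals` alone is enough and the join condition reduces to a plain equality.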

Contributor

suggestions (non-blocking):

  • Formatting this file in a more bqetl format-like way would help with readability.
  • There's a fair amount of duplicated logic between the init and non-init parts of the SQL which would be good to avoid if possible.

Comment on lines +2 to +4
- mode: NULLABLE
name: submission_date
type: DATE
Contributor

nitpick: IMO it's good to keep the schema field attributes consistently in the order name, type, mode, description, fields for readability, rather than having the attributes be sorted alphabetically like this.
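
One possible way to apply that ordering mechanically, sketched in Python. The `reorder_field` helper is hypothetical and operates on already-parsed schema dictionaries; it is not an existing bqetl function.

```python
# Preferred attribute order from the comment above.
FIELD_ATTRIBUTE_ORDER = ("name", "type", "mode", "description", "fields")


def reorder_field(field):
    """Return a copy of a schema field dict with attributes in the preferred order."""
    ordered = {key: field[key] for key in FIELD_ATTRIBUTE_ORDER if key in field}
    # Keep any attributes outside the known list, after the known ones.
    ordered.update({key: value for key, value in field.items() if key not in ordered})
    if "fields" in ordered:
        # Nested RECORD fields get the same treatment.
        ordered["fields"] = [reorder_field(nested) for nested in ordered["fields"]]
    return ordered


reordered = reorder_field({"mode": "NULLABLE", "name": "submission_date", "type": "DATE"})
```

Because Python dictionaries preserve insertion order, dumping `reordered` back to YAML without key sorting would emit `name`, `type`, `mode` in that order.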

self.per_app_id_enabled = True
self.across_apps_enabled = False
self.cross_channel_template = "cross_channel_events_first_seen.view.sql"
self.base_table_name = "events_v1"
Contributor

suggestion (non-blocking): While this doesn't really matter since you're using an opt-in approach, it would be more accurate to define the base table as events_stream_v1.

Suggested change
-self.base_table_name = "events_v1"
+self.base_table_name = "events_stream_v1"

parallelism=parallelism,
id_token=id_token,
custom_render_kwargs={
"app_id": app_id_info["bq_dataset_family"],
Contributor

suggestion (non-blocking): IMO calling this value app_id is a bit confusing/misleading, so I'd suggest changing this to something like app_id_dataset (and updating the associated Jinja).


8 participants