@LiamMcFall

Description

This PR creates the report_content table on top of the underlying live tables. It gives the content team same-day reporting on the content that users are reporting, so that the content can be removed if necessary.

This job was configured following the example in this model: the table is partitioned by day, and each day's partition is rerun hourly to pull in any new records.
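
A minimal sketch of how the hourly schedule maps onto daily partitions, assuming the one-hour offset used in the `destination_table` template of the generated DAG. `partition_for_run` is a hypothetical helper for illustration only, not part of bigquery-etl; it reproduces the Jinja expression `(execution_date - macros.timedelta(hours=1)).strftime("%Y%m%d")` in plain Python, and the sample run times are made up.

```python
import datetime


def partition_for_run(execution_date: datetime.datetime) -> str:
    """Return the YYYYMMDD partition an hourly run writes to.

    Hypothetical helper: shifts the run back one hour, as the DAG template
    does, so the midnight run still targets the previous day.
    """
    return (execution_date - datetime.timedelta(hours=1)).strftime("%Y%m%d")


# Illustrative hourly execution dates around a day boundary.
runs = [
    datetime.datetime(2025, 12, 5, 23, 0),
    datetime.datetime(2025, 12, 6, 0, 0),
    datetime.datetime(2025, 12, 6, 1, 0),
]
partitions = [partition_for_run(run) for run in runs]
```

Under this mapping the 23:00 and 00:00 runs both rewrite the 20251205 partition, and the 01:00 run is the first one to write 20251206.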

Related Tickets & Documents

Reviewer, please follow this checklist

@LiamMcFall changed the title from "Newtab content hourly" to "feat: newtab-content hourly reporting for reported content" on Dec 3, 2025
Should there be a corresponding view?

@kik-kik requested a review from sean-rose on December 4, 2025 at 16:05
@dataops-ci-bot
Integration report for "rename model, create view, update dag, remove newtab ping logic"

sql.diff

Only in /tmp/workspace/generated-sql/dags/: bqetl_newtab_hourly.py
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/dags/bqetl_newtab_hourly.py /tmp/workspace/generated-sql/dags/bqetl_newtab_hourly.py
--- /tmp/workspace/main-generated-sql/dags/bqetl_newtab_hourly.py	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/dags/bqetl_newtab_hourly.py	2025-12-05 22:45:03.000000000 +0000
@@ -0,0 +1,67 @@
+# Generated via https://github.com/mozilla/bigquery-etl/blob/main/bigquery_etl/query_scheduling/generate_airflow_dags.py
+
+from airflow import DAG
+from airflow.sensors.external_task import ExternalTaskMarker
+from airflow.sensors.external_task import ExternalTaskSensor
+from airflow.utils.task_group import TaskGroup
+import datetime
+from operators.gcp_container_operator import GKEPodOperator
+from utils.constants import ALLOWED_STATES, FAILED_STATES
+from utils.gcp import bigquery_etl_query, bigquery_dq_check, bigquery_bigeye_check
+
+docs = """
+### bqetl_newtab_hourly
+
+Built from bigquery-etl repo, [`dags/bqetl_newtab_hourly.py`](https://github.com/mozilla/bigquery-etl/blob/generated-sql/dags/bqetl_newtab_hourly.py)
+
+#### Description
+
+Hourly tables for content reporting
+#### Owner
+
+[email protected]
+
+#### Tags
+
+* impact/tier_2
+* repo/bigquery-etl
+"""
+
+
+default_args = {
+    "owner": "[email protected]",
+    "start_date": datetime.datetime(2025, 12, 5, 0, 0),
+    "end_date": None,
+    "email": [],
+    "depends_on_past": False,
+    "retry_delay": datetime.timedelta(seconds=600),
+    "email_on_failure": True,
+    "email_on_retry": False,
+    "retries": 1,
+    "max_active_tis_per_dag": None,
+}
+
+tags = ["impact/tier_2", "repo/bigquery-etl"]
+
+with DAG(
+    "bqetl_newtab_hourly",
+    default_args=default_args,
+    schedule_interval="@hourly",
+    doc_md=docs,
+    tags=tags,
+    catchup=True,
+) as dag:
+
+    firefox_desktop_derived__newtab_content_reported_content__v1 = bigquery_etl_query(
+        task_id="firefox_desktop_derived__newtab_content_reported_content__v1",
+        destination_table='report_content_live_v1${{ (execution_date - macros.timedelta(hours=1)).strftime("%Y%m%d") }}',
+        dataset_id="firefox_desktop_derived",
+        project_id="moz-fx-data-shared-prod",
+        owner="[email protected]",
+        email=["[email protected]"],
+        date_partition_parameter=None,
+        depends_on_past=False,
+        parameters=[
+            "submission_date:DATE:{{ (execution_date - macros.timedelta(hours=1)).strftime('%Y-%m-%d') }}"
+        ],
+    )
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop: newtab_content_reported_content
Only in /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived: newtab_content_reported_content_v1
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/metadata.yaml	2025-12-05 22:40:10.000000000 +0000
@@ -0,0 +1,19 @@
+friendly_name: Newtab-Content Reported Content
+description: |-
+  This model will be used by the content team to monitor organic content that users have reported on an hourly basis.
+  This will help assist the content team in identifying and removing problematic content faster. Data from the same day
+  will be reprocessed every hour until 1:00am the following day to allow for late arriving data. Therefore, data for that
+  day will first be available at 2:00am.
+owners:
+- [email protected]
+labels:
+  owner1: lmcfall
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  view.sql:
+  - moz-fx-data-shared-prod.firefox_desktop_derived.newtab_content_reported_content_v1
+require_column_descriptions: false
+level: null
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/view.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/view.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/view.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop/newtab_content_reported_content/view.sql	2025-12-05 22:38:42.000000000 +0000
@@ -0,0 +1,7 @@
+CREATE OR REPLACE VIEW
+  `moz-fx-data-shared-prod.firefox_desktop.newtab_content_reported_content`
+AS
+SELECT
+  *
+FROM
+  `moz-fx-data-shared-prod.firefox_desktop_derived.newtab_content_reported_content_v1`
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/metadata.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/metadata.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/metadata.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/metadata.yaml	2025-12-05 22:40:12.000000000 +0000
@@ -0,0 +1,38 @@
+friendly_name: Newtab-Content Reported Content
+description: |-
+  This table will be used by the content team to monitor organic content that users have reported on an hourly basis.
+  This will help assist the content team in identifying and removing problematic content faster. Data from the same day
+  will be reprocessed every hour until 1:00am the following day to allow for late arriving data. Therefore, data for that
+  day will first be available at 2:00am.
+owners:
+- [email protected]
+labels:
+  incremental: true
+  schedule: hourly
+  dag: bqetl_newtab_hourly
+  owner1: lmcfall
+scheduling:
+  dag_name: bqetl_newtab_hourly
+  date_partition_parameter: null
+  destination_table: report_content_live_v1${{ (execution_date - macros.timedelta(hours=1)).strftime("%Y%m%d")
+    }}
+  parameters:
+  - submission_date:DATE:{{ (execution_date - macros.timedelta(hours=1)).strftime('%Y-%m-%d')
+    }}
+bigquery:
+  time_partitioning:
+    type: day
+    field: submission_date
+    require_partition_filter: false
+    expiration_days: null
+  range_partitioning: null
+  clustering: null
+workgroup_access:
+- role: roles/bigquery.dataViewer
+  members:
+  - workgroup:mozilla-confidential
+references:
+  query.sql:
+  - moz-fx-data-shared-prod.firefox_desktop_live.newtab_content_v1
+require_column_descriptions: false
+level: null
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/query.sql /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/query.sql
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/query.sql	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/query.sql	2025-12-05 22:38:42.000000000 +0000
@@ -0,0 +1,40 @@
+WITH newtab_content_live_deduped AS (
+  SELECT
+    *
+  FROM
+    `moz-fx-data-shared-prod.firefox_desktop_live.newtab_content_v1`
+  WHERE
+    DATE(submission_timestamp) = @submission_date
+  QUALIFY
+    ROW_NUMBER() OVER (
+      PARTITION BY
+        DATE(submission_timestamp),
+        document_id
+      ORDER BY
+        submission_timestamp
+    ) = 1
+),
+newtab_content_live_events AS (
+  SELECT
+    DATE(submission_timestamp) AS submission_date,
+    mozfun.map.get_key(event.extra, 'card_type') AS card_type,
+    mozfun.map.get_key(event.extra, 'corpus_item_id') AS corpus_item_id,
+    mozfun.map.get_key(event.extra, 'report_reason') AS report_reason,
+    mozfun.map.get_key(event.extra, 'section') AS section,
+    mozfun.map.get_key(event.extra, 'section_position') AS section_position,
+    mozfun.map.get_key(event.extra, 'title') AS title,
+    mozfun.map.get_key(event.extra, 'topic') AS topic,
+    mozfun.map.get_key(event.extra, 'url') AS url
+  FROM
+    newtab_content_live_deduped AS e
+  CROSS JOIN
+    UNNEST(e.events) AS event
+  WHERE
+    DATE(submission_timestamp) = @submission_date
+    AND event.category = 'newtab_content'
+    AND event.name = 'report_content_submit'
+)
+SELECT
+  *
+FROM
+  newtab_content_live_events
diff -bur --no-dereference --new-file /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/schema.yaml /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/schema.yaml
--- /tmp/workspace/main-generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/schema.yaml	1970-01-01 00:00:00.000000000 +0000
+++ /tmp/workspace/generated-sql/sql/moz-fx-data-shared-prod/firefox_desktop_derived/newtab_content_reported_content_v1/schema.yaml	2025-12-05 22:38:42.000000000 +0000
@@ -0,0 +1,37 @@
+fields:
+- name: submission_date
+  type: DATE
+  mode: NULLABLE
+  description: Day the event was received in the newtab content ping
+- name: card_type
+  type: STRING
+  mode: NULLABLE
+  description: The type of the content card (e.g., "spoc", "organic")
+- name: corpus_item_id
+  type: STRING
+  mode: NULLABLE
+  description: content identifier
+- name: report_reason
+  type: STRING
+  mode: NULLABLE
+  description: The reason selected by the user when reporting the content
+- name: section
+  type: STRING
+  mode: NULLABLE
+  description: If click belongs in a section, the name of the section
+- name: section_position
+  type: STRING
+  mode: NULLABLE
+  description: If click belongs in a section, the numeric position of the section
+- name: title
+  type: STRING
+  mode: NULLABLE
+  description: Title of the recommendation.
+- name: topic
+  type: STRING
+  mode: NULLABLE
+  description: The topic of the recommendation. Like "entertainment".
+- name: url
+  type: STRING
+  mode: NULLABLE
+  description: URL of the recommendation.
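
For readers less familiar with BigQuery SQL, here is a rough in-memory sketch of what the `query.sql` in this diff appears to do: keep the earliest ping per `document_id` for the target day, then emit one row per `report_content_submit` event. It is an illustration under assumptions, not the production logic: `reported_content` is a hypothetical function, `extra` is modeled as a plain dict rather than a key/value array, only two of the output columns are shown, and the sample pings (including the `click` event and the `spam` reason) are invented.

```python
from datetime import date, datetime


def reported_content(pings, submission_date):
    """Hypothetical in-memory analogue of the dedup + UNNEST + filter steps."""
    # Step 1: keep the earliest ping per document_id on the target day.
    earliest = {}
    for ping in pings:
        ts = ping["submission_timestamp"]
        if ts.date() != submission_date:
            continue
        doc_id = ping["document_id"]
        if doc_id not in earliest or ts < earliest[doc_id]["submission_timestamp"]:
            earliest[doc_id] = ping

    # Step 2: flatten the events and keep only content reports.
    rows = []
    for ping in earliest.values():
        for event in ping["events"]:
            if (
                event["category"] == "newtab_content"
                and event["name"] == "report_content_submit"
            ):
                rows.append(
                    {
                        "submission_date": ping["submission_timestamp"].date(),
                        "corpus_item_id": event["extra"].get("corpus_item_id"),
                        "report_reason": event["extra"].get("report_reason"),
                    }
                )
    return rows


# Invented sample data: one duplicate ping and one ping from another day.
report = {
    "category": "newtab_content",
    "name": "report_content_submit",
    "extra": {"corpus_item_id": "item-1", "report_reason": "spam"},
}
click = {"category": "newtab_content", "name": "click", "extra": {}}
pings = [
    {"document_id": "a", "submission_timestamp": datetime(2025, 12, 5, 10, 0), "events": [report, click]},
    {"document_id": "a", "submission_timestamp": datetime(2025, 12, 5, 10, 5), "events": [report]},
    {"document_id": "b", "submission_timestamp": datetime(2025, 12, 4, 9, 0), "events": [report]},
]
rows = reported_content(pings, date(2025, 12, 5))
```

With this sample input the duplicate ping and the ping from the previous day are dropped, so `rows` holds a single report for `item-1`.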

@sean-rose left a comment

r+wc

A couple notes about deploying this:

  • New DAGs are added as paused by default and need to be manually enabled in the Airflow UI.
  • When adding ETLs to DAGs that run frequently, the new ETL is likely to run before the artifact deployment process has properly created the table. The ETL then creates the table itself without the appropriate settings, such as partitioning, and the artifact deployment process subsequently fails because BigQuery doesn't allow changing the partitioning of an existing table (it has to be dropped and recreated). The easiest way to avoid that is to pause the DAG (this works best with catchup enabled), wait for artifact deployment to create the table, and then re-enable the DAG. So in this case I'd recommend simply not enabling the DAG in the first place until artifact deployment has properly created the table.

# We reprocess the same day every hour up until 1:00 the following day, to give
# the live data time to come in
destination_table: >-
report_content_live_v1${{
suggestion (blocking):

Suggested change
- report_content_live_v1${{
+ newtab_content_reported_content_v1${{
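
A small sketch of the kind of check that would catch this mismatch, assuming the destination table name (the part before the `$` partition decorator) is expected to equal the table's directory name. `destination_matches_table` is a hypothetical helper written for this comment, not an existing bigquery-etl function.

```python
def destination_matches_table(destination_table: str, table_dir: str) -> bool:
    """True when the name before the `$` decorator equals the table directory.

    Hypothetical lint-style check; the rule it encodes is an assumption.
    """
    base_name = destination_table.split("$", 1)[0].strip()
    return base_name == table_dir


table_dir = "newtab_content_reported_content_v1"
template = '${{ (execution_date - macros.timedelta(hours=1)).strftime("%Y%m%d") }}'

before = "report_content_live_v1" + template  # value in the current diff
after = "newtab_content_reported_content_v1" + template  # suggested value

before_ok = destination_matches_table(before, table_dir)
after_ok = destination_matches_table(after, table_dir)
```

Here `before_ok` is false and `after_ok` is true, which is the difference the suggested change is meant to fix.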

Comment on lines +28 to +31
FROM
newtab_content_live_deduped AS e
CROSS JOIN
UNNEST(e.events) AS event
suggestion (non-blocking): `e` doesn't make sense to me as an alias here. Since the alias isn't really used except in the cross join, I'd suggest simply removing it.

Suggested change
- FROM
- newtab_content_live_deduped AS e
- CROSS JOIN
- UNNEST(e.events) AS event
+ FROM
+ newtab_content_live_deduped
+ CROSS JOIN
+ UNNEST(events) AS event

bqetl_newtab_hourly:
default_args:
depends_on_past: false
email_on_failure: true
suggestion (blocking): It's recommended to have emails go to [email protected].

Suggested change
- email_on_failure: true
+ email:
+ - [email protected]
+ - [email protected]
+ email_on_failure: true

5 participants