-
Notifications
You must be signed in to change notification settings - Fork 124
DENG-8767: Create events first seen table #8046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
I think CI might be failing because in the generated metadata.yaml file the scheduling:
dag_name: bqetl_glean_usage
task_group: { { app_name } }
date_partition_parameter: null |
sql_generators/glean_usage/templates/events_first_seen_v1.metadata.yaml
Outdated
Show resolved
Hide resolved
…data.yaml Co-authored-by: Anna Scholtz <[email protected]>
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
…create-events-first-seen-table
…:mozilla/bigquery-etl into DENG-8767-create-events-first-seen-table
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Integration report for "Merge branch 'main' into DENG-8767-create-events-first-seen-table"
|
| apps: | ||
| firefox_desktop: | ||
| criteria: | ||
| - name: CAST(NULL as string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestions (non-blocking):
- I'd advise against using null as a special criteria name value here because nulls are harder to work with and easy to misinterpret. Instead I'd suggest using some more meaningful name like
any_event(or even a less scrutable special value like*would be better than null IMO). - I'd also advise against using raw SQL syntax for the criteria names, as that's making it difficult to configure things correctly (as evidenced by the comment above about how to YAML-encode the SQL strings for criteria names, which IMO ideally shouldn't be necessary).
- Even if you decide you want to stick with using null as a special criteria name value, you could encode that as a YAML
nullliteral here and then do something like{{ ("'" ~ item["name"] ~ "'") if item["name"] else "CAST(NULL AS STRING)" }} AS criteriain the ETL query template.
- Even if you decide you want to stick with using null as a special criteria name value, you could encode that as a YAML
And if you accept both of those suggestions you could also encode the criteria in a more natural way as a dictionary, like:
apps:
<app name>:
criteria:
<criteria name>: |
<criteria SQL>
(in which case you'd want to loop over the criteria in the ETL query template like {% for criteria_name, criteria_sql in criteria.items() %})
| - name: "'chatbot_usage'" | ||
| sql: event_category = 'genai.chatbot' AND event_name = 'sidebar_toggle' AND JSON_VALUE(event_extra.provider) <> 'none' | ||
| - name: "'smart_tabgroup_save'" | ||
| sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): IMO these criteria SQL values would be easier to read/edit if they were formatted on multiple lines in YAML block scalars, like:
| sql: event_category = 'tabgroup' AND ((event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save')) | |
| sql: | | |
| event_category = 'tabgroup' | |
| AND ( | |
| (event_name = 'smart_tab_suggest' AND JSON_VALUE(event_extra.action) LIKE 'save%') | |
| OR (event_name = 'smart_tab_topic' AND JSON_VALUE(event_extra.action) = 'save') | |
| ) |
(if you accept this suggestion I'd recommend consistently using YAML block scalars for all criteria SQL values, not just the very long ones)
| WITH eventsstream AS ( | ||
| SELECT | ||
| DATE(MIN(submission_timestamp)) as submission_date, | ||
| DATE(MIN(submission_timestamp)) as event_first_seen_date, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
questions (non-blocking):
- Why have
submission_dateandevent_first_seen_datecolumns with the exact same value? - Why have these date columns when you have the more precise
first_submission_timestampcolumn?
| `event`, | ||
| event_category, | ||
| event_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (blocking): Are you absolutely certain you want to group by the event/category/name?
Some of the configured criteria could match multiple types of events, so grouping by event/category/name means you could get multiple rows per client for those criteria (one for each distinct event type that matches the criteria's conditions).
| min_by(profile_group_id, submission_timestamp) AS profile_group_id, | ||
| min_by(sample_id, submission_timestamp) AS sample_id, | ||
| MIN(submission_timestamp) AS first_submission_timestamp, | ||
| MIN(event_timestamp) AS first_event_timestamp, | ||
| min_by(event_extra, submission_timestamp) AS event_extra, | ||
| min_by(app_version_major, submission_timestamp) AS app_version_major, | ||
| min_by(normalized_channel, submission_timestamp) AS normalized_channel, | ||
| min_by(normalized_country_code, submission_timestamp) AS normalized_country_code, | ||
| min_by(normalized_os, submission_timestamp) AS normalized_os, | ||
| min_by(normalized_os_version, submission_timestamp) AS normalized_os_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issues (blocking):
- Events submitted in the same
eventsping will all have the samesubmission_timestamp, so there could easily be ties based onsubmission_timestampvalues and theseMIN_BY()calls might not end up selecting the properties for the actual first such event. - Even if you also compare using
event_timestampto breaksubmission_timestampties it's currently impossible to guarantee there won't be any ties, as events could also have the sameevent_timestamp(though my planned solution for DENG-9800 might eventually help with this). And if ties happen then the multipleMIN_BY()calls here aren't necessarily guaranteed to all end up getting their data from the same event record.
So I'd suggest using event_timestamp in the comparison and getting all the necessary data in a single aggregate call. Unfortunately, event_timestamp can have some wild values (e.g. far in the past or future) so a simple COALESCE(event_timestamp, submission_timestamp) probably isn't safe to use, and MIN_BY() doesn't allow comparing multiple values. However, you could do something like this:
ARRAY_AGG(
STRUCT(
profile_group_id,
sample_id,
submission_timestamp AS first_submission_timestamp,
event_timestamp AS first_event_timestamp,
event_extra,
app_version_major,
normalized_channel,
normalized_country_code,
normalized_os,
normalized_os_version
)
ORDER BY
submission_timestamp,
event_timestamp NULLS LAST
LIMIT 1
)[0].*| -- initialize by looking over all of history | ||
| DATE(submission_timestamp) >= '2023-01-01' | ||
| -- AND sample_id >= @sample_id | ||
| -- AND sample_id < @sample_id + @sampling_batch_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (possibly blocking): As it stands the init query for firefox_desktop_derived.events_first_seen_v1 is going to be scanning petabytes of data, so I suspect having the init query run per sample ID may be necessary to avoid having that take an exceedingly long time (blocking the artifact deployment process, and possibly even hitting the 6-hour query runtime limit).
@BenWu what do you think? Any advice on how to estimate how long such a huge query is likely to take?
| DATE(submission_timestamp) >= '2023-01-01' | ||
| -- AND sample_id >= @sample_id | ||
| -- AND sample_id < @sample_id + @sampling_batch_size | ||
| AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): It'd be helpful to have a comment explaining why these particular event categories are being excluded.
| -- AND sample_id >= @sample_id | ||
| -- AND sample_id < @sample_id + @sampling_batch_size | ||
| AND event_category NOT IN ('media.playback', 'nimbus_events', 'uptake.remotecontent.result') | ||
| -- if app_id is firefox_desktop, filter for where profile_group_id is not null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestions (non-blocking):
- This comment explains what is being done, but we could already tell that by looking at the code in question. Better would be having a comment that explains why this is only including Firefox Desktop events that have
profile_group_idvalues. - In any case, the comment should go within the
{% if app_id == 'firefox_desktop' -%}block (otherwise the generated ETL queries for other apps will confusingly contain the comment but not the associated SQL code).
| min_by(normalized_channel, submission_timestamp) AS normalized_channel, | ||
| min_by(normalized_country_code, submission_timestamp) AS normalized_country_code, | ||
| min_by(normalized_os, submission_timestamp) AS normalized_os, | ||
| min_by(normalized_os_version, submission_timestamp) AS normalized_os_version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): Apparently normalized_os_version ends up being 10.0 for both Windows 10 and 11 (though IMO that seems like a bug we should fix), so I'd suggest also including client_info.windows_build_number to allow differentiating between Windows 10 and 11 (e.g. DENG-8570).
| AND _current.event = _previous.event | ||
| AND (_current.criteria = _previous.criteria | ||
| OR (_current.criteria IS NULL AND _previous.criteria IS NULL)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- If you decide not to group by event/category/name then the
eventjoin condition here should be removed. - If you decide to avoid using null criteria name values then the
criteriajoin condition here could be simplified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestions (non-blocking):
- Formatting this file in a more
bqetl format-like way would help with readability. - There's a fair amount of duplicated logic between the init and non-init parts of the SQL which would be good to avoid if possible.
| - mode: NULLABLE | ||
| name: submission_date | ||
| type: DATE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: IMO it's good to keep the schema field attributes consistently in the order name, type, mode, description, fields for readability, rather than having the attributes be sorted alphabetically like this.
| self.per_app_id_enabled = True | ||
| self.across_apps_enabled = False | ||
| self.cross_channel_template = "cross_channel_events_first_seen.view.sql" | ||
| self.base_table_name = "events_v1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): While this doesn't really matter since you're using an opt-in approach, it would be more accurate to define the base table as events_stream_v1.
| self.base_table_name = "events_v1" | |
| self.base_table_name = "events_stream_v1" |
| parallelism=parallelism, | ||
| id_token=id_token, | ||
| custom_render_kwargs={ | ||
| "app_id": app_id_info["bq_dataset_family"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (non-blocking): IMO calling this value app_id is a bit confusing/misleading, so I'd suggest changing this to something like app_id_dataset (and updating the associated Jinja).
Description
This PR adds Events First Seen into the Glean Usage SQL generator.
Related Tickets & Documents
Reviewer, please follow this checklist