Skip to content

Conversation

@carlotaarvela
Copy link

We want a pipeline that schedules test runs and record benchmarking metrics.

Added cilium-hyperion-project topology
Update slo framework to include allow NO_OF_NAMESPACES param so it is customizable for test runs and added NO_OF_NAMESPACES and SMALL_GROUP_SIZE(number of replicas per deployment) to be included on the execute step
Add cilium-project-hyperion-baseline.yml pipeline and load-config
Pipeline

@carlotaarvela carlotaarvela marked this pull request as ready for review May 19, 2025 16:20
@carlotaarvela carlotaarvela requested a review from agrawaliti May 19, 2025 16:20
run_id=$(Build.BuildId)-$(System.JobId)
echo "Run ID: $run_id"
echo "##vso[task.setvariable variable=RUN_ID]$run_id"
displayName: "Set unique Run ID before publish"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sumanthreddy29 This pipeline is to be run on a warm cluster. Removing this step will cause an error by creating a blob a non unique name

@@ -0,0 +1,111 @@
name: load-config

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename it to something else, as we already have a load-config file in slo.
in pipeline we pass " cl2_config_file: .yaml" so this can create some issue.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other frameworks have a load-config file, so I think we should follow the convention


# Config options for test parameters
{{$nodesPerNamespace := DefaultParam .CL2_NODES_PER_NAMESPACE 100}}
{{$podsPerNode := DefaultParam .CL2_PODS_PER_NODE 50}}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please change it to 40, i know you are passing it from pipeline too, but all our testing is with 40 pods per node, so keeping that default for your configs that make more sense

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed this param

# Feature gates
{{$podStartupLatencyThreshold := DefaultParam .CL2_POD_STARTUP_LATENCY_THRESHOLD "15s"}}
{{$ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS_SIMPLE := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_API_CALL_PROMETHEUS_SIMPLE true}}
{{$PROMETHEUS_SCRAPE_KUBE_PROXY := DefaultParam .PROMETHEUS_SCRAPE_KUBE_PROXY false}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this parameter used in any test case? If not, please delete it and the corresponding if/else block below that relies on it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

{{$ENABLE_VIOLATIONS_FOR_NETWORK_PROGRAMMING_LATENCIES := DefaultParam .CL2_ENABLE_VIOLATIONS_FOR_NETWORK_PROGRAMMING_LATENCIES false}}
{{$NETWORK_LATENCY_THRESHOLD := DefaultParam .CL2_NETWORK_LATENCY_THRESHOLD "0s"}}
{{$PROBE_MEASUREMENTS_PING_SLEEP_DURATION := DefaultParam .CL2_PROBE_MEASUREMENTS_PING_SLEEP_DURATION "1s"}}
{{$ENABLE_IN_CLUSTER_NETWORK_LATENCY := DefaultParam .CL2_ENABLE_IN_CLUSTER_NETWORK_LATENCY true}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this parameter. If it's not used, delete it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


# Probe measurements shared parameter
{{$PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT := DefaultParam .CL2_PROBE_MEASUREMENTS_CHECK_PROBES_READY_TIMEOUT "15m"}}
{{$ENABLE_TERMINATED_WATCHES_MEASUREMENT := DefaultParam .CL2_ENABLE_TERMINATED_WATCHES_MEASUREMENT true}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is true by default and is not set anywhere in the pipeline. Can it be deleted and the corresponding measurement be added to file without if/else block?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@pavneeta
Copy link

pavneeta commented Jun 9, 2025

@carlotaarvela is this PR ready for review ? CC: @alexcastilio @agrawaliti @sumanthreddy29

@carlotaarvela carlotaarvela force-pushed the carlota/baseline-pipeline branch from dae5f03 to c1589dc Compare June 10, 2025 11:47
@carlotaarvela
Copy link
Author

@pavneeta Yes, this PR is ready for review
I have discussed the changes requested by Iti and Alex will do a final review today or tomorrow

Comment on lines +39 to +40
- name: ENV_VAR
value: a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this being used?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

unit: s
queries:
- name: Perc99
query: histogram_quantile(0.99, sum(rate(cilium_service_implementation_delay_bucket[%v:])) by (le))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you checked if sum(rate(cilium_service_implementation_delay_bucket[%v:])) by (le) is really what you want? Perhaps you want sum_over_time(rate(<metric>)[<some interval to get rate>])[%v:]), similar to CPU and Memory metrics?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree that sum_over_time would be useful to identify trends, however the metrics that you pointed out were added by the Cilium team 3+ months ago to benchmark the cilium performance on the slo framework - I just replicated them here to separate the logic for these tests.
Since we have been comparing the metrics collected in this test with cilium metrics framework, I think we should be collecting metrics the same way, so we are comparing apples with apples.
Besides that, we want to establish slos for these metrics.

Comment on lines +57 to +61
query: histogram_quantile(0.99, sum(rate(cilium_policy_implementation_delay_bucket[%v:])) by (le))
- name: Perc95
query: histogram_quantile(0.95, sum(rate(cilium_policy_implementation_delay_bucket[%v:])) by (le))
- name: Perc50
query: histogram_quantile(0.50, sum(rate(cilium_policy_implementation_delay_bucket[%v:])) by (le))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here and for other metrics

- resource
queries:
- name: Terminated watches
query: sum(increase(apiserver_terminated_watchers_total[%v:])) by (resource)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question about metrics here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants