Skip to content

Reformat metric labels to consistently use snake case#4893

Open
JamesMurkin wants to merge 4 commits into
masterfrom
reformat_metrics
Open

Reformat metric labels to consistently use snake case#4893
JamesMurkin wants to merge 4 commits into
masterfrom
reformat_metrics

Conversation

@JamesMurkin

Copy link
Copy Markdown
Contributor

Add snakecase equivalent of all labels to our metrics, with the longer term goal of moving wholesale to snake case labels

This will take 2 steps:

  • Add snakecase labels equivalent to all metrics (this PR)
  • Remove all non-snake case labels
    • This will not happen for a while, to allow for various people to migrate to the new labels

@JamesMurkin JamesMurkin marked this pull request as ready for review April 30, 2026 14:36
@greptile-apps

greptile-apps Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds snake_case label equivalents alongside existing camelCase labels across Armada's scheduler, executor, and ingest metric descriptors — the first step of a two-phase migration toward consistent snake_case Prometheus labels. No old labels are removed, so existing dashboards and alert rules are unaffected during the transition window.

  • Descriptor + constructor parity: Every changed prometheus.NewDesc/NewCounterVec/NewGaugeVec definition has a matching update to its corresponding MustNewConstMetric/WithLabelValues call site, so label counts stay aligned throughout.
  • Naming choices: resourceType maps to resource (not resource_type), consistent with the pre-existing resourceLabel = "resource" constant; all other conversions follow direct camelCase→snake_case (e.g. priorityClasspriority_class, priorStateprior_state).
  • Bonus test fix: state_metrics_test.go corrects a pre-existing semantic error where jobErrorsByQueue and jobErrorsByNode were exercised with state/priorState label values instead of the correct category/subcategory values — a fix made necessary by the label-count expansion.

Confidence Score: 5/5

Safe to merge — the change is purely additive, no existing labels are removed, and label-count alignment between descriptors and call sites is consistent across all eight files.

Every descriptor addition has a correctly matching call-site update, the test suite is updated (and incidentally corrected) to reflect the new label cardinality, and no old labels are dropped, preserving backward compatibility for running dashboards.

No files require special attention; the mechanical repetitiveness of the change has been applied uniformly.

Important Files Changed

Filename Overview
internal/common/metrics/scheduler_metrics.go Largest change: adds snake_case equivalents (priority_class, price_band, resource, node_type, set_by_user) to ~30 metric descriptors and their corresponding constructor functions; label counts are consistent throughout.
internal/common/ingest/metrics/metrics.go Adds event_type and msg_type snake_case label equivalents to eventsProcessed counter vector; both RecordEventSequenceProcessed and RecordControlPlaneEventProcessed correctly supply values for all five labels.
internal/executor/metrics/pod_metrics/cluster_context.go Introduces resourceTypeLabelSnake ("resource") and nodeTypeLabelSnake ("node_type") constants and applies them to all five executor metric descriptors; Collect() call sites are updated to duplicate the values correctly.
internal/scheduler/metrics/constants.go Adds three snake_case label constants: nodeTypeLabelSnake, priorStateLabelSnake, overAllocatedLabelSnake; no conflicts with existing constants.
internal/scheduler/metrics/cycle_metrics.go Extends nodeLabels slice with node_type and over_allocated snake equivalents, and adds node_type/is_preemptible to the nodePreemptibility GaugeVec; all WithLabelValues call sites are updated with the correct duplicate values.
internal/scheduler/metrics/state_metrics.go Appends priorStateLabelSnake to six counter vector definitions and passes priorState twice in each WithLabelValues call to supply both camelCase and snake_case values.
internal/scheduler/metrics/cycle_metrics_test.go Updates nodeResourceLabelValues to include the new duplicated node_type and over_allocated values; tests correctly reflect the expanded label schema.
internal/scheduler/metrics/state_metrics_test.go Updates all label tuple arrays for the new prior_state column and separately introduces byQueueErrorLabels/byNodeErrorLabels to fix a pre-existing semantic mismatch where jobErrorsByQueue was being exercised with state/priorState values instead of category/subcategory values.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Metric Descriptor Definition\ne.g. prometheus.NewDesc / NewCounterVec] --> B[Label slice\ncamelCase labels kept\nsnake_case labels appended]
    B --> C[Constructor / WithLabelValues call\noriginal value passed twice:\nonce for camelCase label\nonce for snake_case label]
    C --> D[Prometheus time series\ncarries both label sets]
    D --> E1[Existing dashboards\nread camelCase labels\nunchanged]
    D --> E2[New dashboards / alerts\ncan use snake_case labels]
    E1 & E2 --> F[Phase 2 - future PR\nRemove camelCase labels]
Loading

Reviews (3): Last reviewed commit: "Merge branch 'master' into reformat_metr..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant