chore: make Windows smoke block resilient to vault + setup failures #3344

Open
Daniel Ayaz (danielayaz) wants to merge 1 commit into main from fix-windows-smoke-vault

@danielayaz
Member

Summary

Follow-up to #3342. The first scheduled run of the smoke pipeline against main had:

  • linux/arm64 — passed cleanly, metric emitted
  • windows/amd64 — failed at vault kv get -field=CONFLUENT_CLOUD_EMAIL … with Field "CONFLUENT_CLOUD_EMAIL" not present in secret. Because the failure was on a regular YAML command line, the job aborted before reaching the metric emitter, so the windows panel showed "No data" instead of 0.

This PR fixes both issues.

Changes (all in .semaphore/smoke-tests.yml, windows/amd64 block only)

1. Dynamic vault field handling

Replaced 3 hardcoded vault kv get -field=… calls with a Set-VaultFields PowerShell helper that:

  • Pulls each secret with vault kv get -format=json
  • Logs every field name it finds — so the next run will print e.g. Fields in v1/ci/kv/apif/cli/live-testing-data: email, password and we'll know exactly what's there
  • Exports each field under both its original name AND an UPPER_SNAKE_CASE variant, so one pass covers the common Vault → env-var naming conventions (email / EMAIL, confluent-cloud-email / CONFLUENT_CLOUD_EMAIL, confluent_cloud_email, etc.)

This mirrors what vault-sem-get-secret does on Linux, which is why Linux works without specifying field names. vault-sem-get-secret ships with the Semaphore Linux agent toolbox and is not available on Windows agents.
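
For reviewers, here is a minimal sketch of what a helper like Set-VaultFields could look like. It is an illustration under stated assumptions, not the code in .semaphore/smoke-tests.yml: ConvertTo-UpperSnakeCase and Get-VaultFieldObject are hypothetical names, and the KV v1 / KV v2 unwrapping is an assumption about the JSON shape that vault kv get -format=json returns. The log line prints field names only, never values.

```powershell
# Hypothetical sketch; names and structure are assumptions, not the PR's code.

function ConvertTo-UpperSnakeCase {
    param([Parameter(Mandatory)][string]$Name)
    # Collapse each run of non-alphanumeric characters into one underscore, then uppercase.
    return ($Name -replace '[^A-Za-z0-9]+', '_').ToUpperInvariant()
}

function Get-VaultFieldObject {
    param([Parameter(Mandatory)][string]$Json)
    $data = ($Json | ConvertFrom-Json).data
    if ($null -eq $data) { throw "Vault response has no 'data' object" }
    # Assumption: KV v2 nests fields under data.data next to data.metadata;
    # KV v1 puts the fields directly under data.
    $names = @($data.PSObject.Properties.Name)
    if (($names -contains 'data') -and ($names -contains 'metadata')) {
        $data = $data.data
    }
    return $data
}

function Set-VaultFields {
    param([Parameter(Mandatory)][string]$Path)
    $json = & vault kv get -format=json $Path | Out-String
    if ($LASTEXITCODE -ne 0) { throw "vault kv get failed for $Path" }

    $fields = Get-VaultFieldObject -Json $json
    # Log field NAMES only so the job log never contains secret values.
    Write-Host "Fields in ${Path}: $(@($fields.PSObject.Properties.Name) -join ', ')"

    foreach ($p in $fields.PSObject.Properties) {
        $value = [string]$p.Value
        # Export under the original name and under the UPPER_SNAKE_CASE variant.
        [Environment]::SetEnvironmentVariable($p.Name, $value, 'Process')
        [Environment]::SetEnvironmentVariable((ConvertTo-UpperSnakeCase $p.Name), $value, 'Process')
    }
}
```

In this sketch a field named confluent-cloud-email would be exported as both confluent-cloud-email and CONFLUENT_CLOUD_EMAIL. The normalization only handles separators, so a camelCase field name would not be split into words.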

2. Always emit the metric

The emitter binary is now built before the failable section. The vault auth + CLI build + smoke test are wrapped in one PowerShell try / catch / finally, and the finally clause always calls otel-smoke-metric.exe with the captured $RESULT:

  • $RESULT = "0" is the default
  • Only set to "1" after a clean test pass

So vault failures, build failures, test failures, or any thrown exception all report cli_smoke_test_result{os="windows",arch="amd64"} = 0 to Heracles. The windows panel in cc-terraform-monitoring#9639 will now go red on infra failures instead of going silent.
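
The always-emit pattern described above can be sketched roughly as follows. This is a hedged illustration rather than the block in the PR: Invoke-SmokeRun and Assert-ExitCode are hypothetical names, and the two script blocks stand in for the real vault auth + CLI build + smoke test sequence and for the call to otel-smoke-metric.exe. The explicit exit-code check is there because, in Windows PowerShell, a native command that exits non-zero does not throw on its own, so try / catch would otherwise never see the failure.

```powershell
# Hypothetical sketch of the try / catch / finally wrapper; not the PR's code.

function Assert-ExitCode {
    param([Parameter(Mandatory)][string]$What)
    # Call this right after a native command (vault, go, ...) to turn a
    # non-zero exit code into a terminating error that try / catch can see.
    if ($LASTEXITCODE -ne 0) { throw "$What failed with exit code $LASTEXITCODE" }
}

function Invoke-SmokeRun {
    param(
        [Parameter(Mandatory)][scriptblock]$Steps,  # stand-in for vault auth + build + test
        [Parameter(Mandatory)][scriptblock]$Emit    # stand-in for the metric emitter call
    )
    $result = "0"              # default: report failure
    try {
        & $Steps
        $result = "1"          # only reached when no step threw
    }
    catch {
        Write-Host "Smoke run failed: $($_.Exception.Message)"
    }
    finally {
        & $Emit $result        # runs on success and on every failure path
    }
}

# Usage: a failing step still reaches the emitter, which receives "0".
Invoke-SmokeRun -Steps { throw "vault auth failed" } -Emit { param($r) Write-Host "emit $r" }
```

In the real job the -Emit block would be the call to .\bin\otel-smoke-metric.exe with the result, and each native command inside -Steps would be followed by a check like Assert-ExitCode.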

Test plan

  • Manually promote the smoke pipeline once after merge
  • Look at the windows job log — confirm the Fields in v1/ci/kv/apif/cli/live-testing-data: … line lists field names that include something matching CONFLUENT_CLOUD_EMAIL / CONFLUENT_CLOUD_PASSWORD (after upper-snake-case normalization)
  • If the test runs and passes: confirm cli_smoke_test_result{os="windows",arch="amd64"} = 1 shows up in Heracles
  • If the test still fails (e.g. fields really aren't there): confirm cli_smoke_test_result{os="windows",arch="amd64"} = 0 shows up — the new always-emit guarantee makes the failure visible

Out of scope

  • Linux block is unchanged.
  • The two sets of graphs in Grafana are already handled by the merged dashboard PR (cc-terraform-monitoring#9639), which uses separate panels per {os, arch} label, the standard Prometheus per-platform pattern.

🤖 Generated with Claude Code

…t metric

Two fixes after the first windows/amd64 run failed at vault auth:

1. The hardcoded `vault kv get -field=CONFLUENT_CLOUD_EMAIL` calls failed
   because the actual field names in v1/ci/kv/apif/cli/live-testing-data
   are not the same as the env var names the live test expects. Linux
   gets away with this because vault-sem-get-secret normalizes field
   names; vault-sem-get-secret is Linux-only.

   Replaced the hardcoded lookups with a Set-VaultFields helper that:
   - Pulls each secret as JSON
   - Logs the field names it found (so future failures are debuggable)
   - Exports every field under BOTH its original name AND an
     UPPER_SNAKE_CASE variant, covering every common naming convention
     (email/EMAIL, confluent-cloud-email, confluent_cloud_email, etc.)

2. Wrapped the entire vault + build + test sequence in one PowerShell
   try/catch/finally block. The emitter is built BEFORE this block, and
   the finally clause ALWAYS calls it with the final RESULT. So vault
   auth failures, build failures, test failures, or any thrown error
   now report cli_smoke_test_result=0 instead of leaving the windows
   panel showing "No data" on infra issues.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 2, 2026 00:56
@danielayaz Daniel Ayaz (danielayaz) requested a review from a team as a code owner May 2, 2026 00:56
@confluent-cla-assistant

🎉 All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

Copilot AI left a comment

Pull request overview

Updates the Windows/amd64 Semaphore smoke-test block to be more resilient to Vault/infra failures and to consistently report the OTLP smoke-test metric (so the Windows panel doesn’t show “No data” on failure).

Changes:

  • Replaces hardcoded vault kv get -field=... lookups with a PowerShell helper that reads Vault secrets as JSON and exports all fields (including an uppercased variant).
  • Wraps Vault auth + CLI build + smoke test in a try/catch/finally so Slack + metric emission happen even when earlier steps fail.
  • Builds the metric emitter before the failable section.


- $Env:SMOKE_COMMAND = "environment_list"

# Build emitter + CLI directly (no `make` available without bash; mirrors Makefile targets)
# Build the emitter FIRST so it's available even if vault/test setup fails later.
Set-VaultFields -Path "v1/ci/kv/apif/cli/live-testing-data"
Set-VaultFields -Path "v1/ci/kv/apif/cli/slack-notifications-live-testing"

# Build the CLI under test.
# Report pass/fail metric (never fails the pipeline; emitter always exits 0)
& .\bin\otel-smoke-metric.exe $RESULT
# Always emit the metric so the windows/amd64 panel never goes to "No data" on infra failure.
& .\bin\otel-smoke-metric.exe $RESULT
@sgagniere
Member

Logs every field name it finds — so the next run will print e.g. Fields in v1/ci/kv/apif/cli/live-testing-data: email, password and we'll know exactly what's there

Can we guarantee that we won't also log the secret values themselves?

@sonarqube-confluent

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

3 participants