Bug: Fix CI Error In Production#557
Merged
Merged
Conversation
Surfaced by the new Tier 2 lint job — the prior CI was either tolerating these or never ran sqlfluff cleanly against the models.
The Tier 2 SQL Server lane was failing on CI even though the sqlserver-configurator container ran init.sql successfully and exited 0. Root cause: `docker compose up --wait` treats "not running" as a failure regardless of exit code, so a container that runs to completion makes --wait return non-zero. This worked on developer laptops because the configurator was still mid-execution when --wait polled — a timing race that masked the bug. CI runners are fast enough to lose the race. Fix: split the service list in scripts/ci/compose-up.sh into long-running services (healthchecked, brought up with --wait) and init containers (brought up plain, then waited on with `docker compose wait`, a separate command that blocks on exit and surfaces the exit code). Convention used: anything matching `*-configurator` is treated as an init container. Easy to extend if other init sidecars are added later. Also retracted a wrong assertion in compose.yml — the comment on sqlserver-configurator claimed --wait treats exit-0 as success. It does not. Replaced with an accurate description of the constraint and a pointer to how compose-up.sh handles it. Verified locally: ./scripts/ci/test.sh sqlserver # PASS=49 ERROR=0
GitHub Actions is migrating its default JavaScript runtime from Node 20 to Node 24. The currently SHA-pinned action versions we use (actions/checkout@v4, astral-sh/setup-uv@v3, google-github-actions/auth@v2) are Node 20-based and would fail after the June 2, 2026 default flip. Set `ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION=true` at workflow scope in all four CI workflows (pr.yml, main.yml, release.yml, cut-release-candidate.yml). This keeps the existing pins working through the June flip and until we have time to regression-test the workflows against Node 24-compatible action versions. ⚠ Hard deadline: 2026-09-16. On that date GitHub removes Node 20 from the runner entirely. The env var stops working then. Before that date, bump each action SHA to a Node 24-compatible release, verify all four workflows pass, then remove the env block from each file. Tracked in specs/ci-rework/README.md §12.5 alongside the existing operational debt items. Dependabot (already configured) should surface compatible bumps automatically once the upstream actions tag Node 24 releases.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Post-merge CI fixes: lint, SQL Server, Node 20 mitigation
Cleanup PR for issues surfaced by the first real
main.ymlrun after theCI rework landed. Three independent fixes:
1. Fix lint violations in
dim_dbt__current_models+ staging modelsThe new Tier 2 lint job caught real style debt the prior CI was either
tolerating or never running cleanly. Two rules involved:
layout.cte_newline) — 8 violations across 7 files.Auto-fixed by
sqlfluff fix: blank line inserted after each inlineCTE closing bracket.
references.qualification) — 1 violation indim_dbt__current_models.sql. Resolved by aliasing the innersubquery's
basetable toband qualifying the reference there.Outer WHERE stays unqualified to satisfy RF03.
No SQL semantics changed — layout and reference-style only.
2. Fix
compose-up.shfor short-lived init containersThe Tier 2 SQL Server lane was failing even though
init.sqlransuccessfully. Root cause:
docker compose up --waittreats not runningas a failure regardless of exit code, so a container that runs to
completion makes
--waitreturn non-zero.This worked locally because the configurator was still mid-execution
when
--waitpolled — a timing race that masked the bug on devlaptops. CI runners are fast enough to lose the race.
Fix: split the service list in
scripts/ci/compose-up.shintolong-running (use
--wait) vs. init containers (usedocker compose wait— a different command that blocks on container exit andsurfaces the exit code). Also retracted a wrong comment in
compose.yml.3. Mitigate Node 20 deprecation warning
GitHub Actions is migrating its default JavaScript runtime from Node
20 to Node 24. Our SHA-pinned action versions
(
actions/checkout@v4,astral-sh/setup-uv@v3,google-github-actions/auth@v2) are Node 20-based and would failafter the June 2, 2026 default flip.
Added
ACTIONS_ALLOW_USE_UNSECURE_NODE_VERSION: "true"at workflowscope in all four CI workflows. Keeps things running through the flip
and until we regression-test against Node 24-compatible action
versions.
⚠ Hard deadline: 2026-09-16. On that date Node 20 is removed
entirely. Before then, bump action SHAs to Node 24-compatible
releases and remove the env block. Tracked in
specs/ci-rework/README.md §12.5. Dependabot should surface thebumps automatically.
Test plan
./scripts/ci/lint.shclean locally./scripts/ci/test.sh sqlserverpasses end-to-end locally (newinit-container path)
pushtomainafter merge: lint and SQL Server lanesboth green (previously both red)
didn't change) but workflows don't fail when the June flip
lands