Skip to content

NO-JIRA: Update e2e diagnostics#488

Open
theobarberbany wants to merge 1 commit intoopenshift:mainfrom
theobarberbany:tb/fix-e2e-diag
Open

NO-JIRA: Update e2e diagnostics#488
theobarberbany wants to merge 1 commit intoopenshift:mainfrom
theobarberbany:tb/fix-e2e-diag

Conversation

@theobarberbany
Copy link
Copy Markdown
Contributor

@theobarberbany theobarberbany commented Mar 5, 2026

  • Exclude e2e from ginkgo parallelism
  • Collect diagnostics in JustAfterEach (before cleanup) for spec-body
    failures; fall back to ReportAfterEach for cleanup failures
  • Use composite specKey hash to prevent collisions for table-driven specs
  • Add 30s context timeout to diagnostic API calls
  • Add stack traces to panic recovery
  • Add unit tests for the diagnostics infrastructure using a fake client,
    runnable via make unit without a cluster.

@openshift-ci-robot
Copy link
Copy Markdown

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@theobarberbany theobarberbany changed the title Update e2e diagnostics NO-JIRA: Update e2e diagnostics Mar 5, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Reworks e2e diagnostics to collect context-bound diagnostic strings (returned by helpers instead of writing to GinkgoWriter), store per-spec diagnostics, thread a 5-minute timeout context through diagnostic collection calls, remove per-helper panic recover wrappers, inject stored diagnostics into JUnit failure messages, and add unit tests and test-invocation tweaks.

Changes

Cohort / File(s) Summary
Diagnostic collection helpers
e2e/e2e_common.go
Adds an unexported package-level specDiagnostics map; replaces dumpTrackedResources() with collectTrackedResourceDiagnostics(parentCtx context.Context) string that builds and returns diagnostics text using a derived context.WithTimeout(..., 5*time.Minute) and threads that context into dumpSingleResource(ctx, ...), describeObjectEvents(ctx, ...), dumpAllAWSMachineTemplates(ctx, ...), and dumpNamespaceEvents(ctx, ...); removes per-helper recover() logic and stops writing directly to GinkgoWriter.
Test reporting hooks & JUnit
e2e/e2e_test.go
Adds JustAfterEach to store diagnostics for failed specs into specDiagnostics keyed by specKey(...); updates ReportAfterEach to collect diagnostics only when missing and always reset resourcesUnderTest; modifies ReportAfterSuite to append pre-collected diagnostics into failed SpecReport.Failure.Message, switches to reporters.GenerateJUnitReportWithConfig with OmitFailureMessageAttr: true, and removes the top-level defer/recover in reporting.
Diagnostics unit test
e2e/e2e_diagnostics_test.go
Adds TestDiagnostics (build-tagged !e2e) and withFakeClient helper to exercise diagnostics collection using a controller-runtime fake client: tests present/absent resource messages, pre-populated specDiagnostics handling, and specKey uniqueness.
Makefile test targets
Makefile
unit target now runs go test -count=1 -race -run TestDiagnostics ./e2e/... in addition to existing unit steps; e2e target prepends GINKGO_EXTRA_ARGS="--tags=e2e $(GINKGO_EXTRA_ARGS)" to the e2e invocation.
Test script parallelism
hack/test.sh
When TEST_DIRS contains e2e, clears PARALLEL_FLAG so Ginkgo is invoked without -p/--procs for e2e runs; other script behavior unchanged.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Update e2e diagnostics' is vague and generic, using non-descriptive language that doesn't clearly convey the specific nature of the changes (removing duplicate diagnostic output via GinkgoWriter bypass). Consider a more specific title like 'Refactor e2e diagnostics to avoid duplication' or 'Bypass GinkgoWriter to prevent duplicate diagnostics' that better captures the main objective.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 5, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@theobarberbany: This pull request explicitly references no jira issue.

Details

In response to this:

Don't double print diagnostics, bypassing GinkgoWriter entirely to avoid duplication. Relying on CapturedGinkgoWriterOutput was causing the dupes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from RadekManak and chrischdi March 5, 2026 17:20
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e/e2e_common.go`:
- Around line 71-73: The comment for specDiagnostics is inaccurate: it says the
map is keyed by “spec text” but the code keys entries by
LeafNodeLocation.String(); update the comment for specDiagnostics to state that
entries are keyed by the spec's LeafNodeLocation.String() (or simply “leaf node
location string”) and mention that ReportAfterEach populates it and
ReportAfterSuite reads it, so future maintainers won't assume spec text is the
key.

In `@e2e/e2e_test.go`:
- Around line 39-40: Replace the weak map key using only
report.LeafNodeLocation.String() when populating specDiagnostics (where diag :=
collectTrackedResourceDiagnostics() and
specDiagnostics[report.LeafNodeLocation.String()] = diag) with a composite key
that includes the leaf location plus the spec's unique text (e.g., combine
report.LeafNodeLocation.String() and report.FullText() or equivalent) so each
spec instance is unique; apply the same change to the other occurrence around
the specDiagnostics assignment at the later location (the lines referenced as
68-69) to prevent overwriting diagnostics for table-driven or helper-generated
specs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 33c5a4e8-e29b-454b-af86-9e5a52c242eb

📥 Commits

Reviewing files that changed from the base of the PR and between db34340 and 07fcbc1.

📒 Files selected for processing (2)
  • e2e/e2e_common.go
  • e2e/e2e_test.go

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
e2e/e2e_test.go (1)

54-66: Diagnostics stored even when collection panics twice.

On line 61, diag is stored regardless of ok. If both JustAfterEach and ReportAfterEach panic during collection, this stores an empty string. This appears intentional to prevent infinite retry attempts, but the JUnit failure message will have an empty diagnostics section. Consider whether a placeholder like "(diagnostics unavailable due to collection failure)" would be more informative.

💡 Optional: Add placeholder for failed diagnostics
 		if _, exists := specDiagnostics[specKey(report)]; !exists {
 			diag, ok := collectTrackedResourceDiagnostics()
 			if !ok {
 				fmt.Fprintf(os.Stderr, "WARNING: diagnostics collection panicked in both JustAfterEach and ReportAfterEach for %s\n", report.FullText())
+				diag = "(diagnostics unavailable due to collection failure)"
 			}
 			specDiagnostics[specKey(report)] = diag
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/e2e_test.go` around lines 54 - 66, The ReportAfterEach block stores
whatever diag is returned from collectTrackedResourceDiagnostics into
specDiagnostics[specKey(report)] even when collectTrackedResourceDiagnostics
returns ok==false (i.e., collection panicked twice), leaving an empty
diagnostics string; change the logic in the ReportAfterEach closure (the
function using ReportAfterEach, specDiagnostics, specKey and
collectTrackedResourceDiagnostics) to only store diag when ok==true and
otherwise store a clear placeholder like "(diagnostics unavailable due to
collection failure)" so JUnit output is informative; ensure this replacement
happens before writing to specDiagnostics[specKey(report)] and preserve existing
behavior that prevents repeated retries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@e2e/e2e_test.go`:
- Around line 54-66: The ReportAfterEach block stores whatever diag is returned
from collectTrackedResourceDiagnostics into specDiagnostics[specKey(report)]
even when collectTrackedResourceDiagnostics returns ok==false (i.e., collection
panicked twice), leaving an empty diagnostics string; change the logic in the
ReportAfterEach closure (the function using ReportAfterEach, specDiagnostics,
specKey and collectTrackedResourceDiagnostics) to only store diag when ok==true
and otherwise store a clear placeholder like "(diagnostics unavailable due to
collection failure)" so JUnit output is informative; ensure this replacement
happens before writing to specDiagnostics[specKey(report)] and preserve existing
behavior that prevents repeated retries.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: cc24425c-a0a8-4296-8d6a-3f01b1903268

📥 Commits

Reviewing files that changed from the base of the PR and between 07fcbc1 and 4333b8e.

📒 Files selected for processing (5)
  • Makefile
  • e2e/e2e_common.go
  • e2e/e2e_diagnostics_test.go
  • e2e/e2e_test.go
  • hack/test.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • e2e/e2e_common.go

@theobarberbany
Copy link
Copy Markdown
Contributor Author

/test e2e-aws-capi-techpreview

@nrb
Copy link
Copy Markdown
Contributor

nrb commented Mar 25, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2026
@openshift-ci-robot
Copy link
Copy Markdown

Tests from second stage were triggered manually. Pipeline can be controlled only manually, until HEAD changes. Use command to trigger second stage.

@sunzhaohua2
Copy link
Copy Markdown
Contributor

/test e2e-aws-capi-techpreview

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 27, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 27, 2026

New changes are detected. LGTM label has been removed.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 27, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from nrb. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@e2e/e2e_common.go`:
- Around line 140-154: The current diagnostics code creates a single diagCtx
with a 30s deadline and reuses it for all dumps, causing later API calls to fail
if an earlier call is slow; instead, stop using one shared diagCtx and pass the
parent ctx into the dump calls (dumpSingleResource, dumpAllAWSMachineTemplates,
dumpNamespaceEvents) and ensure each individual API call inside those functions
wraps the parent ctx with its own context.WithTimeout(ctx, 30*time.Second)
(callCtx, cancel) / defer cancel() so each Get/List has its own 30s budget
rather than sharing a single deadline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 57e206f6-bc51-48a2-990f-e0224c68d7d7

📥 Commits

Reviewing files that changed from the base of the PR and between 4333b8e and 2707376.

📒 Files selected for processing (5)
  • Makefile
  • e2e/e2e_common.go
  • e2e/e2e_diagnostics_test.go
  • e2e/e2e_test.go
  • hack/test.sh
🚧 Files skipped from review as they are similar to previous changes (3)
  • hack/test.sh
  • Makefile
  • e2e/e2e_diagnostics_test.go

specDiagnostics[specKey(CurrentSpecReport())] = diag
}
}
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add a node timeout this this JustAfterEach node, which will be applied to ctx.

@theobarberbany
Copy link
Copy Markdown
Contributor Author

@coderabbitai full

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 30, 2026

@theobarberbany Sure, I'll redo the full review of the PR!

✅ Actions performed

Full review triggered.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
e2e/e2e_test.go (1)

75-76: Minor: Use the already-dereferenced sr for consistency.

Line 75 accesses report.SpecReports[i] directly while sr (a pointer to the same element) is already available. Using *sr would be clearer.

♻️ Suggested fix
-		if diag, ok := specDiagnostics[specKey(report.SpecReports[i])]; ok {
+		if diag, ok := specDiagnostics[specKey(*sr)]; ok {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/e2e_test.go` around lines 75 - 76, The code uses report.SpecReports[i]
even though you already have a pointer sr to that element; replace
report.SpecReports[i] with the dereferenced pointer *sr when calling specKey and
when referencing that element for consistency (e.g., change
specKey(report.SpecReports[i]) to specKey(*sr) and any similar usages), keeping
the rest of the logic that appends diag to sr.Failure.Message unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@e2e/e2e_test.go`:
- Around line 75-76: The code uses report.SpecReports[i] even though you already
have a pointer sr to that element; replace report.SpecReports[i] with the
dereferenced pointer *sr when calling specKey and when referencing that element
for consistency (e.g., change specKey(report.SpecReports[i]) to specKey(*sr) and
any similar usages), keeping the rest of the logic that appends diag to
sr.Failure.Message unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: cb8d0e2a-25c7-4d56-9432-08ca15977fff

📥 Commits

Reviewing files that changed from the base of the PR and between 2707376 and 75107fb.

📒 Files selected for processing (5)
  • Makefile
  • e2e/e2e_common.go
  • e2e/e2e_diagnostics_test.go
  • e2e/e2e_test.go
  • hack/test.sh
✅ Files skipped from review due to trivial changes (1)
  • hack/test.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • Makefile

- Exclude e2e from ginkgo parallelism
- Collect diagnostics in JustAfterEach (before cleanup) for spec-body
failures; fall back to ReportAfterEach for cleanup failures
- Use composite specKey hash to prevent collisions for table-driven specs
- Add 30s context timeout to diagnostic API calls
- Add stack traces to panic recovery
- Add unit tests for the diagnostics infrastructure using a fake client,
  runnable via make unit without a cluster.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
e2e/e2e_common.go (1)

126-141: ⚠️ Potential issue | 🟠 Major

Use per-call timeouts instead of one shared diagnostics budget.

diagCtx at Line 127 is reused for every Get and List in this dump. One slow fetch can consume most or all of the 5-minute budget and starve the remaining resource and event queries, which is exactly when these diagnostics are most useful. Pass parentCtx through here and wrap each individual API call in its own short context.WithTimeout(...) instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/e2e_common.go` around lines 126 - 141, The diagnostics function
collectTrackedResourceDiagnostics currently creates a single diagCtx used for
all API calls (dumpSingleResource, dumpAllAWSMachineTemplates,
dumpNamespaceEvents) which can let one slow call exhaust the 5-minute budget;
change the pattern to pass parentCtx into these dump helpers and have each API
operation inside them create its own short timeout (e.g.,
context.WithTimeout(parentCtx, shortDuration)) so each Get/List has a per-call
deadline instead of sharing diagCtx. Update dumpSingleResource,
dumpAllAWSMachineTemplates, and dumpNamespaceEvents signatures/usages to accept
parentCtx and implement per-call context.WithTimeout wrappers around individual
client calls. Ensure cancels are deferred for each per-call context.
🧹 Nitpick comments (1)
e2e/e2e_diagnostics_test.go (1)

71-80: This subtest doesn't exercise the skip path it names.

It only pre-populates specDiagnostics and checks the map. If ReportAfterEach stopped honoring the exists guard and overwrote the first entry, this would still pass. Consider extracting the store-if-absent branch into a small helper and testing that directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@e2e/e2e_diagnostics_test.go` around lines 71 - 80, The test currently only
pre-populates specDiagnostics and checks the map but never exercises the actual
"skip" logic in ReportAfterEach; extract the store-if-absent branch into a small
helper (e.g., AddSpecDiagnosticIfNotExists or StoreSpecDiagnosticIfAbsent) that
accepts a key and diagnostics (from collectTrackedResourceDiagnostics) and
returns whether it stored or skipped, move the map-write logic that currently
lives in ReportAfterEach into that helper, then update the subtest to call the
helper directly and assert it returns "skipped" when the key already exists and
"stored" when it does not, while keeping ReportAfterEach calling the helper so
behavior stays unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@e2e/e2e_common.go`:
- Around line 126-141: The diagnostics function
collectTrackedResourceDiagnostics currently creates a single diagCtx used for
all API calls (dumpSingleResource, dumpAllAWSMachineTemplates,
dumpNamespaceEvents) which can let one slow call exhaust the 5-minute budget;
change the pattern to pass parentCtx into these dump helpers and have each API
operation inside them create its own short timeout (e.g.,
context.WithTimeout(parentCtx, shortDuration)) so each Get/List has a per-call
deadline instead of sharing diagCtx. Update dumpSingleResource,
dumpAllAWSMachineTemplates, and dumpNamespaceEvents signatures/usages to accept
parentCtx and implement per-call context.WithTimeout wrappers around individual
client calls. Ensure cancels are deferred for each per-call context.

---

Nitpick comments:
In `@e2e/e2e_diagnostics_test.go`:
- Around line 71-80: The test currently only pre-populates specDiagnostics and
checks the map but never exercises the actual "skip" logic in ReportAfterEach;
extract the store-if-absent branch into a small helper (e.g.,
AddSpecDiagnosticIfNotExists or StoreSpecDiagnosticIfAbsent) that accepts a key
and diagnostics (from collectTrackedResourceDiagnostics) and returns whether it
stored or skipped, move the map-write logic that currently lives in
ReportAfterEach into that helper, then update the subtest to call the helper
directly and assert it returns "skipped" when the key already exists and
"stored" when it does not, while keeping ReportAfterEach calling the helper so
behavior stays unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: df5bef14-0dc3-45ed-adb8-e4027729ca51

📥 Commits

Reviewing files that changed from the base of the PR and between 75107fb and 0c4ab88.

📒 Files selected for processing (5)
  • Makefile
  • e2e/e2e_common.go
  • e2e/e2e_diagnostics_test.go
  • e2e/e2e_test.go
  • hack/test.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • hack/test.sh

@theobarberbany
Copy link
Copy Markdown
Contributor Author

/test e2e-aws-capi-techpreview

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 30, 2026

@theobarberbany: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants