JS screenshot suite: defend against silent test-hang chains#5125
Open
shai-almog wants to merge 4 commits into
Open
JS screenshot suite: defend against silent test-hang chains#5125shai-almog wants to merge 4 commits into
shai-almog wants to merge 4 commits into
Conversation
The JS-port screenshot suite is currently timing out on master. Three independent suite runs (l10n #5105, dedupe-paint-leak #5111, modern-themes #5054) all hit the 30-min ``CN1_JS_TIMEOUT_SECONDS`` budget without ever emitting ``CN1SS:SUITE:FINISHED``. The blow-up test drifts run-to-run but the failure pattern is the same: a single test's done() never fires and the suite-level wait expires. Two reproducible patterns surface in the CI browser logs: 1. Animation-grid placeholder failure ``AbstractAnimationScreenshotTest.captureAndEmit`` builds the grid inside a try/catch and falls back to ``Image.createImage(w,h,color)`` if the per-frame composition throws. Under the canvas-accumulation tail (~80 prior tests) the JS port's ``BrowserDomRenderingBackend.createCanvas`` is hit by the known ``NUMBER_FOR_OBJECT recovery=substituted-null`` bug — ``window.getDocument()`` intermittently returns the JS Number 667 (viewport height), ``createElement('canvas')`` then resolves to null, and ``canvas.setWidth(...)`` NPEs. The placeholder createImage routes through the same path and re-throws the identical NPE. The original catch swallows the re-throw via the surrounding ``finally``, ``emitImage`` is never reached, and ``done()`` never fires. Symptom: l10n run hangs forever at ``SheetSlideUpAnimationScreenshotTest`` after one ``animation_grid_failed=java.lang.NullPointerException`` entry. Fix: nest the placeholder createImage in its own try/catch; pass ``null`` to ``emitImage`` on failure. ``emitImage`` already handles ``image==null`` by emitting a placeholder marker and still calling the ``onComplete`` runnable, so ``done()`` fires and the suite advances. 2. DualAppearance done() gate swallows fail() ``DualAppearanceBaseTest.done()`` is gated by ``bothPhasesComplete`` so the JS port's emit-completion force-done can't finalise the test between the light and dark captures. The gate also swallows ``fail()`` → ``done()`` calls, which is the path port.js's ``lambdaBridge`` catch takes when ``runTest`` throws synchronously: the test fails, but the gate ignores the done() and the suite hangs. Additionally, the EDT-side emit chain (onShowCompleted → registerReadyCallback's UITimer → triple callSerially → emitCurrentFormScreenshot → Display.screenshot → chunked emit → onComplete) silently breaks when any link throws — the EDT error handler logs it but the chain's ``next.run()`` never fires. Fix: - ``done()`` override now bypasses the ``bothPhasesComplete`` gate when ``isFailed()`` returns true, so fail() actually finalises. - ``runTest`` body wrapped in a Throwable catch that logs + calls ``fail()`` on any synchronous exception out of installModernThemeIfRequested or runAppearance. - Each appearance phase now schedules an 8s wall-clock fallback via ``CN.setTimeout``. If the natural emit chain hasn't fired ``next.run()`` in that window, the fallback forces it (idempotently — whichever fires first wins). Without this, an uncaught EDT throw in the chain hangs the test for the entire suite budget. Combined effect: the suite now makes forward progress in the presence of the documented canvas-staleness / NUMBER_FOR_OBJECT bug. Individual tests may still produce placeholder PNGs or warn about a stalled chain, but the suite finishes and downstream comparison runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Compared 122 screenshots: 122 matched. Native Android coverage
✅ Native Android screenshot tests passed. Native Android coverage
Benchmark ResultsDetailed Performance Metrics
|
Initial CI for the safety-net patches confirmed the design works — ButtonTheme and TextFieldTheme now finish (previously hung the suite), and 87 of ~95 tests reach completion. But two refinements are needed before the suite walks the full DEFAULT_TEST_CLASSES array inside its budget: 1. ``DualAppearanceBaseTest.finish()`` runs on the EDT after the dark capture's completion runnable returns. Under the canvas-accumulation tail it can hit the host-bridge ``Missing JS member createElement`` cascade inside refreshTheme's form re-paint walk. The throw skips ``bothPhasesComplete = true`` so the gated done() never fires and the test hangs for the rest of the suite budget. Wrap the cleanup in try/finally so the gate lifts and done() fires regardless of whether the restore-to-default-theme work succeeds. Next test inherits whatever theme state the throw left behind (best-effort teardown), but the suite advances. 2. CN1_JS_TIMEOUT_SECONDS 1800 → 2700 (and matching browser lifetime 1740 → 2640). #5054 introduced 14 modern-theme tests, each light + dark phase eats ~16-20s wall on shared GHA runners with the canvas-accumulation slowdown. Total walks past 30 min; 45 min absorbs the worst case. Job timeout is 6h so this is well within the GHA budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
✅ Continuous Quality ReportTest & Coverage
Static Analysis
Generated automatically by the PR CI workflow. |
Collaborator
Author
|
Compared 122 screenshots: 122 matched. Benchmark Results
Detailed Performance Metrics
|
Collaborator
Author
|
Compared 122 screenshots: 122 matched. Benchmark Results
Build and Run Timing
Detailed Performance Metrics
|
…ost ref
Runs 1 and 3 of the screenshot suite still hang at
ButtonThemeScreenshotTest immediately after
ToastBarTopPositionScreenshotTest's ``canvasContextWipe`` force-
timeout. The pattern is identical across both runs: ToastBar's skip
fires (port.js logs ``forcedTimeout:canvasContextWipe:HIT``), the
runner advances, and the next ``CN1SS:INFO:suite starting
test=ButtonThemeScreenshotTest`` line is the last entry before the
suite times out — no further logs, no lambda2RunBridge polls, no
``done()``.
The browser-side stack trace surfaced in our second run shows the
worker is throwing ``RuntimeException: Error: Missing JS member
createElement for host receiver`` repeatedly from inside the EDT
paint chain. Root cause is the documented "NUMBER_FOR_OBJECT
recovery=substituted-null" host-bridge bug: ``window.getDocument()``
intermittently returns the JS Number 667 (viewport height) instead
of a Document; ``createElement('canvas')`` on the wrapped Number
resolves to undefined; the worker's ``__cn1CachedDocWrapper`` then
carries that broken host ref for the rest of the suite.
The existing invalidation at ~line 1146 triggers only when
``!__cn1CachedDocWrapper.__class`` — but the 667-flavour wrapper
*does* have a valid ``__class`` (it was constructed as an
HTMLDocument wrapper), so the check returns the cached object
and the next ``createElement`` lookup hits the broken host ref.
Two narrow defenses:
1. Sanity-check ``__cn1CachedDocWrapper.__cn1HostRef`` before
returning the cached wrapper. If the ref isn't an object
(Number / undefined / null), treat the cache as broken,
null it, and re-fetch through the host bridge. The
re-fetch round-trip is dramatically more likely to return
a real Document than the cached Number.
2. Invalidate ``win.__cn1CachedDocWrapper`` in the
``forcedTimeoutReason``-handler path before calling
``finalizeMethod``. Force-timeout reasons like
``canvasContextWipe`` are precisely the signal that the
host-bridge wrappers got corrupted by the prior test's
canvas activity. Clearing the cache before finalising the
skipped test ensures the next test starts with a fresh
``getDocument()`` lookup instead of inheriting the broken
cached wrapper.
Neither change touches the host-bridge state itself — the
underlying NUMBER_FOR_OBJECT bug remains for a separate
investigation. These defenses just stop the corrupted wrapper
from cascading across tests and hanging the suite on the next
test that paints a canvas.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Master's JavaScript-port screenshot CI is currently timing out (3 consecutive failures on master: #5105 l10n, #5111 dedupe-paint-leak, #5054 modern-themes — each runs 30+ minutes without reaching
CN1SS:SUITE:FINISHED). The blow-up test drifts run-to-run but the same two failure patterns surface across all three runs.This PR patches both patterns at the Java test-runner layer so the suite makes forward progress in the presence of the documented canvas-staleness /
NUMBER_FOR_OBJECT recovery=substituted-nullbug (host bridge intermittently returns the JS Number667forwindow.getDocument()under the canvas-accumulation tail).Findings
l10n run (commit
defe5ed7b) — hangs atSheetSlideUpAnimationScreenshotTest. NPE inBrowserDomRenderingBackend.createCanvas→AbstractAnimationScreenshotTest.captureAndEmit's catch tries to build a placeholder viaImage.createImage(w,h,color)which routes through the same NPEing path → re-throw is swallowed by the surroundingfinally→emitImagenever runs →done()never fires.dedupe-paint-leak / our-PR runs (commit
f0c4acfc5/6b7d573dc) — hang atLightweightPickerButtonsScreenshotTestandButtonThemeScreenshotTestrespectively.DualAppearanceBaseTest.done()is gated onbothPhasesComplete; the gate also swallowsfail() → done()calls. Combined with EDT-side throws in the emit chain (onShowCompleted → registerReadyCallback's UITimer → triple callSerially → emitCurrentFormScreenshot) that log to the EDT error handler but never callnext.run(), the test hangs for the full suite budget.Fixes
AbstractAnimationScreenshotTest.captureAndEmit— nest the placeholdercreateImagein its own try/catch; passnulltoemitImageon failure.emitImagealready handlesimage==null(emits a placeholder marker, callsonComplete), sodone()fires and the suite advances.DualAppearanceBaseTest.done()override — bypass thebothPhasesCompletegate whenisFailed()returns true.fail()actually finalises now instead of being swallowed.DualAppearanceBaseTest.runTest— wrap the body in aThrowablecatch that callsfail()on any synchronous exception out ofinstallModernThemeIfRequestedorrunAppearance.DualAppearanceBaseTest.runAppearance— schedule an 8s wall-clock fallback viaCN.setTimeoutper appearance phase. If the natural emit chain hasn't firednext.run()in that window, the fallback forces it idempotently (aran[]guard ensures whichever fires first wins, the other is a no-op).Risk
isFailed()escape ondone()only changes behaviour for tests that have already hadfail()called — pre-existing pass / non-fail flows are unaffected.null-image fall-through inAbstractAnimationScreenshotTestuses the same code path asemitImage(null, ...)for genuinely-null inputs, so no new behaviour is introduced.scripts/hellocodenameone/common/— no framework or port code is touched.Test plan
scripts-javascriptworkflow reachesCN1SS:SUITE:FINISHEDwithin the 1740s budget.screenshot=nonemarker) and the suite continues.fail()is called, which native platforms don't currently hit).🤖 Generated with Claude Code