fix(reports): add network timeouts so schedules can't hang forever#41250
fix(reports): add network timeouts so schedules can't hang forever#41250rusackas wants to merge 2 commits into
Conversation
Alert/report execution made several network calls with no socket
timeout, so a Python socket would block indefinitely if the remote
endpoint became unreachable. When that happened mid-execution the report
schedule was left in the WORKING state permanently: every subsequent
tick then hit ReportSchedulePreviousWorkingError ("still working,
refusing to re-compute"), and manually resetting the state only led to
the next run wedging the same way. This explains reports getting stuck
in "sending" across all formats (CSV/PNG/PDF) after an environment
change, with no logs after the run starts.
Three unbounded calls are now bounded by configurable timeouts:
- SMTP send (superset/utils/core.py): smtplib.SMTP/SMTP_SSL get
SMTP_TIMEOUT (default 30s). Affects every email-based notification.
- Chart data fetch (superset/utils/csv.py): urllib opener.open() gets
ALERT_REPORTS_CSV_REQUEST_TIMEOUT (default 60s). Used by CSV/dataframe
report attachments.
- Selenium navigation (superset/utils/webdriver.py): driver.get() is
bounded by SCREENSHOT_PAGE_LOAD_WAIT (default 120s) via
set_page_load_timeout, so an unreachable page raises TimeoutException
instead of blocking.
With a finite timeout the failing call now raises, the state machine
transitions the schedule to ERROR, and the failure is surfaced/retried
instead of silently wedging the worker.
Fixes #40047
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Code Review Agent Run #8e6513Actionable Suggestions - 0Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
|
The flagged issue regarding missing type hints in the newly added test method is correct. To resolve this, you should update the method signature to include explicit type hints for all parameters and the return type (e.g., Since the file def test_your_method_name(patched_arg: PatchedType) -> None:
# implementationRegarding other comments, there are no additional review comments available in the provided PR context to check. |
✅ Deploy Preview for superset-docs-preview ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #41250 +/- ##
==========================================
- Coverage 64.36% 64.35% -0.01%
==========================================
Files 2651 2652 +1
Lines 144812 144869 +57
Branches 33417 33425 +8
==========================================
+ Hits 93208 93237 +29
- Misses 49935 49958 +23
- Partials 1669 1674 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review Agent Run #9a56c9Actionable Suggestions - 0Filtered by Review RulesBito filtered these suggestions based on rules created automatically for your feedback. Manage rules.
Review Details
Bito Usage GuideCommands Type the following command in the pull request comment and save the comment.
Refer to the documentation for additional commands. Configuration This repository uses Documentation & Help |
SUMMARY
Alert/report execution makes several network calls with no socket timeout, so the underlying Python socket blocks indefinitely if the remote endpoint becomes unreachable. When that happens mid-execution, the report schedule is left in the
WORKINGstate forever — every subsequent scheduler tick then raisesReportSchedulePreviousWorkingError("Report Schedule is still working, refusing to re-compute"), and manually resetting the state via SQL only causes the next run to wedge the same way.This matches the symptoms in #40047 exactly: reports stuck in "sending" across all formats (CSV/PNG/PDF), affecting both new and existing reports, with no logs after the run starts, appearing after an environment change (e.g. SMTP host or
WEBDRIVER_BASEURLbecoming unreachable) without any image rebuild. Because the hang is in a blocking C-level socket read, Celery'ssoft_time_limitoften can't interrupt it cleanly, and theworking_timeoutsweep only fires on a later tick — so the schedule stays wedged.Three previously-unbounded calls are now bounded by configurable timeouts:
smtplib.SMTP/SMTP_SSLemail sendsuperset/utils/core.pySMTP_TIMEOUT(30s)urllibchart-data fetch for CSV/dataframe attachmentssuperset/utils/csv.pyALERT_REPORTS_CSV_REQUEST_TIMEOUT(60s)driver.get()navigationsuperset/utils/webdriver.pySCREENSHOT_PAGE_LOAD_WAIT(120s) viaset_page_load_timeoutWith a finite timeout the failing call now raises instead of hanging; the report state machine transitions the schedule to
ERROR, the failure is surfaced/retried, and the worker is freed. The SMTP/CSV timeouts fall back to their defaults for custom configs that predate the new keys. All defaults can be set toNoneto restore the old unbounded behavior.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A (backend reliability fix).
TESTING INSTRUCTIONS
pytest tests/unit_tests/utils/csv_tests.py tests/unit_tests/utils/webdriver_test.pypytest tests/integration_tests/email_tests.py -k send_mimeSMTP_HOST(orWEBDRIVER_BASEURL) at an unroutable address (e.g. a blackhole IP) and trigger a report. Before this change the worker hangs and the schedule sticks inWORKING; after, the call times out, the report moves toERROR, and the error is logged.New tests cover: the SMTP timeout being passed (and defaulting when the key is absent), the CSV fetch forwarding the timeout to
opener.open, and the Selenium driver applying / skippingset_page_load_timeout.ADDITIONAL INFORMATION
🤖 Generated with Claude Code