feat: Implement diagnostic logic for fault tolerance - log retrieval and error classification by Torino233 · Pull Request #1696 · intelligent-machine-learning/dlrover

Torino233 · 2026-02-10T09:23:06Z

Summary

This PR implements the diagnostic logic for the fault tolerance mechanism. When a failure occurs, the system now collects actor logs, parses them to classify error types, and determines failure responsibility based on timestamps.

Changes

1. Enhanced `DiagnosticInfo` class (`common/actor_base.py`)

Added comprehensive documentation for error codes and their corresponding reasons:
- code=0: Normal execution
- code=1: Unknown error
- code=1001: CUDA OOM (FATAL)
- code=1002: NCCL/Communication error
- code=2001: User code exception
- code=3001: System OOM killed (FATAL)
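
For illustration, the code table above could be mirrored as an enum. This is only a sketch: the names `DiagnosticCode`, `FATAL_CODES`, and `is_fatal` are hypothetical and are not taken from the PR's `DiagnosticInfo` class. Only the numeric values and the FATAL markers come from the description.

```python
from enum import IntEnum


class DiagnosticCode(IntEnum):
    """Hypothetical enum mirroring the codes listed in the PR description."""

    NORMAL = 0
    UNKNOWN = 1
    CUDA_OOM = 1001
    NCCL_ERROR = 1002
    USER_EXCEPTION = 2001
    SYSTEM_OOM_KILLED = 3001


# The two codes the PR description marks as FATAL.
FATAL_CODES = frozenset(
    {DiagnosticCode.CUDA_OOM, DiagnosticCode.SYSTEM_OOM_KILLED}
)


def is_fatal(code: int) -> bool:
    """Return True if the code is one of the FATAL codes above."""
    return code in FATAL_CODES
```

Because `IntEnum` members compare equal to plain integers, `is_fatal` accepts either an enum member or a raw code.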

2. Implemented log retrieval and diagnosis (`backend/common/base_worker.py`)

Enhanced the `get_diagnostic()` method to:
- Retrieve actor logs using `ray.util.state.get_log(actor_id)`
- Limit log size to 1 MB to prevent memory issues

Added `_parse_diagnostic_info()` to classify errors based on log patterns:
- CUDA out of memory detection
- NCCL/Distributed communication errors
- System OOM kills
- User code exceptions
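
A minimal sketch of pattern-based classification along these lines follows. The regular expressions, the function name `classify_log`, and the match order are assumptions for illustration; the patterns actually used by `_parse_diagnostic_info()` are not shown on this page. Specific patterns are checked before the generic traceback pattern, since an OOM failure usually prints a traceback as well.

```python
import re
from typing import Tuple

# Hypothetical patterns, ordered from most to least specific.
_PATTERNS = [
    (1001, "CUDA OOM", re.compile(r"CUDA out of memory", re.IGNORECASE)),
    (
        3001,
        "System OOM killed",
        re.compile(r"oom[- ]?kill|Out of memory: Killed process", re.IGNORECASE),
    ),
    (
        1002,
        "NCCL/Communication error",
        re.compile(r"NCCL (error|timeout)|ncclSystemError|ProcessGroupNCCL"),
    ),
    (
        2001,
        "User code exception",
        re.compile(r"Traceback \(most recent call last\)"),
    ),
]


def classify_log(log_content: str) -> Tuple[int, str]:
    """Return (code, reason) for the first pattern found in the log.

    Assumption: this runs only after a failure, so a log with no
    recognizable pattern maps to code 1 (unknown) rather than 0.
    """
    for code, reason, pattern in _PATTERNS:
        if pattern.search(log_content):
            return code, reason
    return 1, "Unknown error"
```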

3. Added responsibility determination (`controller/manager.py`)

Introduced the `_determine_failure_responsibility()` method, which:
- Sorts failed instances by timestamp
- Marks the earliest failure as ROOT_CAUSE
- Marks subsequent failures as BE_AFFECTED

The call is integrated into `_record_failure()`, before the `is_failure_responsibility()` check.
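
The timestamp rule described above can be sketched as follows. The `FailedInstance` dataclass, the `Responsibility` enum, and the function signature are hypothetical stand-ins; the PR's actual types in `controller/manager.py` are not shown here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Responsibility(Enum):
    ROOT_CAUSE = "ROOT_CAUSE"
    BE_AFFECTED = "BE_AFFECTED"


@dataclass
class FailedInstance:
    """Hypothetical record of one failed actor."""

    name: str
    timestamp: float  # failure time, e.g. seconds since the epoch


def determine_failure_responsibility(
    failures: List[FailedInstance],
) -> Dict[str, Responsibility]:
    """Mark the earliest failure as ROOT_CAUSE and the rest as BE_AFFECTED."""
    ordered = sorted(failures, key=lambda f: f.timestamp)
    result: Dict[str, Responsibility] = {}
    for index, failure in enumerate(ordered):
        result[failure.name] = (
            Responsibility.ROOT_CAUSE if index == 0 else Responsibility.BE_AFFECTED
        )
    return result
```

Since `sorted` is stable, instances with identical timestamps keep their input order, so the first one listed would be treated as the root cause in this sketch.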

codecov · 2026-02-10T09:45:39Z

Codecov Report

❌ Patch coverage is 83.92857% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.46%. Comparing base (4c58d5e) to head (15d02f0).

Files with missing lines	Patch %	Lines
...rover/python/unified/backend/common/base_worker.py	80.55%	7 Missing ⚠️
dlrover/python/unified/controller/manager.py	87.50%	2 Missing ⚠️

❌ Your patch check has failed because the patch coverage (83.92%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1696      +/-   ##
==========================================
- Coverage   80.48%   80.46%   -0.02%     
==========================================
  Files         228      228              
  Lines       23056    23107      +51     
==========================================
+ Hits        18557    18594      +37     
- Misses       4499     4513      +14


BalaBalaYi · 2026-02-12T03:01:26Z

+                    log_content = "...[truncated]...\n" + log_content
+            except Exception as e:
+                logger.warning(f"Failed to get logs for actor {actor_id}: {e}")
+                log_content = f"[Failed to retrieve logs: {e}]"


Leave this empty or return a specific error code or constant. This makes it easier for the subsequent diagnostic process to analyze the content.
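
One way to read this suggestion, as a sketch only: return an empty string or a fixed constant rather than a free-form message, so later parsing never mistakes the placeholder for real log output. The names `LOG_RETRIEVAL_FAILED` and `fetch_log`, and the injected `get_log` callable, are hypothetical and not part of the PR.

```python
from typing import Callable, Iterable, Optional

# Hypothetical sentinel; downstream code can test for it by equality.
LOG_RETRIEVAL_FAILED = "__LOG_RETRIEVAL_FAILED__"


def fetch_log(
    actor_id: Optional[str],
    get_log: Callable[[str], Iterable[str]],
) -> str:
    """Return the log text, "" if there is no actor ID, or the sentinel on error."""
    if not actor_id:
        return ""
    try:
        return "".join(get_log(actor_id))
    except Exception:
        # A real implementation would also log the exception here.
        return LOG_RETRIEVAL_FAILED
```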

BalaBalaYi · 2026-02-12T03:02:17Z

+        if actor_id:
+            try:
+                log_lines = []
+                for line in get_log(actor_id=actor_id):


Needs confirmation: should this tail the log?
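
If tailing is the intent, a library-independent sketch could keep only the most recent lines and then enforce the 1 MB cap from the end of the text. The helper name `tail_lines` and its defaults are assumptions. Ray's `get_log` also documents a `tail` argument, which may make a helper like this unnecessary.

```python
from collections import deque
from typing import Iterable


def tail_lines(
    lines: Iterable[str],
    max_lines: int = 1000,
    max_bytes: int = 1024 * 1024,
) -> str:
    """Keep the last max_lines lines, then at most the last max_bytes bytes."""
    # A bounded deque discards older lines as newer ones arrive.
    recent = deque(lines, maxlen=max_lines)
    content = "".join(recent)
    encoded = content.encode("utf-8")
    if len(encoded) > max_bytes:
        # Cut from the front; ignore a multi-byte character split by the cut.
        kept = encoded[-max_bytes:].decode("utf-8", errors="ignore")
        content = "...[truncated]...\n" + kept
    return content
```

Errors that caused the failure tend to appear at the end of a log, which is why this sketch drops the oldest content first.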

BalaBalaYi · 2026-02-12T03:02:31Z

+                logger.warning(f"Failed to get logs for actor {actor_id}: {e}")
+                log_content = f"[Failed to retrieve logs: {e}]"
+        else:
+            log_content = "[Actor ID not available, cannot retrieve logs]"


Torino233 added 2 commits February 10, 2026 17:08

feat: get diagnosticinfo

701e943

feat: get diagnosticinfo

f97c2f1

BalaBalaYi added the feature label Feb 10, 2026

BalaBalaYi added this to the v0.7.0 milestone Feb 10, 2026

Merge branch 'master' into feature/get-diagnosticinfo

f9c2614

BalaBalaYi reviewed Feb 12, 2026

View reviewed changes

Torino233 added 2 commits February 12, 2026 16:11

feat: get diagnosticinfo

1a670d3

feat: get diagnosticinfo

15d02f0

Torino233 wants to merge 5 commits into intelligent-machine-learning:master from Torino233:feature/get-diagnosticinfo
