feat: Implement diagnostic logic for fault tolerance - log retrieval and error classification #1696

Draft
Torino233 wants to merge 5 commits into intelligent-machine-learning:master from Torino233:feature/get-diagnosticinfo

@Torino233
Collaborator

Summary

This PR implements the diagnostic logic for the fault tolerance mechanism. When a failure occurs, the system now collects actor logs, parses them to classify error types, and determines failure responsibility based on timestamps.

Changes

1. Enhanced DiagnosticInfo class (common/actor_base.py)

  • Added comprehensive documentation for error codes and their corresponding reasons:
    • code=0: Normal execution
    • code=1: Unknown error
    • code=1001: CUDA OOM (FATAL)
    • code=1002: NCCL/Communication error
    • code=2001: User code exception
    • code=3001: System OOM killed (FATAL)
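
One way to picture the code table above is as an enumeration. This is only a sketch: the numeric values and the FATAL markers come from the list, while the names `DiagnosticCode`, `FATAL_CODES` and `is_fatal` are hypothetical and may not match what `DiagnosticInfo` actually defines.

```python
from enum import IntEnum


class DiagnosticCode(IntEnum):
    """Hypothetical names; only the numeric values come from the PR description."""

    NORMAL = 0
    UNKNOWN = 1
    CUDA_OOM = 1001
    NCCL_ERROR = 1002
    USER_CODE_EXCEPTION = 2001
    SYSTEM_OOM_KILLED = 3001


# Codes that the description marks as FATAL.
FATAL_CODES = frozenset(
    {DiagnosticCode.CUDA_OOM, DiagnosticCode.SYSTEM_OOM_KILLED}
)


def is_fatal(code: int) -> bool:
    """Return True when the code is one of the FATAL codes listed above."""
    return code in FATAL_CODES
```

Because `IntEnum` members compare equal to plain integers, callers that only hold a raw `code` value can still use `is_fatal(code)` directly.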

2. Implemented log retrieval and diagnosis (backend/common/base_worker.py)

  • Enhanced get_diagnostic() method to:
    • Retrieve actor logs using ray.util.state.get_log(actor_id)
    • Limit log size to 1MB to prevent memory issues
  • Added _parse_diagnostic_info() to classify errors based on log patterns:
    • CUDA out of memory detection
    • NCCL/Distributed communication errors
    • System OOM kills
    • User code exceptions
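
The pattern-based classification described above could look roughly like the following. It is a minimal sketch, not the PR's `_parse_diagnostic_info()`: the function name `classify_log`, the regular expressions, and the check order are all assumptions chosen for illustration; only the error codes come from the description.

```python
import re

# (code, pattern) pairs checked in order; the first match wins.
# The patterns are illustrative guesses, not the ones used in the PR.
_PATTERNS = [
    (1001, re.compile(r"CUDA out of memory", re.IGNORECASE)),
    (3001, re.compile(r"oom[- ]?kill|Out of memory: Killed process", re.IGNORECASE)),
    (1002, re.compile(r"\bNCCL\b|ProcessGroupNCCL")),
    (2001, re.compile(r"Traceback \(most recent call last\)")),
]

UNKNOWN_ERROR = 1


def classify_log(log_content: str) -> int:
    """Map the log text of a failed actor to one of the error codes."""
    for code, pattern in _PATTERNS:
        if pattern.search(log_content):
            return code
    # Nothing recognizable in the log: fall back to "unknown error".
    return UNKNOWN_ERROR
```

Checking the specific patterns before the generic traceback pattern matters here, because a CUDA OOM usually also prints a Python traceback and would otherwise be reported as a user code exception.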

3. Added responsibility determination (controller/manager.py)

  • Introduced _determine_failure_responsibility() method:
    • Sorts failed instances by timestamp
    • Marks earliest failure as ROOT_CAUSE
    • Marks subsequent failures as BE_AFFECTED
  • Integrated the call before is_failure_responsibility() check in _record_failure()
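
The timestamp rule above can be sketched as follows. `FailedInstance`, `Responsibility` and the free function are hypothetical stand-ins for whatever types `controller/manager.py` really uses; only the rule itself (earliest failure is ROOT_CAUSE, later ones are BE_AFFECTED) comes from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Responsibility(Enum):
    ROOT_CAUSE = "ROOT_CAUSE"
    BE_AFFECTED = "BE_AFFECTED"


@dataclass
class FailedInstance:
    """Hypothetical record of one failed actor."""

    name: str
    timestamp: float
    responsibility: Optional[Responsibility] = None


def determine_failure_responsibility(
    failures: List[FailedInstance],
) -> List[FailedInstance]:
    """Sort by timestamp and mark the earliest failure as the root cause."""
    ordered = sorted(failures, key=lambda f: f.timestamp)
    for index, failure in enumerate(ordered):
        failure.responsibility = (
            Responsibility.ROOT_CAUSE if index == 0 else Responsibility.BE_AFFECTED
        )
    return ordered
```

Since `sorted` is stable, instances with identical timestamps keep their input order, so in this sketch a tie is resolved in favor of whichever failure was recorded first.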

@BalaBalaYi added this to the v0.7.0 milestone on Feb 10, 2026

codecov Bot commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 83.92857% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.46%. Comparing base (4c58d5e) to head (15d02f0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rover/python/unified/backend/common/base_worker.py | 80.55% | 7 Missing ⚠️ |
| dlrover/python/unified/controller/manager.py | 87.50% | 2 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (83.92%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1696      +/-   ##
==========================================
- Coverage   80.48%   80.46%   -0.02%     
==========================================
  Files         228      228              
  Lines       23056    23107      +51     
==========================================
+ Hits        18557    18594      +37     
- Misses       4499     4513      +14     

        log_content = "...[truncated]...\n" + log_content
    except Exception as e:
        logger.warning(f"Failed to get logs for actor {actor_id}: {e}")
        log_content = f"[Failed to retrieve logs: {e}]"
Collaborator

Leave this empty or return a specific error code or constant. This makes it easier for the subsequent diagnostic process to analyze the content.
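
A minimal sketch of what this suggestion might look like, assuming the "empty" option is chosen. The helper `read_log_or_empty`, its `fetch_lines` parameter and the `NO_LOG` constant are hypothetical and not part of the PR; the point is only that the exception text goes to the logger and never into the content that is later pattern-matched.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)

# Hypothetical constant: what the diagnosis step sees when no log is available.
NO_LOG = ""


def read_log_or_empty(fetch_lines: Callable[[], Iterable[str]]) -> str:
    """Join the fetched log lines, or return NO_LOG if fetching fails."""
    try:
        return "".join(fetch_lines())
    except Exception as e:
        # Report the failure through the logger only, so the exception text
        # cannot be mistaken for actor output during classification.
        logger.warning("Failed to get logs: %s", e)
        return NO_LOG
```

With this shape, the parser can treat an empty string as "no evidence" instead of scanning a bracketed error message that may itself contain words such as "memory" or "Traceback".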

if actor_id:
    try:
        log_lines = []
        for line in get_log(actor_id=actor_id):
Collaborator

Needs confirmation: should this tail the log?
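
If tailing is the desired behavior, one option is to keep only the most recent lines that fit within the 1MB limit from the PR description. The sketch below is library-independent and assumes the log arrives as an iterable of text lines; `tail_within_budget` and `MAX_LOG_BYTES` are hypothetical names. To my understanding, Ray's `get_log` also accepts a `tail` argument that limits the number of lines returned, which may be worth checking as an alternative.

```python
from collections import deque
from typing import Deque, Iterable

MAX_LOG_BYTES = 1024 * 1024  # the 1MB limit mentioned in the PR description


def tail_within_budget(lines: Iterable[str], max_bytes: int = MAX_LOG_BYTES) -> str:
    """Keep the newest lines whose combined UTF-8 size fits in max_bytes."""
    kept: Deque[str] = deque()
    size = 0
    for line in lines:
        kept.append(line)
        size += len(line.encode("utf-8"))
        # Drop the oldest lines until the budget is met again.
        while size > max_bytes and len(kept) > 1:
            size -= len(kept.popleft().encode("utf-8"))
    text = "".join(kept)
    data = text.encode("utf-8")
    if len(data) > max_bytes:
        # A single oversized line: keep only its last max_bytes bytes.
        text = data[-max_bytes:].decode("utf-8", errors="ignore")
    return text
```

Tailing favors the end of the log, which is usually where the fatal error and its traceback appear, while memory use stays bounded by the budget rather than by the full log size.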

        logger.warning(f"Failed to get logs for actor {actor_id}: {e}")
        log_content = f"[Failed to retrieve logs: {e}]"
else:
    log_content = "[Actor ID not available, cannot retrieve logs]"
Collaborator

Same issue as the earlier comment: leave this empty or return a specific error code or constant.