@pinky-nexthop pinky-nexthop commented Dec 16, 2025

Summary

This PR introduces a High Level Design (HLD) document for the PFC Watchdog Hardware Recovery feature, which detects and restores PFC storms more accurately than software-based approaches.

Key Features

  • Hardware-based PFC recovery mechanism using SAI attributes with improved accuracy and reduced latency
  • Architecture refactoring with new PfcWdHwOrch orchestrator for clean separation of hardware/software logic
  • New CLI command: show pfcwd status with hardware-specific information including actual timer values and granularity
  • Flow diagrams illustrating hardware vs software recovery decision flows and detailed workflows
  • Comprehensive comparison between software and hardware recovery mechanisms
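
The hardware vs software decision flow listed above could be sketched roughly as follows. This is an illustrative sketch only, assuming the platform reports a single boolean capability: the key `pfc_hw_recovery_supported` and the `RecoveryMode` enum are hypothetical names, not taken from the HLD or from SAI.

```python
from enum import Enum


class RecoveryMode(Enum):
    """Which PFC watchdog recovery path to use (hypothetical names)."""
    HARDWARE = "hardware"
    SOFTWARE = "software"


def select_recovery_mode(platform_caps: dict) -> RecoveryMode:
    """Pick hardware recovery when the platform advertises support.

    `platform_caps` stands in for whatever capability query the
    orchestrator performs at runtime; the key name is an assumption.
    A missing key falls back to the existing software implementation.
    """
    if platform_caps.get("pfc_hw_recovery_supported", False):
        return RecoveryMode.HARDWARE
    return RecoveryMode.SOFTWARE
```

Defaulting to the software path when the capability is absent keeps the sketch in line with the backward compatibility goal stated in this PR.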

Hardware Recovery Advantages

  • Better accuracy in storm detection: Hardware operates at line rate without software polling delays
  • Better accuracy in storm restoration: More precise timing without dependency on software polling intervals
  • Reduced latency: Immediate hardware detection and recovery actions
  • Lower CPU overhead: Eliminates continuous software polling, reducing system load

Document Structure

  • Complete SONiC HLD format following established conventions
  • Placed in the doc/pfcwd/ directory
  • References existing PFC Watchdog Design and Test Plan documents
  • Includes SAI API specifications and comprehensive testing requirements

Technical Highlights

  • Hardware timer granularity constraints and actual vs configured values
  • Runtime platform capability detection for automatic hardware/software selection
  • Backward compatibility with existing software implementation
  • Error handling and validation for hardware constraints
  • Fully functional Table of Contents with clickable navigation
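
To illustrate why an actual timer value can differ from the configured one, here is a minimal sketch of snapping a configured interval to a hardware granularity. It assumes the hardware rounds up to the next multiple of its granularity and rejects values outside its supported range; the HLD may define different rounding or limits, and `TimerCapability` and `resolve_timer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimerCapability:
    """Hypothetical description of a hardware timer's limits, in ms."""
    min_ms: int
    max_ms: int
    granularity_ms: int


def resolve_timer(configured_ms: int, cap: TimerCapability) -> int:
    """Return the value the hardware would actually use (assumed behavior).

    Raises ValueError when the configured value is outside the supported
    range, mirroring the validation of hardware constraints.
    """
    if not cap.min_ms <= configured_ms <= cap.max_ms:
        raise ValueError(
            f"{configured_ms} ms is outside [{cap.min_ms}, {cap.max_ms}] ms"
        )
    # Ceiling division: round up to the next multiple of the granularity.
    steps = -(-configured_ms // cap.granularity_ms)
    # Never exceed the hardware maximum after rounding.
    return min(steps * cap.granularity_ms, cap.max_ms)
```

With a capability of 100 to 1000 ms in 100 ms steps, a configured 250 ms would resolve to 300 ms under these assumptions, which is the kind of configured vs actual difference the new CLI output is meant to expose.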

Review Focus Areas

  • Architecture design and orchestrator refactoring approach
  • SAI attribute usage and hardware integration strategy
  • CLI design and user experience for hardware-specific information
  • Testing strategy and validation approach for accuracy improvements

@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

- Comprehensive design document for hardware-based PFC recovery
- Includes architecture refactoring proposal with PfcWdHwOrch
- New CLI command: show pfcwd status with hardware-specific information
- SAI attributes and implementation details for hardware recovery
- Flow diagrams for hardware vs software recovery decision and workflow
- Testing requirements and validation approach
- Hardware timer granularity constraints and actual vs configured values
- Comparison table highlighting differences between software and hardware recovery

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

- Changed 'CLI/YANG' to 'CLI-YANG' to fix markdown anchor link
- Forward slash in header was breaking clickable navigation

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>
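
For context on the anchor fix in the commit above, the sketch below approximates how GitHub-style heading anchors are derived: lowercase the heading, drop punctuation other than hyphens, and turn spaces into hyphens. It is an approximation for illustration, not GitHub's actual implementation, and the heading text used in the usage note is a made-up example.

```python
import re


def github_style_slug(heading: str) -> str:
    """Approximate a GitHub-style anchor slug for a markdown heading."""
    text = heading.strip().lower()
    # Drop every character that is not a word character, hyphen, or space.
    # A forward slash is removed here rather than converted to a hyphen.
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")
```

Under this approximation, a heading such as "CLI/YANG Model Enhancements" yields `cliyang-model-enhancements`, so a Table of Contents link written as `#cli-yang-model-enhancements` would not resolve, while renaming the heading to "CLI-YANG Model Enhancements" makes the two match.
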
- Updated TOC to reflect actual document structure
- Removed non-existent subsections 12.2, 12.3, 12.4
- Only section 12.1 exists in the streamlined testing section

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

- Better accuracy in storm detection and restoration timing
- Reduced latency compared to software polling-based approach
- Lower CPU overhead by eliminating continuous software polling
- Hardware operates at line rate without software delays

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

- Section 13 Open Points does not exist in the document
- Cleaned up Table of Contents to match actual document structure

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

- Added CLI Data Flow subsection to TOC for proper navigation
- Fixes parsing/navigation issue in section 5

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

- Simplified the display table description in mermaid flowchart
- Removed pipe characters that could cause parsing issues
- Improved readability of the CLI data flow diagram

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

- Created doc/pfcwd/ directory for PFC Watchdog documentation
- Moved pfc_hardware_recovery_hld.md from doc/qos/ to doc/pfcwd/
- Better organization following SONiC documentation structure
- Dedicated directory allows for future PFC Watchdog related documents

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>

@pinky-nexthop pinky-nexthop force-pushed the feature/pfc-hardware-recovery-hld branch from d793e08 to ee7c047 on December 16, 2025 at 11:41

- Added a STATUS column to the show pfcwd status command output to indicate
  whether programming succeeded or failed
- Updated CLI data flow diagram to include status validation
- Added example showing failed configuration due to unsupported timer ranges
- Updated hardware recovery workflow diagram to show event notification
  happening immediately after storm detection
- Added conditional logic for app_managed_recovery flag:
  - If true: wait for SAI_QUEUE_ATTR_PFC_DLR_INIT programming
  - If false: hardware automatically applies recovery action

Signed-off-by: Pinky Agrawal <pinky@nexthop.ai>
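
The app_managed_recovery branching described in the commit above can be modeled as a short ordered list of steps. This is a simplified sketch of the described flow, not the orchestrator code: the function name and step labels are hypothetical, and only SAI_QUEUE_ATTR_PFC_DLR_INIT comes from the commit message.

```python
from typing import List


def recovery_steps(app_managed_recovery: bool) -> List[str]:
    """List the workflow steps after a storm, per the updated diagram.

    The event notification happens immediately after storm detection in
    both cases; only the final step depends on the flag.
    """
    steps = ["storm_detected", "event_notification"]
    if app_managed_recovery:
        # The application drives recovery: hardware waits until
        # SAI_QUEUE_ATTR_PFC_DLR_INIT is programmed.
        steps.append("wait_for_SAI_QUEUE_ATTR_PFC_DLR_INIT")
    else:
        # Hardware applies the recovery action on its own.
        steps.append("hardware_applies_recovery_action")
    return steps
```
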
