Description
Priority: P2 (High - Operational Excellence)
Context
After fixing nested virtualization (Issue #8), we need comprehensive observability to detect and diagnose issues early.
Objective
Gain real-time visibility into Azure ML infrastructure and job health through monitoring, metrics, and alerting.
Implementation Plan: Week 2
Task 1: Integrate Azure Monitor [6 hours]
Collect infrastructure metrics from Azure ML compute:
New file: openadapt_evals/benchmarks/monitoring.py
from azure.monitor.query import MetricsQueryClient


class AzureMonitoringService:
    def get_compute_metrics(self, compute_name: str) -> dict:
        """Get CPU, memory, and disk metrics for a compute instance."""
        # Query the Azure Monitor API
        # Return a metrics dict
        raise NotImplementedError

    def check_job_health(self, job_name: str) -> bool:
        """Check whether the job is healthy (making progress)."""
        # Get recent logs (last 5 minutes)
        # Check for error patterns
        # Verify progress indicators
        raise NotImplementedError

Metrics to track:
- CPU utilization
- Memory utilization
- Disk utilization
- Container startup time
- Docker pull duration
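The three steps outlined for check_job_health (recent logs, error patterns, progress indicators) could be sketched as a pure function that is easy to test without Azure access. This is only a sketch under stated assumptions: the function name is_job_healthy, the (timestamp, message) log format, and the regex patterns are all hypothetical and would need to match the real job logs.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical patterns; real error and progress markers depend on the job's log format.
ERROR_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\berror\b", r"traceback", r"out of memory")
]
PROGRESS_PATTERN = re.compile(r"task \d+/\d+", re.IGNORECASE)


def is_job_healthy(log_entries, now=None, window=timedelta(minutes=5)):
    """Return True if logs inside the window show progress and no error pattern.

    log_entries: iterable of (timestamp, message) tuples with
    timezone-aware timestamps (an assumed format for this sketch).
    """
    now = now or datetime.now(timezone.utc)
    # Step 1: keep only log lines from the last `window` (5 minutes by default).
    recent = [msg for ts, msg in log_entries if now - ts <= window]
    # Step 2: any error pattern marks the job unhealthy.
    if any(p.search(msg) for msg in recent for p in ERROR_PATTERNS):
        return False
    # Step 3: at least one progress indicator must be present.
    return any(PROGRESS_PATTERN.search(msg) for msg in recent)
```

Passing `now` explicitly keeps the check deterministic in tests; in production the default of the current UTC time would apply.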
Task 2: Enhanced Live Dashboard [4 hours]
Extend existing LiveEvaluationTracker with infrastructure metrics.
Updates to: openadapt_evals/benchmarks/live_tracker.py
Add infrastructure metrics, performance tracking, and cost monitoring to the live dashboard.
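One possible shape for the extra dashboard data is a small record that carries utilization figures and derives a running cost estimate. The class InfrastructureSnapshot and its field names are hypothetical and are not part of the existing LiveEvaluationTracker API; the hourly rate is assumed to come from configuration rather than from any real Azure price.

```python
from dataclasses import dataclass, asdict


@dataclass
class InfrastructureSnapshot:
    """Hypothetical payload a dashboard could render next to task progress."""
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    elapsed_hours: float
    hourly_rate_usd: float  # assumed config value, not a real Azure price

    @property
    def estimated_cost_usd(self) -> float:
        # Simple linear estimate: elapsed compute time multiplied by the hourly rate.
        return round(self.elapsed_hours * self.hourly_rate_usd, 2)

    def to_dashboard_dict(self) -> dict:
        # Flatten the fields and include the derived cost for the viewer.
        data = asdict(self)
        data["estimated_cost_usd"] = self.estimated_cost_usd
        return data
```

A linear estimate like this ignores storage and network charges, so it would need to be checked against billing data before relying on the 5 percent accuracy target.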
Task 3: Alerting System [4 hours]
Implement critical and warning alerts:
Critical alerts (immediate action):
- Job stuck with no progress (greater than 10 minutes)
- Container startup timeout (greater than 10 minutes)
- Compute node unhealthy (CPU/Memory/Disk greater than 95 percent)
Warning alerts (monitor and escalate):
- Slow task execution (greater than 2x average)
- High retry rate (greater than 20 percent of jobs)
- Docker pull slow (greater than 5 minutes)
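The thresholds above can be expressed as a table of rules evaluated against a metrics dict. This is a minimal sketch: AlertRule, evaluate_alerts, and every key in the metrics dict are hypothetical names chosen for illustration, and only the threshold values come from the lists above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    name: str
    severity: str  # "critical" or "warning"
    triggered: Callable[[dict], bool]


# Metric keys are hypothetical; thresholds mirror the alert lists above.
RULES = [
    AlertRule("job_stuck", "critical",
              lambda m: m["minutes_since_progress"] > 10),
    AlertRule("container_startup_timeout", "critical",
              lambda m: m["container_startup_minutes"] > 10),
    AlertRule("compute_unhealthy", "critical",
              lambda m: max(m["cpu_percent"], m["memory_percent"], m["disk_percent"]) > 95),
    AlertRule("slow_task", "warning",
              lambda m: m["task_seconds"] > 2 * m["avg_task_seconds"]),
    AlertRule("high_retry_rate", "warning",
              lambda m: m["total_jobs"] > 0 and m["retried_jobs"] / m["total_jobs"] > 0.20),
    AlertRule("docker_pull_slow", "warning",
              lambda m: m["docker_pull_minutes"] > 5),
]


def evaluate_alerts(metrics: dict) -> list:
    """Return (severity, name) pairs for every rule the metrics trigger."""
    return [(rule.severity, rule.name) for rule in RULES if rule.triggered(metrics)]
```

Keeping the rules as data means a threshold change is a one-line edit, and the critical and warning tiers can be routed to different channels by filtering on severity.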
Deliverables
- monitoring.py with Azure Monitor integration
- Enhanced LiveEvaluationTracker with infrastructure metrics
- Alerting system with critical/warning rules
- Updated viewer HTML with new metrics dashboard
Success Criteria
- Real-time infrastructure metrics visible in dashboard
- Alerts trigger within 5 minutes of issues
- Cost tracking accurate within 5 percent
- Dashboard shows all 3 metric layers (infrastructure, job, application)
Time Estimate
Total: 14 hours (2-3 days)
Multi-Layer Monitoring Architecture
Layer 1: Infrastructure Monitoring (Azure Monitor) - VM health, CPU, memory, disk
Layer 2: Job Lifecycle Monitoring - Job submission to completion
Layer 3: Application Monitoring (WAA tasks) - Task success/failure rates
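As a rough illustration of how the three layers might be combined into one payload for the viewer, the sketch below groups them in a single record. MonitoringSnapshot, task_success_rate, and the dict keys in the usage example are assumptions made for this example, not existing code.

```python
import json
from dataclasses import dataclass, asdict, field


def task_success_rate(succeeded: int, failed: int) -> float:
    """Fraction of finished tasks that succeeded; 0.0 when nothing has finished."""
    total = succeeded + failed
    return succeeded / total if total else 0.0


@dataclass
class MonitoringSnapshot:
    # Layer 1: infrastructure (VM health, CPU, memory, disk)
    infrastructure: dict = field(default_factory=dict)
    # Layer 2: job lifecycle (submission to completion)
    job: dict = field(default_factory=dict)
    # Layer 3: application (WAA task success/failure rates)
    application: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Serialize all three layers into one document for the dashboard.
        return json.dumps(asdict(self), sort_keys=True)


snapshot = MonitoringSnapshot(
    infrastructure={"cpu_percent": 40.0},
    job={"state": "Running"},
    application={"success_rate": task_success_rate(3, 1)},
)
```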
Implementation Guide
Reference: /tmp/AZURE_LONG_TERM_SOLUTION.md - Sections 5 and 7 (Phase 2)
Why This Matters
- Early detection of infrastructure issues
- Proactive alerting vs. reactive debugging
- Cost visibility and control
- Foundation for self-healing infrastructure
Dependencies
- Requires Issue #8, "[P1] Implement Azure nested virtualization fix (Week 1)", to be complete
- Azure Monitor API access configured
Related
- Week 1 implementation: Issue #8, "[P1] Implement Azure nested virtualization fix (Week 1)"
- Cost optimization: Issue #9, "[P3] Azure cost optimization (Week 5)"
- Long-term solution: /tmp/AZURE_LONG_TERM_SOLUTION.md