[P2] Azure observability and monitoring (Week 2) #10

@abrichr

Priority: P2 (High - Operational Excellence)

Context

After fixing nested virtualization (Issue #8), we need comprehensive observability to detect and diagnose issues early.

Objective

Gain real-time visibility into Azure ML infrastructure and job health through monitoring, metrics, and alerting.

Implementation Plan: Week 2

Task 1: Integrate Azure Monitor [6 hours]

Collect infrastructure metrics from Azure ML compute:

New file: openadapt_evals/benchmarks/monitoring.py

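from azure.monitor.query import MetricsQueryClient


class AzureMonitoringService:
    """Collects infrastructure metrics and job health signals for Azure ML compute."""

    def get_compute_metrics(self, compute_name: str) -> dict:
        """Return CPU, memory, and disk metrics for a compute instance.

        Planned: query the Azure Monitor API (MetricsQueryClient) and
        return the results as a metrics dict.
        """
        raise NotImplementedError

    def check_job_health(self, job_name: str) -> bool:
        """Return True if the job is healthy (making progress).

        Planned: get recent logs (last 5 minutes), check for error
        patterns, and verify progress indicators.
        """
        raise NotImplementedError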
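
One possible shape for the check_job_health logic, as a rough sketch only. It assumes logs are already available as (timestamp, message) pairs; the error patterns and the "task N/M" progress format are placeholders, not the real log format, and is_job_healthy is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta, timezone

# Placeholder patterns: the real error and progress formats are not specified here.
ERROR_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\btraceback\b", r"\bout of memory\b", r"\bfatal\b")
]
PROGRESS_PATTERN = re.compile(r"task \d+/\d+", re.IGNORECASE)


def is_job_healthy(log_entries, now, window=timedelta(minutes=5)) -> bool:
    """Return True if recent logs show progress and no known error pattern.

    log_entries: iterable of (timestamp, message) pairs.
    """
    # Keep only messages from the last `window` (5 minutes by default).
    recent = [msg for ts, msg in log_entries if now - ts <= window]
    # Any error pattern in the window marks the job unhealthy.
    if any(p.search(msg) for msg in recent for p in ERROR_PATTERNS):
        return False
    # Otherwise require at least one progress indicator in the window.
    return any(PROGRESS_PATTERN.search(msg) for msg in recent)
```

A job with no log lines in the window is reported as unhealthy, which matches the "stuck with no progress" case rather than treating silence as success.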

Metrics to track:

  • CPU utilization
  • Memory utilization
  • Disk utilization
  • Container startup time
  • Docker pull duration
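
A minimal sketch of how the CPU/memory/disk values might be pulled with the azure-monitor-query package. The resource URI and metric names passed in are assumptions (the exact metric names exposed for Azure ML compute need to be confirmed), and latest_averages / query_compute_metrics are hypothetical helper names. Container startup time and Docker pull duration are not covered here; they would likely have to come from job logs.

```python
from datetime import timedelta


def latest_averages(response) -> dict:
    """Return {metric name: most recent non-null average} from a metrics response.

    Assumes data points within each time series arrive in chronological order.
    """
    latest = {}
    for metric in response.metrics:
        for series in metric.timeseries:
            for point in series.data:
                if point.average is not None:
                    # Later points overwrite earlier ones.
                    latest[metric.name] = point.average
    return latest


def query_compute_metrics(resource_uri: str, metric_names: list) -> dict:
    """Query the last 5 minutes of averages for the given resource."""
    # Imported lazily so the parsing helper above works without the SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.monitor.query import MetricAggregationType, MetricsQueryClient

    client = MetricsQueryClient(DefaultAzureCredential())
    response = client.query_resource(
        resource_uri,
        metric_names=metric_names,  # placeholder names supplied by the caller
        timespan=timedelta(minutes=5),
        granularity=timedelta(minutes=1),
        aggregations=[MetricAggregationType.AVERAGE],
    )
    return latest_averages(response)
```

Keeping the parsing separate from the SDK call means the dashboard code can be unit-tested with a fake response object.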

Task 2: Enhanced Live Dashboard [4 hours]

Extend existing LiveEvaluationTracker with infrastructure metrics.

Updates to: openadapt_evals/benchmarks/live_tracker.py

Add infrastructure metrics, performance tracking, and cost monitoring to the live dashboard.

Task 3: Alerting System [4 hours]

Implement critical and warning alerts:

Critical alerts (immediate action):

  • Job stuck with no progress (greater than 10 minutes)
  • Container startup timeout (greater than 10 minutes)
  • Compute node unhealthy (CPU/Memory/Disk greater than 95 percent)

Warning alerts (monitor and escalate):

  • Slow task execution (greater than 2x average)
  • High retry rate (greater than 20 percent of jobs)
  • Docker pull slow (greater than 5 minutes)
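
The rules above could be expressed roughly as follows. This is a sketch, not the final alerting design: JobSnapshot and its field names are hypothetical, and only the thresholds come from the lists above.

```python
from dataclasses import dataclass


@dataclass
class JobSnapshot:
    """Hypothetical point-in-time readings; field names are illustrative."""
    minutes_since_progress: float
    container_startup_minutes: float
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    task_seconds: float
    average_task_seconds: float
    retry_rate: float  # fraction of jobs retried, 0.0 to 1.0
    docker_pull_minutes: float


def evaluate_alerts(s: JobSnapshot) -> list:
    """Return (severity, message) pairs for every rule the snapshot violates."""
    alerts = []
    # Critical rules: immediate action.
    if s.minutes_since_progress > 10:
        alerts.append(("critical", "job stuck with no progress"))
    if s.container_startup_minutes > 10:
        alerts.append(("critical", "container startup timeout"))
    if max(s.cpu_percent, s.memory_percent, s.disk_percent) > 95:
        alerts.append(("critical", "compute node unhealthy"))
    # Warning rules: monitor and escalate.
    if s.average_task_seconds > 0 and s.task_seconds > 2 * s.average_task_seconds:
        alerts.append(("warning", "slow task execution"))
    if s.retry_rate > 0.20:
        alerts.append(("warning", "high retry rate"))
    if s.docker_pull_minutes > 5:
        alerts.append(("warning", "docker pull slow"))
    return alerts
```

Returning plain (severity, message) pairs keeps delivery (dashboard banner, email, webhook) out of the rule logic.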

Deliverables

  • monitoring.py with Azure Monitor integration
  • Enhanced LiveEvaluationTracker with infrastructure metrics
  • Alerting system with critical/warning rules
  • Updated viewer HTML with new metrics dashboard

Success Criteria

  • Real-time infrastructure metrics visible in dashboard
  • Alerts trigger within 5 minutes of issues
  • Cost tracking accurate within 5 percent
  • Dashboard shows all 3 metric layers (infrastructure, job, application)
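
For the cost criterion, one way to check the 5 percent target is to compare the dashboard estimate against the billed amount. This is a sketch under simple assumptions: cost is modeled as hourly rate times hours times node count, and both function names and the rate are illustrative, not taken from Azure pricing.

```python
def estimate_cost(hourly_rate_usd: float, hours: float, node_count: int = 1) -> float:
    """Estimate compute cost as rate x hours x nodes (ignores storage and egress)."""
    return hourly_rate_usd * hours * node_count


def within_tolerance(estimated: float, billed: float, tolerance: float = 0.05) -> bool:
    """Return True if the estimate is within `tolerance` (relative) of the billed cost."""
    if billed == 0:
        return estimated == 0
    return abs(estimated - billed) <= tolerance * abs(billed)
```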

Time Estimate

Total: 14 hours (2-3 days)

Multi-Layer Monitoring Architecture

Layer 1: Infrastructure Monitoring (Azure Monitor) - VM health, CPU, memory, disk
Layer 2: Job Lifecycle Monitoring - Job submission to completion
Layer 3: Application Monitoring (WAA tasks) - Task success/failure rates
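
A rough sketch of how the three layers might be combined into one payload for the dashboard. All class names, field names, and the payload shape are assumptions for illustration; the existing LiveEvaluationTracker format may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InfrastructureLayer:  # Layer 1: VM health
    cpu_percent: float
    memory_percent: float
    disk_percent: float


@dataclass
class JobLayer:  # Layer 2: job lifecycle
    job_name: str
    status: str


@dataclass
class ApplicationLayer:  # Layer 3: WAA task outcomes
    tasks_succeeded: int
    tasks_failed: int

    @property
    def success_rate(self) -> float:
        total = self.tasks_succeeded + self.tasks_failed
        return self.tasks_succeeded / total if total else 0.0


def build_snapshot(infra: InfrastructureLayer, job: JobLayer, app: ApplicationLayer) -> dict:
    """Merge the three layers into one JSON-serializable dict."""
    application = asdict(app)
    # asdict() skips properties, so the derived rate is added explicitly.
    application["success_rate"] = app.success_rate
    return {
        "infrastructure": asdict(infra),
        "job": asdict(job),
        "application": application,
    }
```

The result can be passed to json.dumps and written wherever the viewer HTML reads its data from.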

Implementation Guide

Reference: /tmp/AZURE_LONG_TERM_SOLUTION.md - Sections 5 and 7 (Phase 2)

Why This Matters

  • Early detection of infrastructure issues
  • Proactive alerting vs. reactive debugging
  • Cost visibility and control
  • Foundation for self-healing infrastructure

Dependencies

Related
