A clean framework for running GPU workloads with real-time power monitoring, analysis, and carbon footprint estimation.
carma_framework/
├── carma.sh # Main entry point script
├── env # Configuration file
├── analysis.sh # Real-time analysis script
├── pstree_logger.sh # Process tree logger
├── cleanup_all.sh # Cleanup script
├── sample_experiment_ft_small_ft_large.json # Example experiment config
├── monitors/ # Monitoring scripts
│ ├── nvidia_smi_power.sh # NVIDIA GPU power monitor
│ └── perf_monitor.sh # CPU performance monitor
├── wattsup_logger/ # Power meter logger
│ ├── wattsup_logger.sh # Main logger script
│ └── pm.py # Python power meter interface
├── workloads/ # Workload execution scripts
│ ├── run_workloads.py # Workload orchestrator
│ ├── finetune_model.py # Model finetuning workload
│ ├── inference_workload.sh # Inference workload
│ ├── requirements.txt # Python dependencies
│ └── configs/ # Workload configuration files
├── models/ # ML models for power prediction
│ ├── baseline/ # Baseline power prediction model
│ │ └── baseline.joblib
│ └── delta/ # Delta power prediction model
│ └── delta.joblib
└── runs/ # Output directory (created automatically)
cd /home/dhalder/sassy/gpu/carma_framework
./carma.sh sample_experiment_ft_small_ft_large.json

The experiment JSON file should have the following structure:
{
  "experiment_name": "experiment_name",
  "description": "Description of the experiment",
  "workloads": [
    {
      "name": "workload_name",
      "type": "finetune",
      "gpu": true,
      "script": "python3",
      "args": [
        "workloads/finetune_model.py",
        "--model_name", "model_name",
        "--max_steps", "60"
      ],
      "cwd": ".",
      "env": {"CUDA_VISIBLE_DEVICES": "0"}
    }
  ]
}
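Since jq is already a required dependency, it can double as a quick sanity check on an experiment file before a run; this optional one-liner assumes only the field names shown in the structure above:

```bash
# List each workload's name and type from the experiment config.
jq -r '.workloads[] | "\(.name)\t\(.type)"' sample_experiment_ft_small_ft_large.json
```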
- carma.sh is the main orchestrator that:
  - Starts monitoring scripts (nvidia-smi, perf, wattsup)
  - Launches workloads via run_workloads.py
  - Starts process tree logging
  - Runs real-time analysis during experiment execution
  - After workloads complete, runs the post-processing pipeline:
    - create_dataset.sh: Processes nsys profiles and GPU data
    - predict_power.py: Predicts power consumption from the processed data
    - analysis.sh: Calculates final carbon emissions for each job
  - Handles cleanup on exit
- Monitors collect data in parallel (see the sampling sketch after this list):
  - nvidia_smi_power.sh: GPU power consumption
  - perf_monitor.sh: CPU performance counters
  - wattsup_logger.sh: System-wide power consumption
- Real-time analysis (analysis.sh) processes data while the experiment runs:
  - Correlates process trees with performance counters
  - Apportions power consumption to individual workloads
  - Calculates operational and embodied carbon in real time (see the worked example after this list)
- Post-processing pipeline (runs automatically after the experiment):
  - create_dataset.sh: Extracts GPU metrics from nsys profiles (.sqlite files) and processes nvidia-smi data
  - predict_power.py: Uses the trained ML models (from the models/ directory) to predict power consumption per job
  - analysis.sh: Calculates final carbon emissions using the predicted power data
- Output is written to runs/<experiment_name>_<timestamp>/ (see the inspection example after this list):
  - workload_pids.json: Mapping of workload names to PIDs
  - pstree.log: Process tree snapshots
  - perf.log: CPU performance counters
  - nvidia_smi_power.csv: GPU power data
  - wattsup_log.csv: System power data
  - event_summaries.jsonl: Real-time analysis results
  - event_counts/: Processed GPU metrics from nsys profiles
  - Job-specific CSV files with predicted power and final carbon calculations
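For a sense of what the GPU monitor records, here is a minimal sketch of the kind of nvidia-smi polling that nvidia_smi_power.sh wraps; the actual script's query fields, output location, and sampling interval may differ:

```bash
# Poll GPU power draw and utilization once per second as CSV rows.
nvidia-smi --query-gpu=timestamp,index,power.draw,utilization.gpu \
           --format=csv,noheader -l 1
```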
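The operational part of the carbon calculation boils down to energy (kWh) multiplied by a grid carbon-intensity factor (gCO2e/kWh). Here is a back-of-the-envelope example with made-up numbers (250 W average draw for one hour on a 400 gCO2e/kWh grid); analysis.sh itself works from the measured and predicted per-job power data and also accounts for embodied carbon:

```bash
# energy_kWh = avg_power_W * duration_s / 3,600,000; carbon_g = energy_kWh * intensity_g_per_kWh
awk 'BEGIN {
  power_w = 250; duration_s = 3600; intensity = 400
  kwh = power_w * duration_s / 3.6e6
  printf "%.3f kWh -> %.0f gCO2e\n", kwh, kwh * intensity   # prints: 0.250 kWh -> 100 gCO2e
}'
```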
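To poke around a finished run, the JSON and JSONL outputs can be browsed with jq. This sketch assumes only the file names listed above and pretty-prints them without relying on any particular field layout:

```bash
# Pick the most recently created run directory.
RUN_DIR=$(ls -dt runs/*/ | head -n 1)

# Pretty-print the workload-name -> PID mapping.
jq . "${RUN_DIR}workload_pids.json"

# Show the last few real-time analysis records (one JSON object per line).
tail -n 3 "${RUN_DIR}event_summaries.jsonl" | jq .
```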
Edit env to configure:
- Monitor scripts to run
- NSYS profiler settings
- Sudo password file location (optional)
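As a purely hypothetical sketch of what env might contain (the real variable names are defined by the framework and may differ; treat this only as an illustration of the three knobs above):

```bash
# Monitor scripts carma.sh should start (assumed variable name).
MONITORS="monitors/nvidia_smi_power.sh monitors/perf_monitor.sh wattsup_logger/wattsup_logger.sh"

# NSYS profiler settings (assumed variable names).
NSYS_ENABLED=1
NSYS_ARGS="--trace=cuda,nvtx"

# Optional: file holding the sudo password used by the monitors (assumed variable name).
SUDO_PASSWORD_FILE=""
```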
The framework requires:

- Python 3 with required packages (see workloads/requirements.txt)
- NVIDIA drivers and nvidia-smi
- perf (Linux performance monitoring)
- nsys (NVIDIA Nsight Systems, optional)
- jq (JSON processor)
- wattsup power meter (optional, for physical power measurement)
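Installing the Python dependencies and confirming the command-line tools are on PATH can be done once before the first run:

```bash
# Install the Python packages used by the workload scripts.
pip install -r workloads/requirements.txt

# Check that the external tools are available (nsys and the wattsup meter are optional).
command -v nvidia-smi perf nsys jq
```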
Notes:

- All paths in the framework are relative to the carma_framework directory
- The tool should be run from the carma_framework directory
- Workload scripts in JSON configs should use paths relative to the framework root (e.g., workloads/finetune_model.py)