| 1 | +# Observability |
| 2 | + |
| 3 | +UCM (Unified Cache Management) exposes detailed metrics through a Prometheus endpoint for monitoring cache performance and behavior. This document describes how to enable and configure observability through the embedded vLLM `/metrics` API endpoint. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## Quick Start Guide |
| 8 | + |
| 9 | +### 1) On UCM Side |
| 10 | + |
| 11 | +First, set the `PROMETHEUS_MULTIPROC_DIR` environment variable. |
| 12 | + |
| 13 | +```bash |
| 14 | +export PROMETHEUS_MULTIPROC_DIR=/vllm-workspace |
| 15 | +``` |
| 16 | + |
| 17 | +Then, start the UCM service. |
| 18 | + |
| 19 | +```bash |
| 20 | +export CUDA_VISIBLE_DEVICES=0 |
| 21 | +vllm serve /home/models/Qwen2.5-14B-Instruct \ |
| 22 | + --max-model-len 5000 \ |
| 23 | + --tensor-parallel-size 1 \ |
| 24 | + --gpu-memory-utilization 0.87 \ |
| 25 | + --trust-remote-code \ |
| 26 | + --disable-log-requests \ |
| 27 | + --no-enable-prefix-caching \ |
| 28 | + --enforce-eager \ |
| 29 | + --max-num-batched-tokens 40000 \ |
| 30 | + --max-num-seqs 10 \ |
| 31 | + --host 0.0.0.0 \ |
| 32 | + --port 8000 \ |
| 33 | + --kv-transfer-config \ |
| 34 | + '{ |
| 35 | + "kv_connector": "UCMConnector", |
| 36 | + "kv_connector_module_path": "ucm.integration.vllm.ucm_connector", |
| 37 | + "kv_role": "kv_both", |
| 38 | + "kv_connector_extra_config": { |
| 39 | + "UCM_CONFIG_FILE": "/vllm-workspace/unified-cache-management/examples/ucm_config.yaml" |
| 40 | + } |
| 41 | + }' |
| 42 | +``` |
| 43 | +**Note**: You can refer to the `ucm_config.yaml` file at https://github.com/ModelEngine-Group/unified-cache-management/tree/develop/examples to configure the `metrics_config_path` parameter. |
| 44 | + |
| 45 | +You can use the `vllm bench serve` command to run benchmarks: |
| 46 | + |
| 47 | +```bash |
| 48 | +vllm bench serve \ |
| 49 | + --backend vllm \ |
| 50 | + --model /home/models/Qwen2.5-14B-Instruct \ |
| 51 | + --host 127.0.0.1 \ |
| 52 | + --port 8000 \ |
| 53 | + --dataset-name random \ |
| 54 | + --num-prompts 20 \ |
| 55 | + --random-input-len 200 \ |
| 56 | + --random-output-len 10 \ |
| 57 | + --request-rate 1 \ |
| 58 | + --ignore-eos |
| 59 | +``` |
| 60 | + |
| 61 | +Once the HTTP server is running, you can access the UCM metrics at the `/metrics` endpoint. |
| 62 | + |
| 63 | +```bash |
| 64 | +curl http://<vllm-worker-ip>:8000/metrics | grep ucm: |
| 65 | +``` |
| 66 | + |
| 67 | +You will also find some `.db` files in the `$PROMETHEUS_MULTIPROC_DIR` directory. These are temporary files written by the Prometheus client in multiprocess mode. |
| 68 | + |
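As a rough illustration of what the `curl ... | grep ucm:` step does, the sketch below filters `ucm:`-prefixed samples out of Prometheus text exposition output. The function name `filter_ucm_samples` and the `SAMPLE` text are hypothetical and made up for this example; real output from your server will contain different values and may carry labels or timestamps that this sketch does not handle specially.

```python
# Hypothetical sample of /metrics output; the values are invented for illustration.
SAMPLE = """\
# HELP ucm:load_duration Time to load KV cache from UCM (milliseconds)
# TYPE ucm:load_duration histogram
ucm:load_duration_sum 120.0
ucm:load_duration_count 4.0
vllm:num_requests_running 1.0
"""


def filter_ucm_samples(metrics_text, prefix="ucm:"):
    """Return {sample_name: value} for every sample line starting with prefix.

    Assumes each sample line ends with its value (no trailing timestamp).
    """
    samples = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        if not line.startswith(prefix):
            continue
        # Everything before the last space is the sample name (plus labels, if any).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


if __name__ == "__main__":
    for name, value in filter_ucm_samples(SAMPLE).items():
        print(name, value)
```

In practice you would pass the body of the HTTP response from `/metrics` instead of `SAMPLE`.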
| 69 | +### 2) Start Prometheus and Grafana with Docker Compose |
| 70 | + |
| 71 | +#### Create Docker Compose Configuration Files |
| 72 | + |
| 73 | +First, create the `docker-compose.yaml` file: |
| 74 | + |
| 75 | +```yaml |
| 76 | +# docker-compose.yaml |
| 77 | +version: "3" |
| 78 | + |
| 79 | +services: |
| 80 | + prometheus: |
| 81 | + image: prom/prometheus:latest |
| 82 | + extra_hosts: |
| 83 | + - "host.docker.internal:host-gateway" |
| 84 | + ports: |
| 85 | + - "9090:9090" |
| 86 | + volumes: |
| 87 | + - ${PWD}/prometheus.yaml:/etc/prometheus/prometheus.yml |
| 88 | + |
| 89 | + grafana: |
| 90 | + image: grafana/grafana:latest |
| 91 | + depends_on: |
| 92 | + - prometheus |
| 93 | + ports: |
| 94 | + - "3000:3000" |
| 95 | +``` |
| 96 | + |
| 97 | +Then, create the `prometheus.yaml` configuration file: |
| 98 | + |
| 99 | +```yaml |
| 100 | +# prometheus.yaml |
| 101 | +global: |
| 102 | + scrape_interval: 5s |
| 103 | + evaluation_interval: 30s |
| 104 | + |
| 105 | +scrape_configs: |
| 106 | + - job_name: vllm |
| 107 | + static_configs: |
| 108 | + - targets: |
| 109 | + - 'host.docker.internal:8000' |
| 110 | +``` |
| 111 | + |
| 112 | +**Note**: Make sure the port number in `prometheus.yaml` matches the port number used when starting the vLLM service. |
| 113 | + |
| 114 | +#### Start Services |
| 115 | + |
| 116 | +Run the following command in the directory containing `docker-compose.yaml` and `prometheus.yaml`: |
| 117 | + |
| 118 | +```bash |
| 119 | +docker compose up |
| 120 | +``` |
| 121 | + |
| 122 | +This starts the Prometheus and Grafana services. |
| 123 | + |
| 124 | +### 3) Configure Grafana Dashboard |
| 125 | + |
| 126 | +#### Access Grafana |
| 127 | + |
| 128 | +Navigate to `http://<your-host>:3000`. Log in with the default username (`admin`) and password (`admin`). You will be prompted to change the password on first login. |
| 129 | + |
| 130 | +#### Add Prometheus Data Source |
| 131 | + |
| 132 | +1. Navigate to `http://<your-host>:3000/connections/datasources/new` and select **Prometheus**. |
| 133 | + |
| 134 | +2. On the Prometheus configuration page, add the Prometheus server URL in the **Connection** section. In this Docker Compose setup, Grafana and Prometheus run in separate containers, but Compose makes each service reachable by its service name, so you can use `http://prometheus:9090` directly. |
| 135 | + |
| 136 | +3. Click **Save & Test**. You should see a green checkmark showing "Successfully queried the Prometheus API." |
| 137 | + |
| 138 | +#### Import Dashboard |
| 139 | + |
| 140 | +1. Navigate to `http://<your-host>:3000/dashboard/import`. |
| 141 | + |
| 142 | +2. Click **Upload JSON file**, then upload the `unified-cache-management/examples/metrics/grafana.json` file. |
| 143 | + |
| 144 | +3. Select the Prometheus data source configured earlier. |
| 145 | + |
| 146 | +4. Click **Import** to complete the import. |
| 147 | + |
| 148 | +You should now be able to see the UCM monitoring dashboard with real-time visualization of all 9 metrics. |
| 149 | + |
| 150 | +## Available Metrics |
| 151 | + |
| 152 | +UCM exposes various metrics to monitor its performance. The following table lists all available metrics organized by category: |
| 153 | + |
| 154 | +| Metric Name | Type | Description | |
| 155 | +|------------|------|-------------| |
| 156 | +| **Load Operation Metrics** | | | |
| 157 | +| `ucm:load_requests_num` | Histogram | Number of requests loaded per `start_load_kv` call | |
| 158 | +| `ucm:load_blocks_num` | Histogram | Number of blocks loaded per `start_load_kv` call | |
| 159 | +| `ucm:load_duration` | Histogram | Time to load KV cache from UCM (milliseconds) | |
| 160 | +| `ucm:load_speed` | Histogram | Speed of loading from UCM (GB/s) | |
| 161 | +| **Save Operation Metrics** | | | |
| 162 | +| `ucm:save_requests_num` | Histogram | Number of requests saved per `wait_for_save` call | |
| 163 | +| `ucm:save_blocks_num` | Histogram | Number of blocks saved per `wait_for_save` call | |
| 164 | +| `ucm:save_duration` | Histogram | Time to save to UCM (milliseconds) | |
| 165 | +| `ucm:save_speed` | Histogram | Speed of saving to UCM (GB/s) | |
| 166 | +| **Lookup Hit Rate Metrics** | | | |
| 167 | +| `ucm:interval_lookup_hit_rates` | Histogram | Hit rate of UCM lookup requests | |
| 168 | + |
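Because every metric in the table is a histogram, Prometheus exposes each one as `<name>_bucket`, `<name>_sum`, and `<name>_count` series. The sketch below shows one way to derive an average from the `_sum` and `_count` samples. The helper name `histogram_mean` and the input values are hypothetical, and the sketch assumes the samples carry no labels.

```python
from typing import Dict, Optional


def histogram_mean(samples: Dict[str, float], metric: str) -> Optional[float]:
    """Average observed value of a histogram: <metric>_sum / <metric>_count.

    Returns None when the metric is missing or has no observations yet.
    """
    total = samples.get(metric + "_sum")
    count = samples.get(metric + "_count")
    if total is None or not count:
        return None
    return total / count


if __name__ == "__main__":
    # Invented values: four load operations totalling 120 ms.
    scraped = {"ucm:load_duration_sum": 120.0, "ucm:load_duration_count": 4.0}
    print(histogram_mean(scraped, "ucm:load_duration"))
```

This gives the mean over the lifetime of the process. For a windowed view, a dashboard would typically divide the rates of the two series over a time range instead.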
| 169 | +## Prometheus Configuration |
| 170 | + |
| 171 | +Metrics configuration is defined in the `unified-cache-management/examples/metrics/metrics_configs.yaml` file: |
| 172 | + |
| 173 | +```yaml |
| 174 | +log_interval: 5 # Interval in seconds for logging metrics |
| 175 | + |
| 176 | +prometheus: |
| 177 | + multiproc_dir: "/vllm-workspace" # Prometheus directory |
| 178 | + metric_prefix: "ucm:" # Metric name prefix |
| 179 | + |
| 180 | + enabled_metrics: |
| 181 | + counters: true |
| 182 | + gauges: true |
| 183 | + histograms: true |
| 184 | + |
| 185 | + histograms: |
| 186 | + - name: "load_requests_num" |
| 187 | + documentation: "Number of requests loaded from ucm" |
| 188 | + buckets: [1, 5, 10, 20, 50, 100, 200, 500, 1000] |
| 189 | + # ... other metric configurations |
| 190 | +``` |
| 191 | + |
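To make the `buckets` setting more concrete, the sketch below mimics how Prometheus histogram buckets are counted: each bucket is cumulative and counts observations less than or equal to its upper bound, with an implicit `+Inf` bucket at the end. The function `cumulative_bucket_counts` is a hypothetical helper written for illustration and is not part of UCM or the Prometheus client.

```python
import math


def cumulative_bucket_counts(observations, buckets):
    """Count observations per cumulative bucket, Prometheus-style.

    Each key is a bucket's upper bound (`le`); the value is the number of
    observations less than or equal to that bound. A +Inf bucket is appended.
    """
    bounds = list(buckets)
    if bounds != sorted(bounds):
        raise ValueError("bucket bounds must be in ascending order")
    bounds.append(math.inf)
    return {le: sum(1 for v in observations if v <= le) for le in bounds}


if __name__ == "__main__":
    # Bucket bounds copied from the load_requests_num example above.
    buckets = [1, 5, 10, 20, 50, 100, 200, 500, 1000]
    # Invented observations for illustration.
    counts = cumulative_bucket_counts([1, 3, 7, 2000], buckets)
    for le, n in counts.items():
        print(le, n)
```

Values above the largest configured bound land only in the `+Inf` bucket, so choose bounds that cover the range you expect to observe.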
| 192 | +--- |
| 193 | + |