Add support for NVMe drives in smartmon.py #2035

technowhizz · 2025-12-12T11:00:50Z

Adds support for collecting SMART metrics from NVMe drives with the use of pysmart and smartctl JSON output. Includes updates to the deployment playbooks, tests, and dashboards.

gemini-code-assist

Code Review

This pull request adds support for monitoring NVMe drives by integrating smartctl's JSON output into smartmon.py, removing the dependency on nvme-cli. The changes are comprehensive, covering Ansible playbooks, the monitoring script, tests, and Grafana dashboards. The implementation is generally robust, with good error handling and updated tests. However, I've identified a few issues that could impact monitoring coverage and data accuracy. A logic change in smartmon.py might cause some disks to be skipped from monitoring. Additionally, some Prometheus queries in the updated Grafana dashboards are missing necessary filters, which could lead to incorrect data aggregation and display. I have provided specific suggestions to address these points.

gemini-code-assist · 2025-12-12T11:03:35Z

etc/kayobe/ansible/scripts/smartmon.py

+        if not serial_number or not dev.assessment:
+            continue


The script now skips devices if dev.assessment is not available. This is too restrictive and could lead to a loss of monitoring for devices where pySMART fails to determine the health assessment, even if other important metrics like temperature are available. The parse_device_info function already handles cases where assessment is None. To avoid losing visibility on these drives, we should only skip devices if they lack a serial number, which is essential for unique identification.

Suggested change

if not serial_number or not dev.assessment:

continue

if not serial_number:

continue

gemini-code-assist · 2025-12-12T11:03:35Z

etc/kayobe/kolla/config/grafana/dashboards/openstack/nvme.json

+        "definition": "label_values(smartmon_data_units_read,serial_number)",
        "includeAll": false,
        "label": "Serial Number",
        "name": "serial_number",
        "options": [],
        "query": {
          "qryType": 1,
-          "query": "label_values(nvme_data_units_read_total,serial_number)",
+          "query": "label_values(smartmon_data_units_read,serial_number)",
          "refId": "PrometheusVariableQueryEditor-VariableQuery"


The template variable serial_number is populated by querying for label_values(smartmon_data_units_read,serial_number), which will include serial numbers from all disk types, not just NVMe drives. This will cause non-NVMe drives to appear in the dropdown, and selecting them will result in a broken or empty dashboard. The query should be filtered to only include NVMe devices.

"definition": "label_values(smartmon_data_units_read{type=\"nvme\"},serial_number)", "includeAll": false, "label": "Serial Number", "name": "serial_number", "options": [], "query": { "qryType": 1, "query": "label_values(smartmon_data_units_read{type=\"nvme\"},serial_number)", "refId": "PrometheusVariableQueryEditor-VariableQuery" },

gemini-code-assist · 2025-12-12T11:03:35Z

etc/kayobe/kolla/config/grafana/dashboards/openstack/hardware_overview.json

          "editorMode": "code",
-          "expr": "delta(nvme_data_units_written_total{instance=~\"$node\"}[24h])*512000 / nvme_physical_size_bytes{instance=~\"$node\"}",
-          "legendFormat": "{{instance}} - {{device}}",
+          "expr": "delta(smartmon_data_units_written{instance=~\"$node\"}[24h])*512000 / smartmon_physical_size_bytes{instance=~\"$node\"}",


The DWPD (Disk Writes Per Day) panel is intended for NVMe drives, but the Prometheus query is missing a type="nvme" filter. Without this filter, the query might incorrectly include metrics from non-NVMe drives if they also export smartmon_data_units_written and smartmon_physical_size_bytes, leading to inaccurate DWPD calculations.

"expr": "delta(smartmon_data_units_written{instance=~\"$node\", type=\"nvme\"}[24h])*512000 / smartmon_physical_size_bytes{instance=~\"$node\", type=\"nvme\"}",

gemini-code-assist · 2025-12-12T11:03:35Z

etc/kayobe/ansible/scripts/smartmon.py

+    if isinstance(data, dict):
+        if isinstance(data.get("stackhpc_dwpd_ratings"), list):
+            data_iterable = data["stackhpc_dwpd_ratings"]
+        elif isinstance(data.get("dwpd_values"), list):
+            data_iterable = data["dwpd_values"]
+        else:
+            data_iterable = []
+    elif isinstance(data, list):
+        data_iterable = data
+    else:
+        data_iterable = []


The logic to determine data_iterable from the loaded JSON data is a bit verbose. It can be simplified for better readability and maintainability while achieving the same result.

Suggested change

if isinstance(data, dict):

if isinstance(data.get("stackhpc_dwpd_ratings"), list):

data_iterable = data["stackhpc_dwpd_ratings"]

elif isinstance(data.get("dwpd_values"), list):

data_iterable = data["dwpd_values"]

else:

data_iterable = []

elif isinstance(data, list):

data_iterable = data

else:

data_iterable = []

data_iterable = []

if isinstance(data, list):

data_iterable = data

elif isinstance(data, dict):

ratings = data.get("stackhpc_dwpd_ratings", data.get("dwpd_values"))

if isinstance(ratings, list):

data_iterable = ratings

gemini-code-assist · 2025-12-12T11:03:35Z

etc/kayobe/ansible/scripts/smartmon.py

+    except Exception:
+        nvme_json = {}


Catching a broad Exception can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions that are expected from the smartctl command execution and JSON parsing, such as errors related to subprocess execution, file not found, or JSON decoding.

Suggested change

except Exception:

nvme_json = {}

except (subprocess.SubprocessError, json.JSONDecodeError, FileNotFoundError, OSError):

nvme_json = {}

gemini-code-assist · 2025-12-12T11:03:36Z

etc/kayobe/kolla/config/grafana/dashboards/openstack/nvme.json

          },
          "editorMode": "code",
-          "expr": "nvme_data_units_written_total{serial_number=~\"$serial_number\"}",
+          "expr": "smartmon_data_units_written{serial_number=~\"$serial_number\"}",


While filtering by $serial_number is likely sufficient to target a specific device, adding a type="nvme" filter to all queries in this NVMe-specific dashboard would make them more robust and explicit. This prevents potential issues if serial numbers are not unique across different disk types.

"expr": "smartmon_data_units_written{serial_number=~\"$serial_number\", type=\"nvme\"}",

Add support for NVMe drives in smartmon.py

d628abc

Adds support for collecting SMART metrics from NVMe drives with the use of pysmart and smartctl JSON output. Includes updates to the deployment playbooks, tests, and dashboards.

technowhizz requested a review from a team as a code owner December 12, 2025 11:00

gemini-code-assist bot reviewed Dec 12, 2025

View reviewed changes

technowhizz temporarily deployed to Leafcloud December 12, 2025 11:09 — with GitHub Actions Inactive

technowhizz temporarily deployed to SMS Lab December 12, 2025 11:09 — with GitHub Actions Inactive

technowhizz temporarily deployed to Leafcloud December 12, 2025 11:09 — with GitHub Actions Inactive

technowhizz temporarily deployed to SMS Lab December 12, 2025 11:09 — with GitHub Actions Inactive

technowhizz temporarily deployed to Leafcloud December 12, 2025 11:09 — with GitHub Actions Inactive

technowhizz temporarily deployed to SMS Lab December 12, 2025 11:09 — with GitHub Actions Inactive

technowhizz temporarily deployed to Leafcloud December 12, 2025 11:09 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for NVMe drives in smartmon.py #2035

Add support for NVMe drives in smartmon.py #2035

technowhizz commented Dec 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

gemini-code-assist bot Dec 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Add support for NVMe drives in smartmon.py #2035

Are you sure you want to change the base?

Add support for NVMe drives in smartmon.py #2035

Conversation

technowhizz commented Dec 12, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants