Skip to content

Conversation

@technowhizz
Copy link
Contributor

Adds support for collecting SMART metrics from NVMe drives with the use of pysmart and smartctl JSON output. Includes updates to the deployment playbooks, tests, and dashboards.

Adds support for collecting SMART metrics from NVMe drives with the use
of pysmart and smartctl JSON output. Includes updates to the deployment
playbooks, tests, and dashboards.
@technowhizz technowhizz requested a review from a team as a code owner December 12, 2025 11:00
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for monitoring NVMe drives by integrating smartctl's JSON output into smartmon.py, removing the dependency on nvme-cli. The changes are comprehensive, covering Ansible playbooks, the monitoring script, tests, and Grafana dashboards. The implementation is generally robust, with good error handling and updated tests. However, I've identified a few issues that could impact monitoring coverage and data accuracy. A logic change in smartmon.py might cause some disks to be skipped from monitoring. Additionally, some Prometheus queries in the updated Grafana dashboards are missing necessary filters, which could lead to incorrect data aggregation and display. I have provided specific suggestions to address these points.

Comment on lines +484 to +485
if not serial_number or not dev.assessment:
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The script now skips devices if dev.assessment is not available. This is too restrictive and could lead to a loss of monitoring for devices where pySMART fails to determine the health assessment, even if other important metrics like temperature are available. The parse_device_info function already handles cases where assessment is None. To avoid losing visibility on these drives, we should only skip devices if they lack a serial number, which is essential for unique identification.

Suggested change
if not serial_number or not dev.assessment:
continue
if not serial_number:
continue

Comment on lines +1318 to 1326
"definition": "label_values(smartmon_data_units_read,serial_number)",
"includeAll": false,
"label": "Serial Number",
"name": "serial_number",
"options": [],
"query": {
"qryType": 1,
"query": "label_values(nvme_data_units_read_total,serial_number)",
"query": "label_values(smartmon_data_units_read,serial_number)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The template variable serial_number is populated by querying for label_values(smartmon_data_units_read,serial_number), which will include serial numbers from all disk types, not just NVMe drives. This will cause non-NVMe drives to appear in the dropdown, and selecting them will result in a broken or empty dashboard. The query should be filtered to only include NVMe devices.

        "definition": "label_values(smartmon_data_units_read{type=\"nvme\"},serial_number)",
        "includeAll": false,
        "label": "Serial Number",
        "name": "serial_number",
        "options": [],
        "query": {
          "qryType": 1,
          "query": "label_values(smartmon_data_units_read{type=\"nvme\"},serial_number)",
          "refId": "PrometheusVariableQueryEditor-VariableQuery"
        },

"editorMode": "code",
"expr": "delta(nvme_data_units_written_total{instance=~\"$node\"}[24h])*512000 / nvme_physical_size_bytes{instance=~\"$node\"}",
"legendFormat": "{{instance}} - {{device}}",
"expr": "delta(smartmon_data_units_written{instance=~\"$node\"}[24h])*512000 / smartmon_physical_size_bytes{instance=~\"$node\"}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The DWPD (Disk Writes Per Day) panel is intended for NVMe drives, but the Prometheus query is missing a type="nvme" filter. Without this filter, the query might incorrectly include metrics from non-NVMe drives if they also export smartmon_data_units_written and smartmon_physical_size_bytes, leading to inaccurate DWPD calculations.

          "expr": "delta(smartmon_data_units_written{instance=~\"$node\", type=\"nvme\"}[24h])*512000 / smartmon_physical_size_bytes{instance=~\"$node\", type=\"nvme\"}",

Comment on lines +129 to +139
if isinstance(data, dict):
if isinstance(data.get("stackhpc_dwpd_ratings"), list):
data_iterable = data["stackhpc_dwpd_ratings"]
elif isinstance(data.get("dwpd_values"), list):
data_iterable = data["dwpd_values"]
else:
data_iterable = []
elif isinstance(data, list):
data_iterable = data
else:
data_iterable = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The logic to determine data_iterable from the loaded JSON data is a bit verbose. It can be simplified for better readability and maintainability while achieving the same result.

Suggested change
if isinstance(data, dict):
if isinstance(data.get("stackhpc_dwpd_ratings"), list):
data_iterable = data["stackhpc_dwpd_ratings"]
elif isinstance(data.get("dwpd_values"), list):
data_iterable = data["dwpd_values"]
else:
data_iterable = []
elif isinstance(data, list):
data_iterable = data
else:
data_iterable = []
data_iterable = []
if isinstance(data, list):
data_iterable = data
elif isinstance(data, dict):
ratings = data.get("stackhpc_dwpd_ratings", data.get("dwpd_values"))
if isinstance(ratings, list):
data_iterable = ratings

Comment on lines +324 to +325
except Exception:
nvme_json = {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Catching a broad Exception can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions that are expected from the smartctl command execution and JSON parsing, such as errors related to subprocess execution, file not found, or JSON decoding.

Suggested change
except Exception:
nvme_json = {}
except (subprocess.SubprocessError, json.JSONDecodeError, FileNotFoundError, OSError):
nvme_json = {}

},
"editorMode": "code",
"expr": "nvme_data_units_written_total{serial_number=~\"$serial_number\"}",
"expr": "smartmon_data_units_written{serial_number=~\"$serial_number\"}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While filtering by $serial_number is likely sufficient to target a specific device, adding a type="nvme" filter to all queries in this NVMe-specific dashboard would make them more robust and explicit. This prevents potential issues if serial numbers are not unique across different disk types.

          "expr": "smartmon_data_units_written{serial_number=~\"$serial_number\", type=\"nvme\"}",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants