-
Notifications
You must be signed in to change notification settings - Fork 23
Add support for NVMe drives in smartmon.py #2035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: stackhpc/2025.1
Are you sure you want to change the base?
Conversation
Adds support for collecting SMART metrics from NVMe drives with the use of pysmart and smartctl JSON output. Includes updates to the deployment playbooks, tests, and dashboards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request adds support for monitoring NVMe drives by integrating smartctl's JSON output into smartmon.py, removing the dependency on nvme-cli. The changes are comprehensive, covering Ansible playbooks, the monitoring script, tests, and Grafana dashboards. The implementation is generally robust, with good error handling and updated tests. However, I've identified a few issues that could impact monitoring coverage and data accuracy. A logic change in smartmon.py might cause some disks to be skipped from monitoring. Additionally, some Prometheus queries in the updated Grafana dashboards are missing necessary filters, which could lead to incorrect data aggregation and display. I have provided specific suggestions to address these points.
| if not serial_number or not dev.assessment: | ||
| continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script now skips devices if dev.assessment is not available. This is too restrictive and could lead to a loss of monitoring for devices where pySMART fails to determine the health assessment, even if other important metrics like temperature are available. The parse_device_info function already handles cases where assessment is None. To avoid losing visibility on these drives, we should only skip devices if they lack a serial number, which is essential for unique identification.
| if not serial_number or not dev.assessment: | |
| continue | |
| if not serial_number: | |
| continue |
| "definition": "label_values(smartmon_data_units_read,serial_number)", | ||
| "includeAll": false, | ||
| "label": "Serial Number", | ||
| "name": "serial_number", | ||
| "options": [], | ||
| "query": { | ||
| "qryType": 1, | ||
| "query": "label_values(nvme_data_units_read_total,serial_number)", | ||
| "query": "label_values(smartmon_data_units_read,serial_number)", | ||
| "refId": "PrometheusVariableQueryEditor-VariableQuery" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The template variable serial_number is populated by querying for label_values(smartmon_data_units_read,serial_number), which will include serial numbers from all disk types, not just NVMe drives. This will cause non-NVMe drives to appear in the dropdown, and selecting them will result in a broken or empty dashboard. The query should be filtered to only include NVMe devices.
"definition": "label_values(smartmon_data_units_read{type=\"nvme\"},serial_number)",
"includeAll": false,
"label": "Serial Number",
"name": "serial_number",
"options": [],
"query": {
"qryType": 1,
"query": "label_values(smartmon_data_units_read{type=\"nvme\"},serial_number)",
"refId": "PrometheusVariableQueryEditor-VariableQuery"
},| "editorMode": "code", | ||
| "expr": "delta(nvme_data_units_written_total{instance=~\"$node\"}[24h])*512000 / nvme_physical_size_bytes{instance=~\"$node\"}", | ||
| "legendFormat": "{{instance}} - {{device}}", | ||
| "expr": "delta(smartmon_data_units_written{instance=~\"$node\"}[24h])*512000 / smartmon_physical_size_bytes{instance=~\"$node\"}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DWPD (Disk Writes Per Day) panel is intended for NVMe drives, but the Prometheus query is missing a type="nvme" filter. Without this filter, the query might incorrectly include metrics from non-NVMe drives if they also export smartmon_data_units_written and smartmon_physical_size_bytes, leading to inaccurate DWPD calculations.
"expr": "delta(smartmon_data_units_written{instance=~\"$node\", type=\"nvme\"}[24h])*512000 / smartmon_physical_size_bytes{instance=~\"$node\", type=\"nvme\"}",| if isinstance(data, dict): | ||
| if isinstance(data.get("stackhpc_dwpd_ratings"), list): | ||
| data_iterable = data["stackhpc_dwpd_ratings"] | ||
| elif isinstance(data.get("dwpd_values"), list): | ||
| data_iterable = data["dwpd_values"] | ||
| else: | ||
| data_iterable = [] | ||
| elif isinstance(data, list): | ||
| data_iterable = data | ||
| else: | ||
| data_iterable = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic to determine data_iterable from the loaded JSON data is a bit verbose. It can be simplified for better readability and maintainability while achieving the same result.
| if isinstance(data, dict): | |
| if isinstance(data.get("stackhpc_dwpd_ratings"), list): | |
| data_iterable = data["stackhpc_dwpd_ratings"] | |
| elif isinstance(data.get("dwpd_values"), list): | |
| data_iterable = data["dwpd_values"] | |
| else: | |
| data_iterable = [] | |
| elif isinstance(data, list): | |
| data_iterable = data | |
| else: | |
| data_iterable = [] | |
| data_iterable = [] | |
| if isinstance(data, list): | |
| data_iterable = data | |
| elif isinstance(data, dict): | |
| ratings = data.get("stackhpc_dwpd_ratings", data.get("dwpd_values")) | |
| if isinstance(ratings, list): | |
| data_iterable = ratings |
| except Exception: | ||
| nvme_json = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Catching a broad Exception can hide unexpected errors and make debugging more difficult. It's better to catch more specific exceptions that are expected from the smartctl command execution and JSON parsing, such as errors related to subprocess execution, file not found, or JSON decoding.
| except Exception: | |
| nvme_json = {} | |
| except (subprocess.SubprocessError, json.JSONDecodeError, FileNotFoundError, OSError): | |
| nvme_json = {} |
| }, | ||
| "editorMode": "code", | ||
| "expr": "nvme_data_units_written_total{serial_number=~\"$serial_number\"}", | ||
| "expr": "smartmon_data_units_written{serial_number=~\"$serial_number\"}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While filtering by $serial_number is likely sufficient to target a specific device, adding a type="nvme" filter to all queries in this NVMe-specific dashboard would make them more robust and explicit. This prevents potential issues if serial numbers are not unique across different disk types.
"expr": "smartmon_data_units_written{serial_number=~\"$serial_number\", type=\"nvme\"}",
Adds support for collecting SMART metrics from NVMe drives with the use of pysmart and smartctl JSON output. Includes updates to the deployment playbooks, tests, and dashboards.