Add metrics for VM states #2649

yuriadam-sap · 2025-12-04T11:39:02Z

What is this change about?

This adds new bosh metrics for VM states. The idea was initially proposed in this PR, which was reverted here because it broke the system design.

Please provide contextual information.

The initial PR, which was reverted, defined a VM as "unhealthy" whenever its state was not "running". In this PR, each state has its own metric, and there is an additional "unhealthy" metric for VMs whose state is running but which have no processes. This requires modifying the agent heartbeat to include the number of processes. The corresponding PR for the agent heartbeat is here.

The new metrics are:

bosh_unhealthy_agents
bosh_total_available_agents
bosh_failing_instances
bosh_stopped_instances
bosh_unknown_instances
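
As a rough illustration of how such per-deployment counts could be derived, here is a minimal Ruby sketch. It is not the code from this PR: the `Agent` struct, the `count_agents_by_deployment` helper, and the plain hash of deployments are hypothetical stand-ins, and only the "running but no processes" rule and the `job_state == "failing"` check are taken from the description and the diff.

```ruby
# Hypothetical stand-in for the monitor's agent object (assumption, not the real class).
Agent = Struct.new(:job_state, :number_of_processes)

# Hypothetical helper: returns { deployment_name => count of agents matching the predicate }.
def count_agents_by_deployment(deployments, &predicate)
  deployments.each_with_object({}) do |(name, agents), counts|
    counts[name] = agents.count(&predicate)
  end
end

# "Unhealthy" as described in this PR: state is running, but no processes are reported.
def unhealthy_agents(deployments)
  count_agents_by_deployment(deployments) do |agent|
    agent.job_state == "running" && agent.number_of_processes == 0
  end
end

# "Failing" mirrors the job_state check shown in the review diff.
def failing_instances(deployments)
  count_agents_by_deployment(deployments) { |agent| agent.job_state == "failing" }
end

# Example input: deployment name mapped to its agents (illustrative data only).
example_deployments = {
  "cf" => [Agent.new("running", 3), Agent.new("running", 0), Agent.new("failing", 1)],
  "zookeeper" => [Agent.new("running", 2)],
}

puts unhealthy_agents(example_deployments).inspect
puts failing_instances(example_deployments).inspect
```

The stopped and unknown counts would presumably follow the same pattern with a different `job_state` value, though the exact state strings are not shown in this PR text.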

What tests have you run against this PR?

Ran the unit tests and also tested a bosh dev release on a development environment with the modified bosh-agent.

How should this change be described in bosh release notes?

Bosh now emits the different VM states as metrics, which can be used to monitor the state of the instances per deployment over time.

Does this PR introduce a breaking change?

No

Tag your pair, your PM, and/or team!

@DennisAhausSAP

- Update Heartbeat.to_hash to include process_length attribute when present
- Extract process_length from heartbeat payload in Riemann plugin
- Attach process_length to each Riemann event (if present)
- Support both symbol and string keys for payload compatibility
- Add unit tests verifying process_length inclusion/omission in events
- All unit tests pass (Riemann: 5/5, Heartbeat: 10/10)
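
The "include when present" and "symbol or string keys" points in this commit message can be sketched roughly as follows. This is only an illustration under assumptions: `extract_process_length` and `attach_process_length` are hypothetical helper names, and events and payloads are modeled as plain hashes rather than the plugin's real objects.

```ruby
# Hypothetical helper: reads process_length from a payload keyed by symbols or strings.
# A reported value of 0 is kept, because 0 is truthy in Ruby.
def extract_process_length(payload)
  payload[:process_length] || payload["process_length"]
end

# Hypothetical helper: returns a copy of the event with process_length attached,
# or the event unchanged when the payload does not carry the attribute.
def attach_process_length(event, payload)
  process_length = extract_process_length(payload)
  return event if process_length.nil?

  event.merge(process_length: process_length)
end

# Illustrative usage with made-up event data.
event = { service: "bosh.hm", state: "running" }
puts attach_process_length(event, { "process_length" => 4 }).inspect
puts attach_process_length(event, {}).inspect
```

Omitting the key entirely when it is absent, rather than emitting `nil`, would keep events from agents that do not yet report the attribute unchanged.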

aramprice · 2025-12-04T16:22:49Z

Cross linking agent changes cloudfoundry/bosh-agent#390

Copilot

Pull request overview

This PR adds new metrics for tracking VM states in BOSH, including unhealthy, failing, stopped, and unknown instances. The implementation introduces five new Prometheus metrics and corresponding API endpoints to expose these state counts per deployment, with agents reporting the number of running processes to detect unhealthy states.

Key changes:

Adds number_of_processes and job_state tracking to the Agent class
Introduces five new metrics methods in InstanceManager for counting agents by state
Creates corresponding API endpoints and Prometheus gauge metrics for each state

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 13 comments.


File	Description
src/bosh-monitor/lib/bosh/monitor/agent.rb	Adds `job_state` and `number_of_processes` attributes to track agent health state
src/bosh-monitor/lib/bosh/monitor/instance_manager.rb	Implements five new metric calculation methods (unhealthy, failing, stopped, unknown, total_available agents)
src/bosh-monitor/lib/bosh/monitor/api_controller.rb	Adds five new GET endpoints to expose the VM state metrics
src/bosh-monitor/lib/bosh/monitor/events/heartbeat.rb	Updates heartbeat to include `number_of_processes` in the serialized hash
src/bosh-director/lib/bosh/director/metrics_collector.rb	Registers five new Prometheus gauges and fetches/populates metrics from health monitor endpoints
src/bosh-monitor/spec/unit/bosh/monitor/instance_manager_spec.rb	Adds comprehensive test coverage for all five new metric methods
src/bosh-monitor/spec/unit/bosh/monitor/api_controller_spec.rb	Adds tests for all five new API endpoints including 503 responses
src/bosh-monitor/spec/unit/bosh/monitor/agent_spec.rb	Tests that `update_instance` preserves `job_state` and `number_of_processes`
src/bosh-director/spec/unit/bosh/director/metrics_collector_spec.rb	Tests Prometheus metric registration and population for all new VM state metrics


Copilot · 2025-12-18T16:00:17Z

src/bosh-monitor/lib/bosh/monitor/instance_manager.rb

+    def failing_instances
+      agents_hash = {}
+      @deployment_name_to_deployments.each do |name, deployment|
+        agents_hash[name] = deployment.agents.count { |agent| agent.job_state == 'failing' }


Inconsistent quote style: all other methods in this file use double quotes for strings, but this method uses single quotes. Consider using double quotes for consistency.

Copilot · 2025-12-18T16:00:17Z

src/bosh-monitor/lib/bosh/monitor/api_controller.rb

      end
    end
+
+    get "/unhealthy_agents" do


Inconsistent quote style in the path parameter: uses double quotes while the previous unresponsive_agents endpoint uses single quotes. Consider using single quotes consistently as seen throughout the rest of the file.

Suggested change

-    get "/unhealthy_agents" do
+    get '/unhealthy_agents' do

Copilot · 2025-12-18T16:00:17Z