Commit 84f0bc2

craig[bot], dhartunian, and dt committed
158219: server: add vCPU count to diagnostics and status r=dhartunian a=dhartunian

Previously, vCPU was just the OS CPU count as reported by the runtime. This change adds a `GetVCPUs` function that's cgroup aware and uses it to compute a new num_vcpus column on the node status protobuf. This new column is used to display stats on the overview page.

Resolves: CRDB-54703
Epic: None

Release note (ui change): Previously, we would incorrectly report operating system CPU counts on the DB Console overview page even though the column was labeled `vCPUs`. This change fixes the reporting to measure and report vCPUs correctly using cgroups. This should now reflect reserved compute in Kubernetes and other virtualized environments.

158364: ui/jobs: show pause reasons in jobs UI as warnings r=dt a=dt

Previously we only showed 'error' for paused jobs, but pause reasons don't show up in error (at least not since 24.x). This extends the UI to look for them in `running_status`, and also treat pause statuses with added advisory reasons as attention-worthy (yellow) warnings instead of grey normal pauses.

158388: cli/demo: start job scheduler right away r=dt a=dt

Previously we started it quickly but only in the system tenant. Now we start it quickly in all. Also run it more often.

Release note: none.
Epic: none.

158518: cloud/azure: default enable caching SDK clients r=dt a=dt

This mirrors the default in s3 client handling.

Release note: none.
Epic: none.

Co-authored-by: David Hartunian <davidh@cockroachlabs.com>
Co-authored-by: David Taylor <davidt@davidt.io>
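To make the cgroup-aware count in 158219 concrete, here is a minimal sketch of deriving a fractional vCPU count from a cgroup v2 `cpu.max` file, falling back to the OS CPU count when no limit is set. This is not the `GetVCPUs` implementation from this commit: the function names `vcpusFromCPUMax` and `approxVCPUs`, the cgroup v2 only handling, and the `/sys/fs/cgroup/cpu.max` path are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// vcpusFromCPUMax parses the contents of a cgroup v2 cpu.max file, which has
// the form "<quota> <period>" or "max <period>", and returns quota/period.
// The boolean is false when no limit is set or the input is malformed.
// Hypothetical helper; not part of the CockroachDB code base.
func vcpusFromCPUMax(contents string) (float64, bool) {
	fields := strings.Fields(contents)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err := strconv.ParseFloat(fields[0], 64)
	if err != nil || quota <= 0 {
		return 0, false
	}
	period, err := strconv.ParseFloat(fields[1], 64)
	if err != nil || period <= 0 {
		return 0, false
	}
	return quota / period, true
}

// approxVCPUs is an illustrative stand-in for a cgroup-aware vCPU lookup.
// Assumption: the process sees its own cgroup v2 limits at this path.
func approxVCPUs() float64 {
	if b, err := os.ReadFile("/sys/fs/cgroup/cpu.max"); err == nil {
		if v, ok := vcpusFromCPUMax(string(b)); ok {
			return v
		}
	}
	// No container limit detected: fall back to the OS CPU count.
	return float64(runtime.NumCPU())
}

func main() {
	v, ok := vcpusFromCPUMax("150000 100000\n")
	fmt.Println(v, ok) // prints "1.5 true"
	fmt.Println(approxVCPUs())
}
```

A quota of 150000 over a period of 100000 yields 1.5, which is why the new field is a floating-point value rather than an integer.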
5 parents 10111a0 + eae4237 + 82f110f + 576175a + 470d4ed commit 84f0bc2

21 files changed: +176 -26 lines changed

docs/generated/http/full.md

Lines changed: 6 additions & 2 deletions
@@ -349,7 +349,8 @@ NodeStatus records the most recent values of metrics for a node.
 | latencies | [NodeStatus.LatenciesEntry](#cockroach.server.serverpb.NodesResponse-cockroach.server.status.statuspb.NodeStatus.LatenciesEntry) | repeated | latencies is a map of nodeIDs to nanoseconds which is the latency between this node and the other node.<br><br>NOTE: this is deprecated and is only set if the min supported cluster version is >= VersionRPCNetworkStats. | [reserved](#support-status) |
 | activity | [NodeStatus.ActivityEntry](#cockroach.server.serverpb.NodesResponse-cockroach.server.status.statuspb.NodeStatus.ActivityEntry) | repeated | activity is a map of nodeIDs to network statistics from this node to other nodes. | [reserved](#support-status) |
 | total_system_memory | [int64](#cockroach.server.serverpb.NodesResponse-int64) | | total_system_memory is the total RAM available to the system (or, if detected, the memory available to the cgroup this process is in) in bytes. | [alpha](#support-status) |
-| num_cpus | [int32](#cockroach.server.serverpb.NodesResponse-int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. Note that this does not report the number of CPUs actually used by `cockroach`; this parameter is controlled separately. | [alpha](#support-status) |
+| num_cpus | [int32](#cockroach.server.serverpb.NodesResponse-int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. This reflects the physical CPU count and does not account for container/cgroup limits. See num_vcpus for container-aware CPU allocation. | [alpha](#support-status) |
+| num_vcpus | [double](#cockroach.server.serverpb.NodesResponse-double) | | num_vcpus is the number of vCPUs allocated to the process by the container orchestrator (e.g., Kubernetes, Docker) based on cgroup CPU quota/period. This represents the platform CPU allocation and is independent of GOMAXPROCS runtime tuning. Falls back to num_cpus if no container limits are configured. Supports fractional values (e.g., 1.5 for Kubernetes CPU limits like "1500m"). | [alpha](#support-status) |
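The fractional example in the `num_vcpus` description ("1500m" reported as 1.5) can be illustrated with a small conversion. This is only a sketch: `cpuQuantityToVCPUs` is a hypothetical helper that handles just the millicore ("m") and plain numeric forms, and it is not how the commit computes the value (the description says the value comes from the cgroup CPU quota/period).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuQuantityToVCPUs converts a Kubernetes-style CPU quantity such as
// "1500m" (millicores) or "2" (whole cores) into a fractional vCPU count.
// Hypothetical helper for illustration; it handles only these two forms.
func cpuQuantityToVCPUs(q string) (float64, error) {
	if strings.HasSuffix(q, "m") {
		milli, err := strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
		if err != nil {
			return 0, err
		}
		// 1000 millicores correspond to one full CPU.
		return float64(milli) / 1000, nil
	}
	return strconv.ParseFloat(q, 64)
}

func main() {
	for _, q := range []string{"1500m", "500m", "2"} {
		v, err := cpuQuantityToVCPUs(q)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %v vCPUs\n", q, v)
	}
}
```

An `int32` field such as `num_cpus` cannot represent 1.5 or 0.5, which is consistent with the new field being declared as `double`.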

@@ -501,7 +502,8 @@ NodeStatus records the most recent values of metrics for a node.
 | latencies | [NodeStatus.LatenciesEntry](#cockroach.server.status.statuspb.NodeStatus-cockroach.server.status.statuspb.NodeStatus.LatenciesEntry) | repeated | latencies is a map of nodeIDs to nanoseconds which is the latency between this node and the other node.<br><br>NOTE: this is deprecated and is only set if the min supported cluster version is >= VersionRPCNetworkStats. | [reserved](#support-status) |
 | activity | [NodeStatus.ActivityEntry](#cockroach.server.status.statuspb.NodeStatus-cockroach.server.status.statuspb.NodeStatus.ActivityEntry) | repeated | activity is a map of nodeIDs to network statistics from this node to other nodes. | [reserved](#support-status) |
 | total_system_memory | [int64](#cockroach.server.status.statuspb.NodeStatus-int64) | | total_system_memory is the total RAM available to the system (or, if detected, the memory available to the cgroup this process is in) in bytes. | [alpha](#support-status) |
-| num_cpus | [int32](#cockroach.server.status.statuspb.NodeStatus-int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. Note that this does not report the number of CPUs actually used by `cockroach`; this parameter is controlled separately. | [alpha](#support-status) |
+| num_cpus | [int32](#cockroach.server.status.statuspb.NodeStatus-int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. This reflects the physical CPU count and does not account for container/cgroup limits. See num_vcpus for container-aware CPU allocation. | [alpha](#support-status) |
+| num_vcpus | [double](#cockroach.server.status.statuspb.NodeStatus-double) | | num_vcpus is the number of vCPUs allocated to the process by the container orchestrator (e.g., Kubernetes, Docker) based on cgroup CPU quota/period. This represents the platform CPU allocation and is independent of GOMAXPROCS runtime tuning. Falls back to num_cpus if no container limits are configured. Supports fractional values (e.g., 1.5 for Kubernetes CPU limits like "1500m"). | [alpha](#support-status) |

@@ -656,6 +658,7 @@ NodeStatus records the most recent values of metrics for a node.
 | activity | [NodeResponse.ActivityEntry](#cockroach.server.serverpb.NodesResponseExternal-cockroach.server.serverpb.NodeResponse.ActivityEntry) | repeated | activity is a map of nodeIDs to network statistics from this node to other nodes. | [reserved](#support-status) |
 | total_system_memory | [int64](#cockroach.server.serverpb.NodesResponseExternal-int64) | | total_system_memory is the total RAM available to the system (or, if detected, the memory available to the cgroup this process is in) in bytes. | [alpha](#support-status) |
 | num_cpus | [int32](#cockroach.server.serverpb.NodesResponseExternal-int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. Note that this does not report the number of CPUs actually used by `cockroach`; this parameter is controlled separately. | [alpha](#support-status) |
+| num_vcpus | [double](#cockroach.server.serverpb.NodesResponseExternal-double) | | num_vcpus is the number of provisioned vCPUs as reported by cgroups or the operating system. | [reserved](#support-status) |

@@ -914,6 +917,7 @@ NodeStatus records the most recent values of metrics for a node.
 | activity | [NodeResponse.ActivityEntry](#cockroach.server.serverpb.NodeResponse-cockroach.server.serverpb.NodeResponse.ActivityEntry) | repeated | activity is a map of nodeIDs to network statistics from this node to other nodes. | [reserved](#support-status) |
 | total_system_memory | [int64](#cockroach.server.serverpb.NodeResponse-int64) | | total_system_memory is the total RAM available to the system (or, if detected, the memory available to the cgroup this process is in) in bytes. | [alpha](#support-status) |
 | num_cpus | [int32](#cockroach.server.serverpb.NodeResponse-int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. Note that this does not report the number of CPUs actually used by `cockroach`; this parameter is controlled separately. | [alpha](#support-status) |
+| num_vcpus | [double](#cockroach.server.serverpb.NodeResponse-double) | | num_vcpus is the number of provisioned vCPUs as reported by cgroups or the operating system. | [reserved](#support-status) |

docs/generated/http/nodes-other.md

Lines changed: 2 additions & 1 deletion
@@ -21,7 +21,8 @@ Support status: [alpha](#support-status)
 | latencies | [NodeStatus.LatenciesEntry](#cockroach.server.status.statuspb.NodeStatus.LatenciesEntry) | repeated | latencies is a map of nodeIDs to nanoseconds which is the latency between this node and the other node.<br><br>NOTE: this is deprecated and is only set if the min supported cluster version is >= VersionRPCNetworkStats. | [reserved](#support-status) |
 | activity | [NodeStatus.ActivityEntry](#cockroach.server.status.statuspb.NodeStatus.ActivityEntry) | repeated | activity is a map of nodeIDs to network statistics from this node to other nodes. | [reserved](#support-status) |
 | total_system_memory | [int64](#int64) | | total_system_memory is the total RAM available to the system (or, if detected, the memory available to the cgroup this process is in) in bytes. | [alpha](#support-status) |
-| num_cpus | [int32](#int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. Note that this does not report the number of CPUs actually used by `cockroach`; this parameter is controlled separately. | [alpha](#support-status) |
+| num_cpus | [int32](#int32) | | num_cpus is the number of logical CPUs as reported by the operating system on the host where the `cockroach` process is running. This reflects the physical CPU count and does not account for container/cgroup limits. See num_vcpus for container-aware CPU allocation. | [alpha](#support-status) |
+| num_vcpus | [double](#double) | | num_vcpus is the number of vCPUs allocated to the process by the container orchestrator (e.g., Kubernetes, Docker) based on cgroup CPU quota/period. This represents the platform CPU allocation and is independent of GOMAXPROCS runtime tuning. Falls back to num_cpus if no container limits are configured. Supports fractional values (e.g., 1.5 for Kubernetes CPU limits like "1500m"). | [alpha](#support-status) |

pkg/cli/democluster/demo_cluster.go

Lines changed: 22 additions & 1 deletion
@@ -535,6 +535,15 @@ func (c *transientCluster) startTenantService(
 					InjectedLatencyEnabled: c.latencyEnabled.Load,
 				},
 			},
+			JobsTestingKnobs: &jobs.TestingKnobs{
+				// Allow the scheduler daemon to start earlier in demo.
+				SchedulerDaemonInitialScanDelay: func() time.Duration {
+					return time.Second * 2
+				},
+				SchedulerDaemonScanDelay: func() time.Duration {
+					return time.Second * 5
+				},
+			},
 		},
 	}

@@ -563,6 +572,15 @@ func (c *transientCluster) startTenantService(
 					InjectedLatencyEnabled: c.latencyEnabled.Load,
 				},
 			},
+			JobsTestingKnobs: &jobs.TestingKnobs{
+				// Allow the scheduler daemon to start earlier in demo.
+				SchedulerDaemonInitialScanDelay: func() time.Duration {
+					return time.Second * 2
+				},
+				SchedulerDaemonScanDelay: func() time.Duration {
+					return time.Second * 5
+				},
+			},
 		},
 	})
 	if err != nil {
@@ -924,7 +942,10 @@ func (demoCtx *Context) testServerArgsForTransientCluster(
 		JobsTestingKnobs: &jobs.TestingKnobs{
 			// Allow the scheduler daemon to start earlier in demo.
 			SchedulerDaemonInitialScanDelay: func() time.Duration {
-				return time.Second * 15
+				return time.Second * 2
+			},
+			SchedulerDaemonScanDelay: func() time.Duration {
+				return time.Second * 5
 			},
 		},
 	},

pkg/cloud/azure/azure_storage.go

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ var reuseSession = settings.RegisterBoolSetting(
 	settings.ApplicationLevel,
 	"cloudstorage.azure.session_reuse.enabled",
 	"persist the last opened azure client and re-use it when opening a new client with the same argument (some settings may take 2mins to take effect)",
-	false,
+	true,
 )

 // A note on Azure authentication:
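The setting description ("persist the last opened azure client and re-use it when opening a new client with the same argument") can be pictured with a small last-client cache. This is a sketch under stated assumptions, not the Azure storage code from this repository: the `client` and `lastClientCache` types and the string key are hypothetical, and a real implementation would construct an SDK client instead of a counter-tagged struct.

```go
package main

import (
	"fmt"
	"sync"
)

// client is a hypothetical placeholder for an SDK client.
type client struct{ id int }

// lastClientCache remembers only the most recently opened client.
type lastClientCache struct {
	mu     sync.Mutex
	key    string
	cached *client
	opened int // number of clients actually constructed
}

// open returns the cached client when reuse is enabled and key matches the
// last opened one; otherwise it constructs a new client and, when reuse is
// enabled, remembers it.
func (c *lastClientCache) open(key string, reuse bool) *client {
	c.mu.Lock()
	defer c.mu.Unlock()
	if reuse && c.cached != nil && c.key == key {
		return c.cached
	}
	c.opened++
	cl := &client{id: c.opened}
	if reuse {
		c.key, c.cached = key, cl
	}
	return cl
}

func main() {
	var cache lastClientCache
	a := cache.open("account=foo", true)
	b := cache.open("account=foo", true)
	fmt.Println(a == b, cache.opened) // prints "true 1"

	c := cache.open("account=bar", true)
	fmt.Println(a == c, cache.opened) // prints "false 2"
}
```

With reuse enabled by default, repeated opens with identical arguments avoid rebuilding the client each time.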

pkg/server/api_v2_ranges.go

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,8 @@ type nodeStatus struct {
 	TotalSystemMemory int64 `json:"total_system_memory,omitempty"`
 	// NumCpus is the number of CPUs on this node.
 	NumCpus int32 `json:"num_cpus,omitempty"`
+	// NumVcpus is the number of vCPUs on this node.
+	NumVcpus float64 `json:"num_vcpus,omitempty"`
 	// UpdatedAt is the time at which the node status record was last updated,
 	// in nanoseconds since Unix epoch.
 	UpdatedAt int64 `json:"updated_at,omitempty"`
@@ -128,6 +130,7 @@ func (a *apiV2SystemServer) listNodes(w http.ResponseWriter, r *http.Request) {
 			StoreMetrics: storeMetrics,
 			TotalSystemMemory: n.TotalSystemMemory,
 			NumCpus: n.NumCpus,
+			NumVcpus: n.NumVcpus,
 			UpdatedAt: n.UpdatedAt,
 			LivenessStatus: int32(nodes.LivenessByNodeID[n.Desc.NodeID]),
 		})
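Because the new field is tagged `omitempty`, a zero `NumVcpus` is dropped from the JSON response rather than serialized as `0`. The sketch below shows that behavior with Go's standard `encoding/json`; `nodeStatusSketch` is a trimmed, hypothetical copy of the struct with only the two CPU fields, and `encode` is a helper introduced for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeStatusSketch is a trimmed, hypothetical copy of the API struct that
// keeps only the two CPU-related fields and their JSON tags from the diff.
type nodeStatusSketch struct {
	NumCpus  int32   `json:"num_cpus,omitempty"`
	NumVcpus float64 `json:"num_vcpus,omitempty"`
}

// encode marshals s to a JSON string, returning "" on error.
func encode(s nodeStatusSketch) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	fmt.Println(encode(nodeStatusSketch{NumCpus: 8, NumVcpus: 1.5}))
	// prints {"num_cpus":8,"num_vcpus":1.5}

	// A zero value, for example from a node status that does not populate
	// the field, is omitted entirely.
	fmt.Println(encode(nodeStatusSketch{NumCpus: 8}))
	// prints {"num_cpus":8}
}
```

Clients reading this endpoint should therefore treat a missing `num_vcpus` key as "not reported" rather than assuming the key is always present.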

pkg/server/diagnostics/BUILD.bazel

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ go_library(
         "//pkg/kv",
         "//pkg/roachpb",
         "//pkg/server/diagnostics/diagnosticspb",
+        "//pkg/server/status",
         "//pkg/server/status/statuspb",
         "//pkg/server/telemetry",
         "//pkg/settings",

pkg/server/diagnostics/diagnostics.go

Lines changed: 2 additions & 0 deletions
@@ -15,6 +15,7 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/ccl/utilccl/licenseccl"
 	"github.com/cockroachdb/cockroach/pkg/roachpb"
 	"github.com/cockroachdb/cockroach/pkg/server/diagnostics/diagnosticspb"
+	"github.com/cockroachdb/cockroach/pkg/server/status"
 	"github.com/cockroachdb/cockroach/pkg/util/cloudinfo"
 	"github.com/cockroachdb/cockroach/pkg/util/envutil"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
@@ -183,6 +184,7 @@ func populateHardwareInfo(ctx context.Context, e *diagnosticspb.Environment) {
 	}

 	e.Hardware.Cpu.Numcpu = int32(system.NumCPU())
+	e.Hardware.Cpu.Numvcpu = float32(status.GetVCPUs(ctx))
 	if cpus, err := cpu.InfoWithContext(ctx); err == nil && len(cpus) > 0 {
 		e.Hardware.Cpu.Sockets = int32(len(cpus))
 		c := cpus[0]

pkg/server/diagnostics/diagnosticspb/diagnostics.proto

Lines changed: 1 addition & 0 deletions
@@ -109,6 +109,7 @@ message CPUInfo {
   string model = 4; // reported model name e.g. `Intel(R) Core(TM) i7-7920HQ CPU @ 3.10GHz`
   float mhz = 5; // speed of first cpu e.g. 3100
   repeated string features = 6; // cpu feature flags for first cpu
+  float numvcpu = 7; // container-aware vCPU allocation from cgroup limits
 }

 message HardwareInfo {

pkg/server/nodes_response.go

Lines changed: 1 addition & 0 deletions
@@ -147,6 +147,7 @@ func nodeStatusToResp(n *statuspb.NodeStatus, hasViewClusterMetadata bool) serve
 		Activity: activity,
 		TotalSystemMemory: n.TotalSystemMemory,
 		NumCpus: n.NumCpus,
+		NumVcpus: n.NumVcpus,
 	}

 	if hasViewClusterMetadata {

pkg/server/serverpb/status.proto

Lines changed: 4 additions & 0 deletions
@@ -298,6 +298,10 @@ message NodeResponse {
   // this parameter is controlled separately.
   // API: PUBLIC ALPHA
   int32 num_cpus = 12;
+
+  // num_vcpus is the number of provisioned vCPUs as reported by
+  // cgroups or the operating system.
+  double num_vcpus = 13;
 }

 // RegionsRequest requests all available regions.
