Add NPD (node problem detector) variant for security-agent-readiness example #154
| We can use the Node Readiness Controller to enforce a security readiness guardrail: | ||
| 1. **Taint** the node with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule` as soon as it joins the cluster. | ||
| 1. **Taint** the node with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/security-agent-ready=pending:NoSchedule` as soon as it joins the cluster. |
`readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule` is an invalid taint format:
❯ kubectl taint nodes security-agent-demo-worker readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule
error: invalid taint spec: readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule, a qualified name must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName', or 'my.name', or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]') with an optional DNS subdomain prefix and '/' (e.g. 'example.com/MyName')
See 'kubectl taint -h' for help and examples
Proposed a fix: #155
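For readers wondering why the key is rejected: the error message describes a "qualified name", which allows at most one `/` separating an optional DNS subdomain prefix from the name part. Below is a minimal Python sketch of that rule, reconstructed from the error text above. The function name `is_qualified_name`, the prefix regex, and the length limits (253 and 63) are assumptions for illustration, not the upstream implementation, and `falco.readiness.k8s.io` is a hypothetical component-style prefix.

```python
import re

# Name part: the regex quoted in the kubectl error message above.
NAME_RE = re.compile(r"([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]")
# Prefix part: a lowercase DNS subdomain (assumed to follow RFC 1123 rules).
PREFIX_RE = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_qualified_name(key: str) -> bool:
    """Return True if key has the shape [dns-subdomain-prefix/]name."""
    parts = key.split("/")
    if len(parts) == 1:
        prefix, name = None, parts[0]
    elif len(parts) == 2:
        prefix, name = parts
    else:
        # Two or more slashes: the case that broke the original taint key.
        return False
    if prefix is not None:
        # Assumed limit: a DNS subdomain is at most 253 characters.
        if not (0 < len(prefix) <= 253 and PREFIX_RE.fullmatch(prefix)):
            return False
    # Assumed limit: the name part is at most 63 characters.
    return 0 < len(name) <= 63 and NAME_RE.fullmatch(name) is not None


for key in (
    "readiness.k8s.io/falco.org/security-agent-ready",  # two slashes
    "readiness.k8s.io/security-agent-ready",            # single prefix
    "falco.readiness.k8s.io/security-agent-ready",      # hypothetical component prefix
):
    print(f"{key}: {'valid' if is_qualified_name(key) else 'invalid'}")
```

Under these assumptions only the first key is rejected, which matches the kubectl error above.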
Interesting, I didn't know there's a limitation of only one domain prefix. Thanks for flagging this!
It may be beneficial for our use cases to have separate "subdomains" support tho. :(
We could follow up further on this requirement later.
> It may be beneficial for our use cases to have separate "subdomains" support tho. :(

So the problem is the two slashes (`/`).
One option for subdomain purposes could be `<component>.readiness.k8s.io/security-agent-ready` (with CEL support, it might work).
But yes, we'll discuss it in a follow-up.
ajaysundark
left a comment
/lgtm
Nice improvements. Thanks for taking a deeper look into this. Consider some suggestions on prefixed conditions, otherwise good to merge.
| "source": "falco-monitor", | ||
| "conditions": [ | ||
| { | ||
| "type": "FalcoProblem", |
Your documentation uses a different condition (`falco.org/FalcoReady`) than the one used here.
I prefer the former, since including the domain name in the node condition also makes the ownership clear.
Updated to use the type `falco.org/FalcoNotReady` (with the domain part) in the latest commit refresh.
I can't use `FalcoReady` because NPD treats all conditions as problem-oriented (exit 0 means condition=False), so using `FalcoReady` would produce inverted conditions and events (FalcoReady=True when Falco is not up).
The NRR (reporter sidecar) variant still uses `falco.org/FalcoReady`.
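To make the polarity argument concrete, here is a small Python sketch of the mapping described above. It assumes the usual NPD custom plugin convention (exit 0 = OK, exit 1 = problem, anything else = unknown). The helper names `condition_status` and `describe` are hypothetical and only illustrate why a positively named condition comes out inverted.

```python
# Assumed exit-code convention for an NPD custom plugin check script.
EXIT_OK = 0
EXIT_NON_OK = 1


def condition_status(exit_code: int) -> str:
    """Map a check script's exit code to a problem-oriented condition status."""
    if exit_code == EXIT_OK:
        return "False"  # no problem detected
    if exit_code == EXIT_NON_OK:
        return "True"  # problem detected
    return "Unknown"


def describe(condition_type: str, falco_up: bool) -> str:
    """Show what the node would report for a given condition type name."""
    # A health check script exits 0 when Falco responds and 1 otherwise.
    exit_code = EXIT_OK if falco_up else EXIT_NON_OK
    return f"{condition_type}={condition_status(exit_code)}"


print(describe("falco.org/FalcoNotReady", falco_up=True))
print(describe("falco.org/FalcoNotReady", falco_up=False))
# Same script, positively named condition: reads as "ready" while Falco is down.
print(describe("falco.org/FalcoReady", falco_up=False))
```

The last line is the inverted case: with Falco down, the script exits 1 and the condition named `FalcoReady` would be reported as `True`.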
Example output with updated NPD condition
❯ kubectl describe node security-agent-demo-worker
Name: security-agent-demo-worker
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
kubernetes.io/arch=amd64
kubernetes.io/hostname=security-agent-demo-worker
kubernetes.io/os=linux
Annotations: node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 10 Mar 2026 14:47:03 +0530
Taints: <none>
Unschedulable: false
Lease:
HolderIdentity: security-agent-demo-worker
AcquireTime: <unset>
RenewTime: Tue, 10 Mar 2026 14:59:48 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
falco.org/FalcoNotReady False Tue, 10 Mar 2026 14:59:40 +0530 Tue, 10 Mar 2026 14:59:39 +0530 FalcoHealthy Falco security monitoring is functional
MemoryPressure False Tue, 10 Mar 2026 14:58:06 +0530 Tue, 10 Mar 2026 14:47:03 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 10 Mar 2026 14:58:06 +0530 Tue, 10 Mar 2026 14:47:03 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 10 Mar 2026 14:58:06 +0530 Tue, 10 Mar 2026 14:47:03 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 10 Mar 2026 14:58:06 +0530 Tue, 10 Mar 2026 14:47:17 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 172.20.0.3
Hostname: security-agent-demo-worker
Capacity:
cpu: 16
ephemeral-storage: 974453Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 49039448Ki
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 974453Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 49039448Ki
pods: 110
System Info:
Machine ID: 5831b339edfc434791f95c24f8ce8daf
System UUID: c77ada22-4cad-4515-8e9a-2a4204e7af79
Boot ID: 88ce2b03-78b8-4c5e-aef2-f1e6c58edcb9
Kernel Version: 6.18.8-1-default
OS Image: Debian GNU/Linux 12 (bookworm)
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://2.2.0
Kubelet Version: v1.35.0
Kube-Proxy Version:
PodCIDR: 10.244.1.0/24
PodCIDRs: 10.244.1.0/24
ProviderID: kind://docker/security-agent-demo/security-agent-demo-worker
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
falco falco-9n6w8 100m (0%) 1 (6%) 512Mi (1%) 1Gi (2%) 69s
falco node-problem-detector-falco-xdxt5 20m (0%) 100m (0%) 64Mi (0%) 128Mi (0%) 9m43s
kube-system kindnet-cj2cm 100m (0%) 100m (0%) 50Mi (0%) 50Mi (0%) 12m
kube-system kube-proxy-nj887 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 220m (1%) 1200m (7%)
memory 626Mi (1%) 1202Mi (2%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RegisteredNode 12m node-controller Node security-agent-demo-worker event: Registered Node security-agent-demo-worker in Controller
Normal TaintAdopted 8m31s node-readiness-controller Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' is now managed by rule 'security-agent-readiness-rule-npd'
Warning FalcoNotDeployed 104s (x2 over 9m4s) falco-monitor Node condition falco.org/FalcoNotReady is now: True, reason: FalcoNotDeployed, message: "Falco is not deployed or not responding on port 8765"
Normal TaintAdded 103s node-readiness-controller Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' added by rule 'security-agent-readiness-rule-npd'
Normal FalcoHealthy 14s (x2 over 5m4s) falco-monitor Node condition falco.org/FalcoNotReady is now: False, reason: FalcoHealthy, message: "Falco security monitoring is functional"
Normal TaintRemoved 13s (x2 over 5m3s) node-readiness-controller Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' removed by rule 'security-agent-readiness-rule-npd'
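The event sequence above (condition flips to True, taint added; condition flips to False, taint removed) can be read as a simple reconcile decision. The Python sketch below is only an illustration of that reading. It assumes the rule `security-agent-readiness-rule-npd` requires `falco.org/FalcoNotReady` to be `False`, and that a missing condition counts as not ready. The rule spec and the controller's actual logic are not shown in this thread, and `taint_action` is a hypothetical helper.

```python
from typing import Optional


def taint_action(
    condition_status: Optional[str],
    taint_present: bool,
    required_status: str = "False",
) -> str:
    """Decide what to do with the readiness taint on one node.

    condition_status is the node's current status for the watched condition
    ("True", "False", "Unknown"), or None if the condition is absent.
    """
    # Assumption: anything other than the required status means "not ready".
    satisfied = condition_status == required_status
    if satisfied and taint_present:
        return "remove"
    if not satisfied and not taint_present:
        return "add"
    return "none"


# Replay of the transitions visible in the Events section above.
print(taint_action("True", taint_present=False))   # FalcoNotDeployed
print(taint_action("False", taint_present=True))   # FalcoHealthy
print(taint_action("False", taint_present=False))  # steady state
```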
| This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`. | ||
| #### Option A: Using Node Readiness Reporter Sidecar | ||
| The reporter is deployed as a sidecar container in the Falco DaemonSet. This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`. |
| The reporter is deployed as a sidecar container in the Falco DaemonSet. This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`. | |
| The reporter is deployed as a sidecar container in the Falco DaemonSet. This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoNotReady`. |
Oh wait, I think I misunderstood: you have used both the positive and the negative condition across the two variants.
It may be easier for the reader if you just pick one, to avoid confusion.
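For reference, both variants report the same underlying health result, just with opposite polarity. A minimal Python sketch follows, using the health endpoint from the docs above. The helper names `http_ok` and `conditions` and the 2-second timeout are assumptions for illustration, not code from this PR.

```python
import urllib.request

HEALTH_URL = "http://localhost:8765/healthz"  # endpoint named in the docs above


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused, timeouts, and HTTP errors all count as unhealthy.
        return False


def conditions(healthy: bool) -> dict:
    """Express one health result in the polarity used by each variant."""
    return {
        # NRR sidecar variant: positive condition.
        "falco.org/FalcoReady": "True" if healthy else "False",
        # NPD variant: problem-oriented condition.
        "falco.org/FalcoNotReady": "False" if healthy else "True",
    }


if __name__ == "__main__":
    print(conditions(http_ok(HEALTH_URL)))
```

Whichever polarity is picked, the readiness rule has to require the matching status (`True` for `FalcoReady`, `False` for `FalcoNotReady`).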
ajaysundark
left a comment
/lgtm
/approve
mostly looks good to me. some minor comments but not very opinionated.
/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ajaysundark, Priyankasaggu11929
The following changes are in this PR:
This is alongside the existing NRR (node readiness reporter) sidecar approach.
/kind cleanup
/kind documentation
/kind feature
Testing
For local testing, I used the following steps:
Checklist
- `make test` passes
- `make test-e2e` passes
- `make lint` passes
- `make verify` passes