Add NPD (node problem detector) variant for security-agent-readiness example#154

Open
Priyankasaggu11929 wants to merge 3 commits into kubernetes-sigs:main from Priyankasaggu11929:npd-security-agent-example

Conversation

@Priyankasaggu11929
Member

@Priyankasaggu11929 Priyankasaggu11929 commented Mar 7, 2026

This PR makes the following changes:

  • adds an option to use NPD (node problem detector) to probe component status and add a new node status condition in the existing security-agent-readiness (Falco) example.
    This sits alongside the existing NRR (node readiness reporter) sidecar approach.
  • reorganizes examples into variant-specific directories (nrr-variant/ and npd-variant/)
  • fixes issues in the NRR sidecar implementation (taint format, RBAC permissions, etc.)

/kind cleanup
/kind documentation
/kind feature

Testing

For local testing, I used the following steps:

# create kind cluster
kind create cluster --config examples/security-agent-readiness/kind-cluster-config.yaml

# deploy NRC controller  
make docker-build IMG=controller:latest  
kind load docker-image controller:latest --name security-agent-demo  
kubectl apply -f config/crd/bases/  
make deploy IMG=controller:latest

kubectl patch deployment nrr-controller-manager -n nrr-system --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/tolerations",
    "value": [
      {
        "operator": "Exists"
      }
    ]
  }
]'

# for NPD
USE_NPD=true ./examples/security-agent-readiness/setup-falco.sh  
kubectl apply -f examples/security-agent-readiness/npd-variant/security-agent-readiness-rule-npd.yaml  

# add tolerations to the falco daemonset to schedule it on the tainted node
./examples/security-agent-readiness/add-falco-toleration.sh

## with NPD, the following node status condition is added
kubectl get node security-agent-demo-worker -o json | jq .status.conditions 
[
  {
    "lastHeartbeatTime": "2026-03-07T20:14:43Z",
    "lastTransitionTime": "2026-03-07T19:44:39Z",
    "message": "Falco security monitoring is functional",
    "reason": "FalcoHealthy",
    "status": "False",
    "type": "falco.org/FalcoNotReady"
  },
...

---

# for NRR sidecar
USE_NRR=true ./examples/security-agent-readiness/setup-falco.sh
kubectl apply -f examples/security-agent-readiness/nrr-variant/security-agent-readiness-rule.yaml

# add tolerations to the falco daemonset to schedule it on the tainted node
./examples/security-agent-readiness/add-falco-toleration.sh

## with NRR, the following node status condition is added
kubectl get node security-agent-demo-worker -o json | jq .status.conditions
  {
    "lastHeartbeatTime": "2026-03-07T20:19:58Z",
    "lastTransitionTime": "2026-03-07T20:19:58Z",
    "message": "Endpoint reports ready at http://localhost:8765/healthz",
    "reason": "EndpointOK",
    "status": "True",
    "type": "falco.org/FalcoReady"
  }

---

# Once the required node status conditions above are met, the Node Readiness Controller lifts the worker node taints

## for example, NPD:

❯ kubectl describe node security-agent-demo-worker 
Name:               security-agent-demo-worker
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=security-agent-demo-worker
                    kubernetes.io/os=linux
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 10 Mar 2026 14:47:03 +0530
Taints:             <none>
Unschedulable:      false
Lease:
  ...
Conditions:
  Type                      Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                      ------  -----------------                 ------------------                ------                       -------
  falco.org/FalcoNotReady   False   Tue, 10 Mar 2026 14:59:40 +0530   Tue, 10 Mar 2026 14:59:39 +0530   FalcoHealthy                 Falco security monitoring is functional
  MemoryPressure            False   Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:03 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure              False   Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:03 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure               False   Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:03 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                     True    Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:17 +0530   KubeletReady                 kubelet is posting ready status
...
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                 ------------  ----------  ---------------  -------------  ---
  falco                       falco-9n6w8                          100m (0%)     1 (6%)      512Mi (1%)       1Gi (2%)       69s
  falco                       node-problem-detector-falco-xdxt5    20m (0%)      100m (0%)   64Mi (0%)        128Mi (0%)     9m43s
  kube-system                 kindnet-cj2cm                        100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      12m
  kube-system                 kube-proxy-nj887                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12m
Allocated resources:
 ...
Events:
  Type     Reason            Age                  From                       Message
  ----     ------            ----                 ----                       -------
  Normal   RegisteredNode    12m                  node-controller            Node security-agent-demo-worker event: Registered Node security-agent-demo-worker in Controller
  Normal   TaintAdopted      8m31s                node-readiness-controller  Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' is now managed by rule 'security-agent-readiness-rule-npd'
  Warning  FalcoNotDeployed  104s (x2 over 9m4s)  falco-monitor              Node condition falco.org/FalcoNotReady is now: True, reason: FalcoNotDeployed, message: "Falco is not deployed or not responding on port 8765"
  Normal   TaintAdded        103s                 node-readiness-controller  Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' added by rule 'security-agent-readiness-rule-npd'
  Normal   FalcoHealthy      14s (x2 over 5m4s)   falco-monitor              Node condition falco.org/FalcoNotReady is now: False, reason: FalcoHealthy, message: "Falco security monitoring is functional"
  Normal   TaintRemoved      13s (x2 over 5m3s)   node-readiness-controller  Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' removed by rule 'security-agent-readiness-rule-npd'

Checklist

  • make test passes
  • make test-e2e passes
  • make lint passes
  • make verify passes
NONE

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 7, 2026
@netlify

netlify bot commented Mar 7, 2026

Deploy Preview for node-readiness-controller ready!

Name Link
🔨 Latest commit 29669b6
🔍 Latest deploy log https://app.netlify.com/projects/node-readiness-controller/deploys/69afe5e90e6cb80008427189
😎 Deploy Preview https://deploy-preview-154--node-readiness-controller.netlify.app

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 7, 2026
@Priyankasaggu11929 Priyankasaggu11929 force-pushed the npd-security-agent-example branch from f1dd942 to 2f1c2d1 Compare March 7, 2026 22:02

We can use the Node Readiness Controller to enforce a security readiness guardrail:
- 1. **Taint** the node with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule` as soon as it joins the cluster.
+ 1. **Taint** the node with a [startup taint](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/) `readiness.k8s.io/security-agent-ready=pending:NoSchedule` as soon as it joins the cluster.
Member Author

readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule is an invalid taint format

❯ kubectl taint nodes security-agent-demo-worker readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule
error: invalid taint spec: readiness.k8s.io/falco.org/security-agent-ready=pending:NoSchedule, a qualified name must consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyName',  or 'my.name',  or '123-abc', regex used for validation is '([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]') with an optional DNS subdomain prefix and '/' (e.g. 'example.com/MyName')
See 'kubectl taint -h' for help and examples

Proposed a fix: #155
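
For reference, the rule in the error message can be approximated locally before a taint is applied. The following is a minimal sketch, not kubectl's own validation code: `is_valid_taint_key` is a hypothetical helper, the name regex is copied from the error output above, and the DNS subdomain prefix regex and the 63/253 character limits are assumptions based on Kubernetes' usual qualified-name rules.

```shell
#!/bin/sh
# Sketch: approximate qualified-name check for a taint key.
# is_valid_taint_key is a hypothetical helper, not part of kubectl.

# Name part: regex taken from the kubectl error message above.
NAME_RE='^([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$'
# Optional prefix: assumed to be a lowercase DNS subdomain.
PREFIX_RE='^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'

is_valid_taint_key() {
  key=$1
  case $key in
    */*/*)
      # More than one '/' can never be "prefix/name".
      return 1 ;;
    */*)
      prefix=${key%%/*}
      name=${key#*/}
      [ -n "$prefix" ] || return 1
      [ "${#prefix}" -le 253 ] || return 1
      printf '%s\n' "$prefix" | grep -Eq "$PREFIX_RE" || return 1 ;;
    *)
      name=$key ;;
  esac
  [ -n "$name" ] && [ "${#name}" -le 63 ] || return 1
  printf '%s\n' "$name" | grep -Eq "$NAME_RE"
}

# Usage: compare the old and the fixed taint keys from this PR.
for key in \
  "readiness.k8s.io/security-agent-ready" \
  "readiness.k8s.io/falco.org/security-agent-ready"; do
  if is_valid_taint_key "$key"; then
    echo "valid:   $key"
  else
    echo "invalid: $key"
  fi
done
```

The second key is rejected only because of the extra `/`; each segment on its own would pass.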

Contributor

Interesting, I didn't know there was a limitation allowing only one domain. Thanks for flagging this!

It may be beneficial for our use cases to support separate "subdomains", though. :(

We could follow up on this requirement later.

Member Author

> It may be beneficial for our use cases to support separate "subdomains", though. :(

So, the problem is the two slashes (/).

One way to support subdomains could be <component>.readiness.k8s.io/security-agent-ready (with CEL support, it might work).

But yes, we'll discuss it in a follow-up.

@Priyankasaggu11929 Priyankasaggu11929 force-pushed the npd-security-agent-example branch from 2f1c2d1 to 3246330 Compare March 9, 2026 11:57
Contributor

@ajaysundark ajaysundark left a comment

/lgtm

Nice improvements. Thanks for taking a deeper look into this. Consider some suggestions on prefixed conditions, otherwise good to merge.

"source": "falco-monitor",
"conditions": [
{
"type": "FalcoProblem",
Contributor

Your documentation uses a different condition ('falco.org/FalcoReady') than the one used here.

I prefer the former, as including the domain name in the node condition also makes the ownership clear.

Member Author

Updated to use the type falco.org/FalcoNotReady (with the domain part) in the latest commit refresh.

I can't use FalcoReady because NPD treats all conditions as problem-oriented (which means exit 0 -> condition=False), so using FalcoReady would result in inverted events (FalcoReady=True when Falco is not up).

The NRR (reporter sidecar) variant still uses falco.org/FalcoReady.
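
To make the exit-code polarity concrete, here is a minimal sketch of what a problem-oriented check could look like. It is not the script shipped in this PR: `check_falco` and `condition_status` are hypothetical names, the health URL and messages are taken from the outputs in this thread, and treating any exit code other than 0 or 1 as Unknown is an assumption.

```shell
#!/bin/sh
# Sketch of an NPD-style custom plugin check (hypothetical, for illustration).
# Exit codes: 0 = no problem, 1 = problem detected, 2 = could not determine.
OK=0
NONOK=1
UNKNOWN=2

# Probe Falco's health endpoint; the message on stdout becomes the condition message.
check_falco() {
  url=${1:-http://localhost:8765/healthz}
  command -v curl >/dev/null 2>&1 || { echo "curl not available"; return "$UNKNOWN"; }
  if curl -sf --max-time 5 "$url" >/dev/null 2>&1; then
    echo "Falco security monitoring is functional"
    return "$OK"
  fi
  echo "Falco is not deployed or not responding on port 8765"
  return "$NONOK"
}

# How a problem-oriented condition is derived from the exit code:
# "no problem" (0) means the condition is False, "problem" (1) means True.
condition_status() {
  case $1 in
    0) echo "False" ;;
    1) echo "True" ;;
    *) echo "Unknown" ;;  # assumption: anything else is reported as Unknown
  esac
}

# Usage: show why the condition has to be named FalcoNotReady.
for code in 0 1; do
  echo "exit $code -> falco.org/FalcoNotReady=$(condition_status "$code")"
done
```

With this mapping, a healthy Falco (exit 0) yields `falco.org/FalcoNotReady=False`, which matches the node output above; a condition named `FalcoReady` would read False in the healthy case.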

Example output with updated NPD condition

❯ kubectl describe node security-agent-demo-worker 
Name:               security-agent-demo-worker
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=security-agent-demo-worker
                    kubernetes.io/os=linux
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 10 Mar 2026 14:47:03 +0530
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  security-agent-demo-worker
  AcquireTime:     <unset>
  RenewTime:       Tue, 10 Mar 2026 14:59:48 +0530
Conditions:
  Type                      Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                      ------  -----------------                 ------------------                ------                       -------
  falco.org/FalcoNotReady   False   Tue, 10 Mar 2026 14:59:40 +0530   Tue, 10 Mar 2026 14:59:39 +0530   FalcoHealthy                 Falco security monitoring is functional
  MemoryPressure            False   Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:03 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure              False   Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:03 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure               False   Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:03 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                     True    Tue, 10 Mar 2026 14:58:06 +0530   Tue, 10 Mar 2026 14:47:17 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:  172.20.0.3
  Hostname:    security-agent-demo-worker
Capacity:
  cpu:                16
  ephemeral-storage:  974453Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             49039448Ki
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  974453Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             49039448Ki
  pods:               110
System Info:
  Machine ID:                 5831b339edfc434791f95c24f8ce8daf
  System UUID:                c77ada22-4cad-4515-8e9a-2a4204e7af79
  Boot ID:                    88ce2b03-78b8-4c5e-aef2-f1e6c58edcb9
  Kernel Version:             6.18.8-1-default
  OS Image:                   Debian GNU/Linux 12 (bookworm)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://2.2.0
  Kubelet Version:            v1.35.0
  Kube-Proxy Version:         
PodCIDR:                      10.244.1.0/24
PodCIDRs:                     10.244.1.0/24
ProviderID:                   kind://docker/security-agent-demo/security-agent-demo-worker
Non-terminated Pods:          (4 in total)
  Namespace                   Name                                 CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                 ------------  ----------  ---------------  -------------  ---
  falco                       falco-9n6w8                          100m (0%)     1 (6%)      512Mi (1%)       1Gi (2%)       69s
  falco                       node-problem-detector-falco-xdxt5    20m (0%)      100m (0%)   64Mi (0%)        128Mi (0%)     9m43s
  kube-system                 kindnet-cj2cm                        100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      12m
  kube-system                 kube-proxy-nj887                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         12m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                220m (1%)   1200m (7%)
  memory             626Mi (1%)  1202Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:
  Type     Reason            Age                  From                       Message
  ----     ------            ----                 ----                       -------
  Normal   RegisteredNode    12m                  node-controller            Node security-agent-demo-worker event: Registered Node security-agent-demo-worker in Controller
  Normal   TaintAdopted      8m31s                node-readiness-controller  Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' is now managed by rule 'security-agent-readiness-rule-npd'
  Warning  FalcoNotDeployed  104s (x2 over 9m4s)  falco-monitor              Node condition falco.org/FalcoNotReady is now: True, reason: FalcoNotDeployed, message: "Falco is not deployed or not responding on port 8765"
  Normal   TaintAdded        103s                 node-readiness-controller  Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' added by rule 'security-agent-readiness-rule-npd'
  Normal   FalcoHealthy      14s (x2 over 5m4s)   falco-monitor              Node condition falco.org/FalcoNotReady is now: False, reason: FalcoHealthy, message: "Falco security monitoring is functional"
  Normal   TaintRemoved      13s (x2 over 5m3s)   node-readiness-controller  Taint 'readiness.k8s.io/security-agent-ready:NoSchedule' removed by rule 'security-agent-readiness-rule-npd'

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 9, 2026
@Priyankasaggu11929 Priyankasaggu11929 force-pushed the npd-security-agent-example branch from 3246330 to 04a2c07 Compare March 10, 2026 09:30
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 10, 2026
@Priyankasaggu11929 Priyankasaggu11929 force-pushed the npd-security-agent-example branch from 04a2c07 to 29669b6 Compare March 10, 2026 09:35
- This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`.
+ #### Option A: Using Node Readiness Reporter Sidecar
+
+ The reporter is deployed as a sidecar container in the Falco DaemonSet. This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`.
Contributor

Suggested change
- The reporter is deployed as a sidecar container in the Falco DaemonSet. This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoReady`.
+ The reporter is deployed as a sidecar container in the Falco DaemonSet. This sidecar periodically checks Falco's local health endpoint (`http://localhost:8765/healthz`) and updates a Node Condition `falco.org/FalcoNotReady`.

Contributor

Oh wait, I think I misunderstood: you have used both the positive and the negative case across the two variants.
It may be easier for the reader if you just pick one, to avoid confusion.
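
Because the two variants report opposite polarities, anything that consumes the conditions has to read them differently. A minimal sketch of that mapping follows; `falco_is_ready` and `node_falco_conditions` are hypothetical helper names, and the condition types and statuses are the ones shown in the outputs earlier in this thread.

```shell
#!/bin/sh
# falco_is_ready TYPE STATUS
# Returns 0 when the given condition means "Falco is up", 1 otherwise.
falco_is_ready() {
  case "$1=$2" in
    falco.org/FalcoReady=True)     return 0 ;;  # NRR variant: positive condition
    falco.org/FalcoNotReady=False) return 0 ;;  # NPD variant: problem-oriented condition
    *)                             return 1 ;;
  esac
}

# Print "TYPE STATUS" for every falco.org/ condition on a node.
# Requires kubectl and jq; not invoked by the usage loop below.
node_falco_conditions() {
  kubectl get node "$1" -o json |
    jq -r '.status.conditions[] | select(.type | startswith("falco.org/")) | "\(.type) \(.status)"'
}

# Usage: the same healthy Falco shows up as True in one variant and False in the other.
for pair in "falco.org/FalcoReady True" "falco.org/FalcoNotReady False" "falco.org/FalcoNotReady True"; do
  set -- $pair
  if falco_is_ready "$1" "$2"; then
    echo "$1=$2 -> ready"
  else
    echo "$1=$2 -> not ready"
  fi
done
```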

Contributor

@ajaysundark ajaysundark left a comment

/lgtm
/approve

Mostly looks good to me. I left some minor comments, but I'm not very opinionated about them.
/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 12, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ajaysundark, Priyankasaggu11929

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 12, 2026

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

3 participants