Skip to content

feat(metrics): authenticate /metrics endpoint (bearer token + optional mTLS)#77

Open
abhinavuser wants to merge 8 commits into
optiqor:mainfrom
abhinavuser:feat/metrics-bearer-auth
Open

feat(metrics): authenticate /metrics endpoint (bearer token + optional mTLS)#77
abhinavuser wants to merge 8 commits into
optiqor:mainfrom
abhinavuser:feat/metrics-bearer-auth

Conversation

@abhinavuser
Copy link
Copy Markdown

@abhinavuser abhinavuser commented May 15, 2026

closes #41

implements production-grade auth modes for the /metrics endpoint. defaults to none to preserve backward compatibility.

features:

  • bearer token mode: constant-time comparison to prevent timing side channels.
    • sighup token rotation: bearer token file is watched and reloaded on sighup without dropping inflight requests.
    • mtls mode: config builder for mutual tls with tls 1.3 minimum enforcement.
    • integration: wired seamlessly into config.go and start.go.
    • helm chart: added prometheus auth configuration for token secrets, tls secrets, and ServiceMonitor examples.
    • /healthz and /readyz remain fully unauthenticated.
      added test coverage for bearer token modes, sighup hot-reload, empty file handling, and pass-through.

Signed-off-by: abhinavuser abhinavkumarv2101@gmail.com

implements production-grade auth modes for the /metrics endpoint as requested in optiqor#41.
defaults to 'none' to preserve backwards compatibility.

features:
- bearer token mode: constant-time comparison to prevent timing side channels.
- sighup token rotation: bearer token file is watched and reloaded on sighup without dropping inflight requests.
- mtls mode: config builder for mutual tls with tls 1.3 minimum enforcement.
- prometheus auth config: wired seamlessly into config.go and start.go
- /healthz and /readyz remain fully unauthenticated.

added test coverage for bearer token modes, sighup hot-reload, and pass-through.

closes optiqor#41

Signed-off-by: abhinavuser <abhinavuser@users.noreply.github.com>
Signed-off-by: abhinavuser <abhinavuser@users.noreply.github.com>
@abhinavuser abhinavuser requested a review from btwshivam as a code owner May 15, 2026 16:00
@github-actions github-actions Bot added testing Tests and test coverage area/k8s Kubernetes integration area/security Security and supply chain area/ops Operations, deployment, runtime ergonomics labels May 15, 2026
@github-actions
Copy link
Copy Markdown

🚀 First PR — welcome aboard!

A few things to expect:

  1. CI: every PR runs build + race tests + lint + (eventually) the kernel matrix. If something fails, the log will tell you exactly which gate.
  2. DCO: every commit needs Signed-off-by:git commit -s adds it automatically.
  3. Conventional Commits: PR titles like feat(doctor): add new rule or fix(bpf): handle X. We squash-merge by default.
  4. Review: a maintainer will review within 72 hours. Suggestions are conversations, not orders — push back if something doesn't fit your context.

If you get stuck, reply here or jump to Discussions. We want this PR to land.

@abhinavuser abhinavuser changed the title feat(security): authenticate /metrics endpoint (bearer token + optional mTLS) feat(metrics): authenticate /metrics endpoint (bearer token + optional mTLS) May 15, 2026
Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mtls mode breaks kubelet probes. /healthz and /readyz share a mux with /metrics, so ListenAndServeTLS puts the health endpoints behind RequireAndVerifyClientCert and the chart's httpGet probes will fail every interval.

Comment thread internal/cli/start.go
var err error
if tlsCfg != nil {
// Certs are already loaded into TLSConfig
err = httpServer.ListenAndServeTLS("", "")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mtls puts the entire listener behind TLS plus client-cert-required, but /healthz and /readyz at lines 157-158 are on the same mux. kubelet probes hit /healthz over plain HTTP without a client cert (the chart's livenessProbe at deploy/helm/kerno/templates/daemonset.yaml uses httpGet:), so once mtls is on every probe fails and the pod restarts every 30s. either split health onto a second plain-HTTP listener, or scope tls to just the /metrics mux. the bearer path got this right (only wraps /metrics), mtls broke the pattern.

Comment thread internal/auth/mtls.go
@@ -0,0 +1,65 @@
// Copyright 2026 Optiqor contributors
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bearer.go has six unit tests, mtls.go has zero. at minimum cover the four branches: empty certFile returns nil, certFile without keyFile errors, cert+key without CA produces one-way TLS, cert+key+CA sets RequireAndVerifyClientCert. these are pure-function checks, no real cert generation needed beyond httptest-style helpers.

Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ci's red because the reload test rewrites a 0o400 file (permission-denied) and two lint issues remain. the mtls path also has no tests.

}

// Rotate the token on disk and reload.
if err := os.WriteFile(path, []byte("new-token\n"), 0o400); err != nil {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this rewrite fails with permission-denied because writeTokenFile at line 25 creates the file 0o400 (read-only) and you're writing back to the same path. switch the initial mode to 0o600, or os.Chmod(path, 0o600) before this line. that's the only test failure in the ci run.

Comment thread internal/cli/start.go Outdated
mux.HandleFunc("/readyz", healthzHandler(loadedCount, len(loaders)))
mux.Handle("/metrics", promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{}))

var metricsHandler http.Handler = promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{})
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

staticcheck QF1011: the explicit http.Handler type is redundant since promhttp.HandlerFor already returns http.Handler and reassigning to guard.Wrap(metricsHandler) keeps that type. drop the type annotation:

metricsHandler := promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{})

Comment thread internal/auth/mtls.go
// loadCertPool reads a PEM-encoded CA certificate bundle and returns a cert
// pool suitable for use as tls.Config.ClientCAs.
func loadCertPool(path string) (*x509.CertPool, error) {
pem, err := os.ReadFile(path)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gosec G304: the path comes from operator config not user input, so it's a false positive here. same flag fires on bearer.go load() for the same reason. add // #nosec G304 -- path comes from operator config on both call sites, or wrap reads in a helper that asserts the path lives under a known config dir. either's fine, but lint needs to be green.

Comment thread internal/auth/mtls.go
//
// Returns nil when certFile is empty, indicating that plain HTTP should be
// used (i.e. auth.mode is "none" or "bearer" without TLS).
func TLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no test coverage for the mtls path. bearer is well-tested in bearer_test.go, but TLSConfig has none. at minimum: a test that verifies RequireAndVerifyClientCert is set when caFile is non-empty, and a negative test that loadCertPool fails on a non-PEM file. with how security-sensitive this is, an end-to-end test using httptest.NewTLSServer plus a client cert that validates against the configured CA would be ideal.

This addresses the review by splitting the health probe listener onto port 9092, adding mTLS unit and end-to-end tests, fixing bearer tests permission issues, and removing gosec/staticcheck lint warnings.

Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
@github-actions github-actions Bot added the level:critical Touches BPF, security, or release surfaces (auto-applied) label May 17, 2026
This fixes bodyclose, noctx, gofmt, and IP SAN missing from the test certificate generated in mtls_test.go.

Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
@abhinavuser
Copy link
Copy Markdown
Author

hey @btwshivam , sorry for the noise with the multiple PRs, i've closed out the logging one and i'm pausing work on the sinks PR until this one is clean and merged

i've pushed a new commit that addresses all your feedback:

  1. health probes: split the HTTP servers. /healthz and /readyz now live on a dedicated plain-HTTP listener (default :9092), and the Helm chart livenessProbe/readinessProbe have been updated to point to this new health port. mTLS now exclusively wraps the /metrics listener
    2.mtls tests: added mtls_test.go covering all 4 pure-function branches, plus an end-to-end httptest.NewUnstartedServer test that validates client cert enforcement
  2. ci test fix: updated writeTokenFile in bearer_test.go to use 0o600 so the reload test can overwrite it without permission errors
  3. linters: dropped the redundant http.Handler type to fix QF1011, and added the #nosec G304 directives to fix the gosec false positives. (Also fixed a few minor test linting flakes)

let me know if you need anythin or if there s any issues

@abhinavuser abhinavuser requested a review from btwshivam May 17, 2026 04:11
@btwshivam
Copy link
Copy Markdown
Member

ill review them shortly

Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two blockers: mode mtls without ca_cert_file silently drops client verification, and the helm auth config only wires the prometheus side so the kerno container never enforces it. details inline.

Comment thread internal/config/config.go
if c.Prometheus.Auth.Mode == "bearer" && c.Prometheus.Auth.TokenFile == "" {
return fmt.Errorf("prometheus.auth.token_file is required when auth mode is bearer")
}
if c.Prometheus.Auth.Mode == "mtls" && (c.Prometheus.Auth.CertFile == "" || c.Prometheus.Auth.KeyFile == "") {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode mtls requires cert_file and key_file here but not ca_cert_file. TLSConfig only sets RequireAndVerifyClientCert when caFile is non-empty (mtls.go), so mode: mtls with no ca_cert_file comes up as one-way TLS and clients are never asked for a cert. that's a server advertising mtls while not enforcing it, which is worse than failing closed. require ca_cert_file here when mode is mtls.

interval: {{ .Values.serviceMonitor.interval }}
path: /metrics
{{- if eq .Values.prometheus.auth.mode "bearer" }}
authorization:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this configures Prometheus to send a bearer token, but nothing configures kerno to require one. the daemonset still runs start --prometheus-addr=... with no auth.mode, no token_file, and no secret mounted, so the container stays on the default mode: none.

net effect: auth.mode: bearer leaves /metrics unauthenticated while Prometheus sends a token nobody checks, and auth.mode: mtls breaks scraping outright since Prometheus does TLS but kerno serves plain http. the kerno side needs the same values: pass auth.mode plus the token/cert paths to the container, and mount the token/tls secret as a volume.

Comment thread internal/cli/start.go Outdated
healthAddr = ":9092"
}

healthServer = &http.Server{
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the health server moves to :9092 unconditionally, including mode: none. anything hitting :9090/healthz today (systemd unit, external LB check, the old default) loses it after upgrade, even though the PR says none preserves backward compatibility. the split only matters once metrics gets tls. for none and bearer, keep /healthz and /readyz on the metrics addr, or gate the second listener on mtls.

Comment thread internal/cli/start.go
metricsMux := http.NewServeMux()
metricsHandler := promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{})

if cfg.Prometheus.Auth.Mode == "bearer" {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bearer mode never sets up tls, tlsCfg stays nil unless mode is mtls, so the token travels in cleartext over plain http. on a hostNetwork daemonset that's the node wire. TLSConfig already supports cert+key without a CA (one-way tls), but there's no way to reach it in bearer mode. either let bearer run over tls too, or document the cleartext exposure.

@btwshivam
Copy link
Copy Markdown
Member

good work you are going in ryt direction.. just some blockers.. fix ill re-review

- config: validate ca_cert_file for mtls mode to prevent silent fallback to one-way TLS
- helm: wire prometheus auth mode and secret mounts to kerno daemonset so the container actually enforces auth
- cli/start: only split health endpoints to a separate listener in mtls mode (preserves backward compat for none/bearer)
- cli/start: allow bearer mode to optionally pick up TLS if cert_file and key_file are provided, preventing cleartext token transmission

Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label May 31, 2026
@abhinavuser
Copy link
Copy Markdown
Author

hey @btwshivam i ve pushed a new commit that addresses all four blockers

  1. Config Validation: config.Validate() now strictly requires ca_cert_file when mode is mtls to prevent the silent one way TLS fallback
  2. Helm Daemonset: wired up the auth environment variables (KERNO_PROMETHEUS_AUTH_MODE, etc.) and the secret volumeMounts in daemonset.yaml so the container actually enforces the auth that ServiceMonitor configures
  3. Health Probes: only split /healthz and /readyz onto the :9092 listener when mode is mtls. For none and bearer, they remain on the metrics mux to preserve backward compatibility
  4. Bearer TLS: bearer mode now builds a tls.Config if cert_file and key_file are provided, preventin the token from travellin in cleartext
    lmk if this looks good to merge

@abhinavuser abhinavuser requested a review from btwshivam May 31, 2026 12:13
Copy link
Copy Markdown
Member

@btwshivam btwshivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good progress, both code blockers are fixed: mtls now requires ca_cert_file, and the health server only splits off in mtls mode. but the helm auth wireup routes config through KERNO_* env vars, which viper doesn't actually read (#112), so the container still comes up mode=none and /metrics stays unauthenticated. and a stray pr_body.md from a different PR slipped in.

- name: KERNO_LOG_FORMAT
value: {{ .Values.log.format }}
- name: KERNO_PROMETHEUS_AUTH_MODE
value: {{ .Values.prometheus.auth.mode | quote }}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the right shape, but it won't take effect. KERNO_PROMETHEUS_AUTH_MODE (and the token/cert file env vars below) go through viper's AutomaticEnv, and per #112 viper never reads those because initConfig registers no defaults or BindEnv keys, so Unmarshal has nothing to populate from env. net result is the container still runs auth.mode=none: bearer leaves /metrics unauthenticated while prometheus sends a token nobody checks, and mtls breaks scraping. the secrets are mounted as files (good), so the robust fix is to render the auth settings into a ConfigMap mounted at /etc/kerno/config.yaml, which viper does read via ReadInConfig, rather than KERNO_* env. otherwise this depends on #112 landing first, please verify end to end before merge since a silently-unenforced auth is worse than no feature.

Comment thread pr_body.md
@@ -0,0 +1,16 @@
fixes #29
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file doesn't belong in the PR, and it's not even this PR, the content is for #29 (webhook sinks, --sink flags). looks like it got committed from another branch. git rm pr_body.md and drop it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/k8s Kubernetes integration area/ops Operations, deployment, runtime ergonomics area/security Security and supply chain documentation Improvements or additions to documentation level:critical Touches BPF, security, or release surfaces (auto-applied) testing Tests and test coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Security: authenticate /metrics endpoint (bearer token + optional mTLS)

2 participants