feat(metrics): authenticate /metrics endpoint (bearer token + optional mTLS)#77
feat(metrics): authenticate /metrics endpoint (bearer token + optional mTLS)#77abhinavuser wants to merge 8 commits into
Conversation
implements production-grade auth modes for the /metrics endpoint as requested in optiqor#41. defaults to 'none' to preserve backwards compatibility. features: - bearer token mode: constant-time comparison to prevent timing side channels. - sighup token rotation: bearer token file is watched and reloaded on sighup without dropping inflight requests. - mtls mode: config builder for mutual tls with tls 1.3 minimum enforcement. - prometheus auth config: wired seamlessly into config.go and start.go - /healthz and /readyz remain fully unauthenticated. added test coverage for bearer token modes, sighup hot-reload, and pass-through. closes optiqor#41 Signed-off-by: abhinavuser <abhinavuser@users.noreply.github.com>
Signed-off-by: abhinavuser <abhinavuser@users.noreply.github.com>
|
🚀 First PR — welcome aboard! A few things to expect:
If you get stuck, reply here or jump to Discussions. We want this PR to land. |
btwshivam
left a comment
There was a problem hiding this comment.
mtls mode breaks kubelet probes. /healthz and /readyz share a mux with /metrics, so ListenAndServeTLS puts the health endpoints behind RequireAndVerifyClientCert and the chart's httpGet probes will fail every interval.
| var err error | ||
| if tlsCfg != nil { | ||
| // Certs are already loaded into TLSConfig | ||
| err = httpServer.ListenAndServeTLS("", "") |
There was a problem hiding this comment.
mtls puts the entire listener behind TLS plus client-cert-required, but /healthz and /readyz at lines 157-158 are on the same mux. kubelet probes hit /healthz over plain HTTP without a client cert (the chart's livenessProbe at deploy/helm/kerno/templates/daemonset.yaml uses httpGet:), so once mtls is on every probe fails and the pod restarts every 30s. either split health onto a second plain-HTTP listener, or scope tls to just the /metrics mux. the bearer path got this right (only wraps /metrics), mtls broke the pattern.
| @@ -0,0 +1,65 @@ | |||
| // Copyright 2026 Optiqor contributors | |||
There was a problem hiding this comment.
bearer.go has six unit tests, mtls.go has zero. at minimum cover the four branches: empty certFile returns nil, certFile without keyFile errors, cert+key without CA produces one-way TLS, cert+key+CA sets RequireAndVerifyClientCert. these are pure-function checks, no real cert generation needed beyond httptest-style helpers.
btwshivam
left a comment
There was a problem hiding this comment.
ci's red because the reload test rewrites a 0o400 file (permission-denied) and two lint issues remain. the mtls path also has no tests.
| } | ||
|
|
||
| // Rotate the token on disk and reload. | ||
| if err := os.WriteFile(path, []byte("new-token\n"), 0o400); err != nil { |
There was a problem hiding this comment.
this rewrite fails with permission-denied because writeTokenFile at line 25 creates the file 0o400 (read-only) and you're writing back to the same path. switch the initial mode to 0o600, or os.Chmod(path, 0o600) before this line. that's the only test failure in the ci run.
| mux.HandleFunc("/readyz", healthzHandler(loadedCount, len(loaders))) | ||
| mux.Handle("/metrics", promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{})) | ||
|
|
||
| var metricsHandler http.Handler = promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{}) |
There was a problem hiding this comment.
staticcheck QF1011: the explicit http.Handler type is redundant since promhttp.HandlerFor already returns http.Handler and reassigning to guard.Wrap(metricsHandler) keeps that type. drop the type annotation:
metricsHandler := promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{})| // loadCertPool reads a PEM-encoded CA certificate bundle and returns a cert | ||
| // pool suitable for use as tls.Config.ClientCAs. | ||
| func loadCertPool(path string) (*x509.CertPool, error) { | ||
| pem, err := os.ReadFile(path) |
There was a problem hiding this comment.
gosec G304: the path comes from operator config not user input, so it's a false positive here. same flag fires on bearer.go load() for the same reason. add // #nosec G304 -- path comes from operator config on both call sites, or wrap reads in a helper that asserts the path lives under a known config dir. either's fine, but lint needs to be green.
| // | ||
| // Returns nil when certFile is empty, indicating that plain HTTP should be | ||
| // used (i.e. auth.mode is "none" or "bearer" without TLS). | ||
| func TLSConfig(certFile, keyFile, caFile string) (*tls.Config, error) { |
There was a problem hiding this comment.
no test coverage for the mtls path. bearer is well-tested in bearer_test.go, but TLSConfig has none. at minimum: a test that verifies RequireAndVerifyClientCert is set when caFile is non-empty, and a negative test that loadCertPool fails on a non-PEM file. with how security-sensitive this is, an end-to-end test using httptest.NewTLSServer plus a client cert that validates against the configured CA would be ideal.
This addresses the review by splitting the health probe listener onto port 9092, adding mTLS unit and end-to-end tests, fixing bearer tests permission issues, and removing gosec/staticcheck lint warnings. Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
This fixes bodyclose, noctx, gofmt, and IP SAN missing from the test certificate generated in mtls_test.go. Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
|
hey @btwshivam , sorry for the noise with the multiple PRs, i've closed out the logging one and i'm pausing work on the sinks PR until this one is clean and merged i've pushed a new commit that addresses all your feedback:
let me know if you need anythin or if there s any issues |
|
ill review them shortly |
btwshivam
left a comment
There was a problem hiding this comment.
two blockers: mode mtls without ca_cert_file silently drops client verification, and the helm auth config only wires the prometheus side so the kerno container never enforces it. details inline.
| if c.Prometheus.Auth.Mode == "bearer" && c.Prometheus.Auth.TokenFile == "" { | ||
| return fmt.Errorf("prometheus.auth.token_file is required when auth mode is bearer") | ||
| } | ||
| if c.Prometheus.Auth.Mode == "mtls" && (c.Prometheus.Auth.CertFile == "" || c.Prometheus.Auth.KeyFile == "") { |
There was a problem hiding this comment.
mode mtls requires cert_file and key_file here but not ca_cert_file. TLSConfig only sets RequireAndVerifyClientCert when caFile is non-empty (mtls.go), so mode: mtls with no ca_cert_file comes up as one-way TLS and clients are never asked for a cert. that's a server advertising mtls while not enforcing it, which is worse than failing closed. require ca_cert_file here when mode is mtls.
| interval: {{ .Values.serviceMonitor.interval }} | ||
| path: /metrics | ||
| {{- if eq .Values.prometheus.auth.mode "bearer" }} | ||
| authorization: |
There was a problem hiding this comment.
this configures Prometheus to send a bearer token, but nothing configures kerno to require one. the daemonset still runs start --prometheus-addr=... with no auth.mode, no token_file, and no secret mounted, so the container stays on the default mode: none.
net effect: auth.mode: bearer leaves /metrics unauthenticated while Prometheus sends a token nobody checks, and auth.mode: mtls breaks scraping outright since Prometheus does TLS but kerno serves plain http. the kerno side needs the same values: pass auth.mode plus the token/cert paths to the container, and mount the token/tls secret as a volume.
| healthAddr = ":9092" | ||
| } | ||
|
|
||
| healthServer = &http.Server{ |
There was a problem hiding this comment.
the health server moves to :9092 unconditionally, including mode: none. anything hitting :9090/healthz today (systemd unit, external LB check, the old default) loses it after upgrade, even though the PR says none preserves backward compatibility. the split only matters once metrics gets tls. for none and bearer, keep /healthz and /readyz on the metrics addr, or gate the second listener on mtls.
| metricsMux := http.NewServeMux() | ||
| metricsHandler := promhttp.HandlerFor(metrics.Registry, promhttp.HandlerOpts{}) | ||
|
|
||
| if cfg.Prometheus.Auth.Mode == "bearer" { |
There was a problem hiding this comment.
bearer mode never sets up tls, tlsCfg stays nil unless mode is mtls, so the token travels in cleartext over plain http. on a hostNetwork daemonset that's the node wire. TLSConfig already supports cert+key without a CA (one-way tls), but there's no way to reach it in bearer mode. either let bearer run over tls too, or document the cleartext exposure.
|
good work you are going in ryt direction.. just some blockers.. fix ill re-review |
- config: validate ca_cert_file for mtls mode to prevent silent fallback to one-way TLS - helm: wire prometheus auth mode and secret mounts to kerno daemonset so the container actually enforces auth - cli/start: only split health endpoints to a separate listener in mtls mode (preserves backward compat for none/bearer) - cli/start: allow bearer mode to optionally pick up TLS if cert_file and key_file are provided, preventing cleartext token transmission Signed-off-by: abhinavuser <abhinavkumarv2101@gmail.com>
|
hey @btwshivam i ve pushed a new commit that addresses all four blockers
|
btwshivam
left a comment
There was a problem hiding this comment.
good progress, both code blockers are fixed: mtls now requires ca_cert_file, and the health server only splits off in mtls mode. but the helm auth wireup routes config through KERNO_* env vars, which viper doesn't actually read (#112), so the container still comes up mode=none and /metrics stays unauthenticated. and a stray pr_body.md from a different PR slipped in.
| - name: KERNO_LOG_FORMAT | ||
| value: {{ .Values.log.format }} | ||
| - name: KERNO_PROMETHEUS_AUTH_MODE | ||
| value: {{ .Values.prometheus.auth.mode | quote }} |
There was a problem hiding this comment.
this is the right shape, but it won't take effect. KERNO_PROMETHEUS_AUTH_MODE (and the token/cert file env vars below) go through viper's AutomaticEnv, and per #112 viper never reads those because initConfig registers no defaults or BindEnv keys, so Unmarshal has nothing to populate from env. net result is the container still runs auth.mode=none: bearer leaves /metrics unauthenticated while prometheus sends a token nobody checks, and mtls breaks scraping. the secrets are mounted as files (good), so the robust fix is to render the auth settings into a ConfigMap mounted at /etc/kerno/config.yaml, which viper does read via ReadInConfig, rather than KERNO_* env. otherwise this depends on #112 landing first, please verify end to end before merge since a silently-unenforced auth is worse than no feature.
| @@ -0,0 +1,16 @@ | |||
| fixes #29 | |||
There was a problem hiding this comment.
this file doesn't belong in the PR, and it's not even this PR, the content is for #29 (webhook sinks, --sink flags). looks like it got committed from another branch. git rm pr_body.md and drop it.
closes #41
implements production-grade auth modes for the
/metricsendpoint. defaults tononeto preserve backward compatibility.features:
config.goandstart.go./healthzand/readyzremain fully unauthenticated.added test coverage for bearer token modes, sighup hot-reload, empty file handling, and pass-through.
Signed-off-by: abhinavuser abhinavkumarv2101@gmail.com