35 changes: 30 additions & 5 deletions README.md
@@ -4,7 +4,7 @@ A Kubernetes controller that automatically creates and manages kuberik HealthChe

## Overview

This controller watches for DatadogMonitor resources and automatically creates corresponding kuberik HealthCheck resources when a specific annotation is present. The HealthCheck status is then continuously updated based on the DatadogMonitor's health state.
This controller watches for DatadogMonitor resources and automatically creates corresponding kuberik HealthCheck resources when a specific annotation is present. When a monitor provides Datadog deployment-gate inputs, the HealthCheck status is driven by the Datadog Deployment Gates API; otherwise it falls back to the DatadogMonitor status.

## How It Works

@@ -28,11 +28,17 @@ When the annotation is present:

### 3. Status Synchronization

The controller continuously monitors the DatadogMonitor status and updates the HealthCheck accordingly:
When the monitor has both `service:` and `env:` tags (or equivalent override annotations), the controller evaluates a Datadog deployment gate and maps the gate result to the HealthCheck:

- **OK** → `Healthy`
- **Alert** → `Unhealthy` (with error timestamp)
- **Warn/NoData/Skipped/Ignored** → `Pending`
- **pass** -> `Healthy`
- **fail** -> `Unhealthy` (with error timestamp)
- **in_progress** -> `Pending`

If deployment-gate inputs are not configured, the controller falls back to the DatadogMonitor status:

- **OK** -> `Healthy`
- **Alert** -> `Unhealthy` (with error timestamp)
- **Warn/NoData/Skipped/Ignored** -> `Pending`
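
The two mappings above can be sketched in Go as follows. This is a minimal illustration, not the controller's actual code: `healthState`, `fromGateStatus`, and `fromMonitorState` are hypothetical names, and treating an unrecognised gate status as `Pending` is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// healthState is a hypothetical stand-in for the kuberik HealthCheck status values.
type healthState string

const (
	stateHealthy   healthState = "Healthy"
	stateUnhealthy healthState = "Unhealthy"
	statePending   healthState = "Pending"
)

// fromGateStatus maps a deployment gate result to a health state.
// Unknown values fall through to Pending (an assumption of this sketch).
func fromGateStatus(status string) healthState {
	switch strings.ToLower(strings.TrimSpace(status)) {
	case "pass":
		return stateHealthy
	case "fail":
		return stateUnhealthy
	default: // "in_progress" and anything unrecognised
		return statePending
	}
}

// fromMonitorState maps a DatadogMonitor state to a health state (fallback path).
func fromMonitorState(state string) healthState {
	switch state {
	case "OK":
		return stateHealthy
	case "Alert":
		return stateUnhealthy
	default: // Warn, NoData, Skipped, Ignored
		return statePending
	}
}

func main() {
	fmt.Println(fromGateStatus("pass"), fromGateStatus("fail"), fromGateStatus("in_progress"))
	fmt.Println(fromMonitorState("OK"), fromMonitorState("Alert"), fromMonitorState("Warn"))
}
```

Keeping the two mappings in separate functions makes it explicit which source of truth produced a given HealthCheck state.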

### 4. Cleanup

@@ -50,11 +56,18 @@ metadata:
namespace: production
annotations:
kuberik.com/health-check: "true"
# Optional deployment-gate overrides:
# kuberik.com/datadog-gate-identifier: "canary"
# kuberik.com/datadog-gate-version: "1.2.3"
# kuberik.com/datadog-gate-apm-primary-tag: "team:platform"
spec:
name: "High CPU Alert"
type: "metric alert"
query: "avg(last_5m):avg:system.cpu.user{*} > 80"
message: "CPU usage is high"
tags:
- "env:production"
- "service:payments"
```

This will automatically create a HealthCheck named `datadog-check-my-monitor` in the `production` namespace.
@@ -83,6 +96,18 @@ The system consists of two main controllers:
The controller uses the following annotation key:
- `kuberik.com/health-check`: Set to `"true"` to enable HealthCheck creation

Optional deployment-gate annotations on the DatadogMonitor:
- `kuberik.com/datadog-gate-service`: Override the Datadog service instead of reading the `service:` tag
- `kuberik.com/datadog-gate-env`: Override the Datadog environment instead of reading the `env:` tag
- `kuberik.com/datadog-gate-identifier`: Override the gate identifier (defaults to `default`)
- `kuberik.com/datadog-gate-version`: Version forwarded to the Datadog gate evaluation request
- `kuberik.com/datadog-gate-apm-primary-tag`: APM primary tag forwarded to the Datadog gate evaluation request
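
One way the tag and annotation precedence described above could be resolved is sketched below. The helper names (`gateInput`, `tagValue`, `resolveGateInput`) are hypothetical and the sketch is not taken from the controller; it only assumes that an annotation, when set, wins over the matching tag and that the identifier defaults to `default`.

```go
package main

import (
	"fmt"
	"strings"
)

// annotationPrefix is the common prefix of the optional override annotations.
const annotationPrefix = "kuberik.com/datadog-gate-"

// gateInput is a hypothetical container for the resolved gate parameters.
type gateInput struct {
	Service, Env, Identifier, Version, APMPrimaryTag string
}

// tagValue returns the value of the first "key:value" tag with the given key.
func tagValue(tags []string, key string) string {
	for _, tag := range tags {
		k, v, ok := strings.Cut(tag, ":")
		if ok && strings.TrimSpace(k) == key {
			return strings.TrimSpace(v)
		}
	}
	return ""
}

// resolveGateInput prefers override annotations over monitor tags.
// The boolean is false when service or env cannot be determined,
// which is the case where the monitor-status fallback would apply.
func resolveGateInput(annotations map[string]string, tags []string) (gateInput, bool) {
	pick := func(name, fallback string) string {
		if v := strings.TrimSpace(annotations[annotationPrefix+name]); v != "" {
			return v
		}
		return fallback
	}
	in := gateInput{
		Service:       pick("service", tagValue(tags, "service")),
		Env:           pick("env", tagValue(tags, "env")),
		Identifier:    pick("identifier", "default"),
		Version:       pick("version", ""),
		APMPrimaryTag: pick("apm-primary-tag", ""),
	}
	return in, in.Service != "" && in.Env != ""
}

func main() {
	in, ok := resolveGateInput(
		map[string]string{"kuberik.com/datadog-gate-identifier": "canary"},
		[]string{"env:production", "service:payments"},
	)
	fmt.Printf("%+v %v\n", in, ok)
}
```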

To enable deployment-gate evaluations, set Datadog credentials in the controller environment:
- `DATADOG_API_KEY` or `DD_API_KEY`
- `DATADOG_APP_KEY` or `DD_APP_KEY`
- `DATADOG_SITE` or `DD_SITE` (optional, defaults to `datadoghq.com`)

## Monitoring

The controller logs all operations including:
3 changes: 3 additions & 0 deletions config/samples/datadogmonitor-sample.yaml
@@ -5,6 +5,9 @@ metadata:
namespace: default
annotations:
kuberik.com/health-check: "true"
# kuberik.com/datadog-gate-identifier: "canary"
# kuberik.com/datadog-gate-version: "1.2.3"
# kuberik.com/datadog-gate-apm-primary-tag: "team:platform"
spec:
name: "Sample Monitor"
type: "metric alert"
2 changes: 1 addition & 1 deletion go.mod
@@ -3,6 +3,7 @@ module github.com/kuberik/datadog-controller
go 1.24.2

require (
github.com/DataDog/datadog-api-client-go/v2 v2.56.0
github.com/DataDog/datadog-operator v1.8.0
github.com/kuberik/rollout-controller v0.2.1
github.com/onsi/ginkgo/v2 v2.22.0
@@ -15,7 +16,6 @@ require (

require (
cel.dev/expr v0.19.1 // indirect
github.com/DataDog/datadog-api-client-go/v2 v2.19.0 // indirect
github.com/DataDog/extendeddaemonset v0.10.0-rc.4 // indirect
github.com/DataDog/zstd v1.5.2 // indirect
github.com/Masterminds/semver/v3 v3.1.1 // indirect
4 changes: 2 additions & 2 deletions go.sum
@@ -1,7 +1,7 @@
cel.dev/expr v0.19.1 h1:NciYrtDRIR0lNCnH1LFJegdjspNx9fI59O7TWcua/W4=
cel.dev/expr v0.19.1/go.mod h1:MrpN08Q+lEBs+bGYdLxxHkZoUSsCp0nSKTs0nTymJgw=
github.com/DataDog/datadog-api-client-go/v2 v2.19.0 h1:Wvz/63/q39EpVwSH1T8jVyRvPcMfEABenU7sD3dO2Lc=
github.com/DataDog/datadog-api-client-go/v2 v2.19.0/go.mod h1:oD5Lx8Li3oPRa/BSBenkn4i48z+91gwYORF/+6ph71g=
github.com/DataDog/datadog-api-client-go/v2 v2.56.0 h1:HKcfvAODmJCUw7nfbDKKqkEUgcu7CfxUPA9EFRJrHEI=
github.com/DataDog/datadog-api-client-go/v2 v2.56.0/go.mod h1:d3tOEgUd2kfsr9uuHQdY+nXrWp4uikgTgVCPdKNK30U=
github.com/DataDog/datadog-operator v1.8.0 h1:5gzza6p+kwxkO0kYfKpN6c8l96xlP6bk7K5pxDEPRZg=
github.com/DataDog/datadog-operator v1.8.0/go.mod h1:IaTKfjDrsmc7pcBCaKnlhS/I68GuiFIpAoM0+fh3QfQ=
github.com/DataDog/extendeddaemonset v0.10.0-rc.4 h1:m88E+emuRHIqKgi7kHMd9N0S/NtruCCOISp3cjB7DNs=
263 changes: 263 additions & 0 deletions internal/controller/datadog_gates_client.go
@@ -0,0 +1,263 @@
package controller

import (
"context"
"errors"
"fmt"
"net/http"
"net/url"
"os"
"strings"
"time"

"github.com/DataDog/datadog-api-client-go/v2/api/datadog"
)

const (
datadogAPIKeyEnvVar = "DATADOG_API_KEY"
datadogAltAPIKeyEnvVar = "DD_API_KEY"
datadogAppKeyEnvVar = "DATADOG_APP_KEY"
datadogAltAppKeyEnvVar = "DD_APP_KEY"
datadogSiteEnvVar = "DATADOG_SITE"
datadogAltSiteEnvVar = "DD_SITE"
defaultDatadogSite = "datadoghq.com"
defaultDatadogHTTPTimout = 10 * time.Second
)

var errDatadogDeploymentGatesNotConfigured = errors.New("datadog deployment gates client is not configured")

type deploymentGateClient interface {
StartEvaluation(ctx context.Context, input deploymentGateInput) (string, error)
GetEvaluation(ctx context.Context, evaluationID string) (deploymentGateEvaluation, error)
}

type deploymentGateInput struct {
Service string
Env string
Identifier string
Version string
APMPrimaryTag string
}

type deploymentGateEvaluation struct {
ID string
URL string
Status string
Rules []deploymentGateRuleEvaluation
}

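// deploymentGateRuleEvaluation is decoded directly from the API response in GetEvaluation.
// Explicit tags are needed because encoding/json's case-insensitive matching does not map
// a snake_case key onto DryRun; the key names assume the API's snake_case convention.
type deploymentGateRuleEvaluation struct {
Name string `json:"name"`
Status string `json:"status"`
Reason string `json:"reason"`
DryRun bool `json:"dry_run"`
}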

type datadogDeploymentGatesClient struct {
apiClient *datadog.APIClient
apiKeys map[string]datadog.APIKey
serverVariables map[string]string
}

func newDatadogDeploymentGatesClientFromEnv() (deploymentGateClient, error) {
apiKey := firstNonEmpty(os.Getenv(datadogAPIKeyEnvVar), os.Getenv(datadogAltAPIKeyEnvVar))
if apiKey == "" {
return nil, fmt.Errorf("%w: missing %s or %s", errDatadogDeploymentGatesNotConfigured, datadogAPIKeyEnvVar, datadogAltAPIKeyEnvVar)
}

appKey := firstNonEmpty(os.Getenv(datadogAppKeyEnvVar), os.Getenv(datadogAltAppKeyEnvVar))
if appKey == "" {
return nil, fmt.Errorf("%w: missing %s or %s", errDatadogDeploymentGatesNotConfigured, datadogAppKeyEnvVar, datadogAltAppKeyEnvVar)
}

site := firstNonEmpty(os.Getenv(datadogSiteEnvVar), os.Getenv(datadogAltSiteEnvVar), defaultDatadogSite)
return newDatadogDeploymentGatesClient(site, apiKey, appKey, nil), nil
}

func newDatadogDeploymentGatesClient(baseURL string, apiKey string, appKey string, httpClient *http.Client) deploymentGateClient {
if httpClient == nil {
httpClient = &http.Client{Timeout: defaultDatadogHTTPTimout}
}

cfg := datadog.NewConfiguration()
cfg.HTTPClient = httpClient

client := &datadogDeploymentGatesClient{
apiClient: datadog.NewAPIClient(cfg),
apiKeys: map[string]datadog.APIKey{
"apiKeyAuth": {Key: apiKey},
"appKeyAuth": {Key: appKey},
},
}

trimmed := strings.TrimSpace(baseURL)
if strings.HasPrefix(trimmed, "https://") || strings.HasPrefix(trimmed, "http://") {
cfg.Servers = datadog.ServerConfigurations{
{
URL: strings.TrimRight(trimmed, "/"),
Description: "custom Datadog deployment gates server",
},
}
return client
}

client.serverVariables = map[string]string{
"site": normalizeDatadogSite(trimmed),
}

return client
}

func (c *datadogDeploymentGatesClient) StartEvaluation(ctx context.Context, input deploymentGateInput) (string, error) {
payload := map[string]any{
"data": map[string]any{
"type": "deployment_gates_evaluation_request",
"attributes": map[string]any{
"service": input.Service,
"env": input.Env,
"identifier": input.Identifier,
},
},
}

attributes := payload["data"].(map[string]any)["attributes"].(map[string]any)
if input.Version != "" {
attributes["version"] = input.Version
}
if input.APMPrimaryTag != "" {
attributes["apm_primary_tag"] = input.APMPrimaryTag
}

var response struct {
Data struct {
Attributes struct {
EvaluationID string `json:"evaluation_id"`
} `json:"attributes"`
} `json:"data"`
}

if err := c.do(ctx, http.MethodPost, "/api/unstable/deployments/gates/evaluation", payload, &response); err != nil {
return "", err
}

if response.Data.Attributes.EvaluationID == "" {
return "", errors.New("deployment gate evaluation response did not include an evaluation_id")
}

return response.Data.Attributes.EvaluationID, nil
}

func (c *datadogDeploymentGatesClient) GetEvaluation(ctx context.Context, evaluationID string) (deploymentGateEvaluation, error) {
var response struct {
Data struct {
ID string `json:"id"`
Attributes struct {
EvaluationURL string `json:"evaluation_url"`
GateStatus string `json:"gate_status"`
Rules []deploymentGateRuleEvaluation `json:"rules"`
} `json:"attributes"`
} `json:"data"`
}

if err := c.do(ctx, http.MethodGet, "/api/unstable/deployments/gates/evaluation/"+url.PathEscape(evaluationID), nil, &response); err != nil {
return deploymentGateEvaluation{}, err
}

return deploymentGateEvaluation{
ID: response.Data.ID,
URL: response.Data.Attributes.EvaluationURL,
Status: response.Data.Attributes.GateStatus,
Rules: response.Data.Attributes.Rules,
}, nil
}

func (c *datadogDeploymentGatesClient) do(ctx context.Context, method string, path string, body any, out any) error {
requestContext := c.requestContext(ctx)
baseURL, err := c.apiClient.Cfg.ServerURLWithContext(requestContext, "deploymentGatesEvaluation")
if err != nil {
return fmt.Errorf("resolve datadog deployment gates server URL: %w", err)
}

headerParams := map[string]string{
"Accept": "application/json",
}
if body != nil {
headerParams["Content-Type"] = "application/json"
}

datadog.SetAuthKeys(
requestContext,
&headerParams,
[2]string{"apiKeyAuth", "DD-API-KEY"},
[2]string{"appKeyAuth", "DD-APPLICATION-KEY"},
)

req, err := c.apiClient.PrepareRequest(
requestContext,
baseURL+path,
method,
body,
headerParams,
url.Values{},
url.Values{},
nil,
)
if err != nil {
return fmt.Errorf("prepare datadog deployment gates request: %w", err)
}

resp, err := c.apiClient.CallAPI(req)
if err != nil {
return fmt.Errorf("call datadog deployment gates API: %w", err)
}
if resp == nil {
return errors.New("datadog deployment gates API returned a nil response")
}

responseBody, err := datadog.ReadBody(resp)
if err != nil {
return fmt.Errorf("read datadog deployment gates API response: %w", err)
}

if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return fmt.Errorf("datadog deployment gates API returned %s: %s", resp.Status, strings.TrimSpace(string(responseBody)))
}

if err := c.apiClient.Decode(out, responseBody, resp.Header.Get("Content-Type")); err != nil {
return fmt.Errorf("decode datadog deployment gates API response: %w", err)
}

return nil
}

func (c *datadogDeploymentGatesClient) requestContext(ctx context.Context) context.Context {
if ctx == nil {
ctx = context.Background()
}

ctx = context.WithValue(ctx, datadog.ContextAPIKeys, c.apiKeys)
if len(c.serverVariables) > 0 {
ctx = context.WithValue(ctx, datadog.ContextServerVariables, c.serverVariables)
}

return ctx
}

func normalizeDatadogSite(site string) string {
trimmed := strings.TrimSpace(site)
if trimmed == "" {
return defaultDatadogSite
}

return strings.TrimPrefix(trimmed, "api.")
}

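// firstNonEmpty returns the first value that is non-empty after trimming whitespace.
func firstNonEmpty(values ...string) string {
for _, value := range values {
// Return the trimmed value so stray whitespace, such as a trailing newline
// from a mounted secret, never reaches a request header or server URL.
if trimmed := strings.TrimSpace(value); trimmed != "" {
return trimmed
}
}

return ""
}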