From db6ad55e8ffa982cfb425354f44a419b02cbc8e2 Mon Sep 17 00:00:00 2001 From: Ritvik Jayaswal Date: Thu, 8 Jan 2026 13:06:06 -0500 Subject: [PATCH 1/7] added Appinsights doc --- docs/designs/appinsights-metrics.md | 306 ++++++++++++++++++++++++++++ 1 file changed, 306 insertions(+) create mode 100644 docs/designs/appinsights-metrics.md diff --git a/docs/designs/appinsights-metrics.md b/docs/designs/appinsights-metrics.md new file mode 100644 index 00000000..28808456 --- /dev/null +++ b/docs/designs/appinsights-metrics.md @@ -0,0 +1,306 @@ +# Application Insights Telemetry Collection Specification + +## Overview +This document specifies all telemetry data points to be collected by Application Insights for the DocumentDB Kubernetes Operator. These metrics provide operational insights, usage patterns, and error tracking for operator deployments. + +--- + +## 1. Operator Lifecycle Metrics + +### Operator Startup Events +- **Event**: `OperatorStartup` +- **Properties**: + - `operator_version`: Semantic version of the operator + - `kubernetes_version`: K8s cluster version + - `cloud_provider`: Detected environment (`aks`, `eks`, `gke`, `unknown`) + - `startup_timestamp`: ISO 8601 timestamp + - `restart_count`: Number of restarts in the last hour + - `helm_chart_version`: Version of the Helm chart used (if applicable) + +### Operator Health Checks +- **Metric**: `operator.health.status` +- **Value**: `1` (healthy) or `0` (unhealthy) +- **Frequency**: Every 60 seconds +- **Dimensions**: `pod_name`, `namespace` + +--- + +## 2. 
Cluster Management Metrics + +### Cluster Count & Configuration +- **Metric**: `documentdb.clusters.active.count` +- **Description**: Total number of active DocumentDB clusters managed by the operator +- **Dimensions**: + - `namespace`: Kubernetes namespace + - `cloud_provider`: `aks`, `eks`, `gke` + - `environment`: `aks`, `eks`, `gke` (from spec.environment) + +### Cluster Size Metrics +- **Metric**: `documentdb.cluster.configuration` +- **Properties per cluster**: + - `cluster_name`: Name of the DocumentDB cluster + - `namespace`: Kubernetes namespace + - `node_count`: Number of nodes (currently always 1) + - `instances_per_node`: Number of instances per node (1-3) + - `total_instances`: node_count × instances_per_node + - `storage_class`: Storage class name + - `pvc_size`: Persistent volume claim size (e.g., "10Gi") + - `documentdb_version`: Version of DocumentDB components + - `postgresql_image`: Container image for PostgreSQL + - `gateway_image`: Container image for Gateway sidecar + +### Multi-Region Configuration +- **Metric**: `documentdb.cluster.replication.enabled` +- **Value**: `1` (enabled) or `0` (disabled) +- **Properties**: + - `cluster_name`: Name of the DocumentDB cluster + - `cross_cloud_networking_strategy`: `AzureFleet`, `Istio`, `None` + - `primary_cluster`: Name of the primary cluster + - `replica_count`: Number of clusters in replication list + - `high_availability`: Boolean indicating HA replicas on primary + - `participating_clusters`: Comma-separated list of cluster names + - `environments`: Comma-separated list of environments in replication + +--- + +## 3. 
Cluster Lifecycle Operations + +### Create Operations +- **Event**: `ClusterCreated` +- **Properties**: + - `cluster_name`: Name of the cluster + - `namespace`: Kubernetes namespace + - `creation_duration_seconds`: Time to create cluster + - `node_count`: Number of nodes + - `instances_per_node`: Instances per node + - `storage_size`: PVC size + - `cloud_provider`: Deployment environment + - `tls_enabled`: Boolean for TLS configuration + - `bootstrap_type`: `new` or `recovery` (if recovery, from backup) + - `sidecar_injector_plugin`: Plugin name if configured + - `service_type`: `LoadBalancer` or `ClusterIP` + +### Update Operations +- **Event**: `ClusterUpdated` +- **Properties**: + - `cluster_name`: Name of the cluster + - `namespace`: Kubernetes namespace + - `update_type`: `scale`, `version`, `configuration`, `storage` + - `previous_value`: Previous configuration value + - `new_value`: New configuration value + - `update_duration_seconds`: Time to apply update + +### Delete Operations +- **Event**: `ClusterDeleted` +- **Properties**: + - `cluster_name`: Name of the cluster + - `namespace`: Kubernetes namespace + - `deletion_duration_seconds`: Time to delete cluster + - `cluster_age_days`: Age of cluster at deletion + - `backup_count`: Number of backups associated with the cluster + +--- + +## 4. 
Backup & Restore Operations + +### Backup Operations +- **Event**: `BackupCreated` +- **Properties**: + - `backup_name`: Name of the backup + - `cluster_name`: Source cluster name + - `namespace`: Kubernetes namespace + - `backup_type`: `on-demand` or `scheduled` + - `backup_method`: `VolumeSnapshot` (CNPG method) + - `backup_size_bytes`: Size of the backup + - `backup_duration_seconds`: Time to complete backup + - `retention_days`: Configured retention period + - `backup_phase`: `starting`, `running`, `completed`, `failed`, `skipped` + - `cloud_provider`: Environment where backup was taken + - `is_primary_cluster`: Boolean indicating if backup from primary + +- **Event**: `BackupDeleted` +- **Properties**: + - `backup_name`: Name of the backup + - `deletion_reason`: `expired`, `manual`, `cluster-deleted` + - `backup_age_days`: Age of backup at deletion + +- **Metric**: `documentdb.backups.active.count` +- **Description**: Total number of active backups +- **Dimensions**: `namespace`, `cluster_name`, `backup_type` + +### Scheduled Backup Operations +- **Event**: `ScheduledBackupCreated` +- **Properties**: + - `scheduled_backup_name`: Name of the scheduled backup + - `cluster_name`: Target cluster name + - `schedule`: Cron expression + - `retention_days`: Retention policy + +- **Metric**: `documentdb.scheduled_backups.active.count` +- **Description**: Number of active scheduled backup jobs + +### Restore Operations +- **Event**: `ClusterRestored` +- **Properties**: + - `new_cluster_name`: Name of the restored cluster + - `source_backup_name`: Backup used for recovery + - `namespace`: Kubernetes namespace + - `restore_duration_seconds`: Time to restore from backup + - `backup_age_hours`: Age of backup at restore time + - `restore_phase`: `starting`, `running`, `completed`, `failed` + +--- + +## 5. 
Failover & High Availability Metrics + +### Failover Events +- **Event**: `FailoverOccurred` +- **Properties**: + - `cluster_name`: Name of the cluster + - `namespace`: Kubernetes namespace + - `failover_type`: `automatic`, `manual`, `switchover` + - `old_primary`: Previous primary instance + - `new_primary`: New primary instance + - `failover_duration_seconds`: Time to complete failover + - `downtime_seconds`: Observed downtime during failover + - `replication_lag_bytes`: Replication lag before failover + - `trigger_reason`: `node-failure`, `pod-crash`, `manual`, `health-check-failure` + +### Replication Health +- **Metric**: `documentdb.replication.lag.bytes` +- **Description**: Replication lag in bytes +- **Dimensions**: `cluster_name`, `replica_cluster`, `namespace` +- **Frequency**: Every 30 seconds + +- **Metric**: `documentdb.replication.status` +- **Value**: `1` (healthy) or `0` (unhealthy) +- **Dimensions**: `cluster_name`, `replica_cluster`, `namespace` + +--- + +## 6. Error Tracking + +### Reconciliation Errors +- **Event**: `ReconciliationError` +- **Properties**: + - `resource_type`: `DocumentDB`, `Backup`, `ScheduledBackup` + - `resource_name`: Name of the resource + - `namespace`: Kubernetes namespace + - `error_type`: `cluster-creation`, `backup-failure`, `restore-failure`, `volume-snapshot`, `replication-config`, `tls-cert` + - `error_message`: Sanitized error message (no PII) + - `error_code`: Standard error code + - `retry_count`: Number of retry attempts + - `resolution_status`: `pending`, `resolved`, `failed` + +### Volume Snapshot Errors +- **Event**: `VolumeSnapshotError` +- **Properties**: + - `backup_name`: Name of the backup + - `cluster_name`: Source cluster name + - `error_type`: `snapshot-class-missing`, `driver-unavailable`, `quota-exceeded`, `snapshot-failed` + - `csi_driver`: CSI driver name (`disk.csi.azure.com`, etc.) 
+ - `cloud_provider`: Environment + +### CNPG Integration Errors +- **Event**: `CNPGIntegrationError` +- **Properties**: + - `cluster_name`: DocumentDB cluster name + - `cnpg_resource_type`: `Cluster`, `Backup`, `ScheduledBackup` + - `error_message`: Error from CNPG operator + - `operation`: `create`, `update`, `delete`, `status-sync` + +--- + +## 7. Feature Usage Metrics + +### TLS Configuration Usage +- **Metric**: `documentdb.tls.enabled.count` +- **Description**: Number of clusters with TLS enabled +- **Properties per cluster**: + - `tls_mode`: `manual-provided`, `cert-manager` + - `server_tls_enabled`: Boolean + - `client_tls_enabled`: Boolean + +### Service Exposure Methods +- **Metric**: `documentdb.service_exposure.count` +- **Dimensions**: + - `service_type`: `LoadBalancer`, `ClusterIP` + - `cloud_provider`: `aks`, `eks`, `gke` + +### Plugin Usage +- **Metric**: `documentdb.plugin.usage.count` +- **Properties**: + - `sidecar_injector_plugin`: Plugin name (if used) + - `wal_replica_plugin`: Plugin name (if used) + +--- + +## 8. Performance & Resource Metrics + +### Reconciliation Performance +- **Metric**: `documentdb.reconciliation.duration.seconds` +- **Description**: Time to reconcile resources +- **Dimensions**: `resource_type`, `operation`, `status` +- **Statistics**: p50, p95, p99 + +### API Call Latency +- **Metric**: `documentdb.api.duration.seconds` +- **Description**: Kubernetes API call duration +- **Dimensions**: `operation`, `resource_type`, `result` + +--- + +## 9. 
Compliance & Retention Metrics + +### Backup Retention Policy +- **Metric**: `documentdb.backup.retention.days` +- **Description**: Configured retention days per cluster +- **Dimensions**: `cluster_name`, `policy_level` (`cluster`, `backup`, `scheduled-backup`) + +### Expired Backups +- **Event**: `BackupExpired` +- **Properties**: + - `backup_name`: Name of the expired backup + - `cluster_name`: Source cluster + - `retention_days`: Configured retention + - `actual_age_days`: Actual age at expiration + +--- + +## 10. Deployment Context + +### Cluster Environment +- **Properties** (collected once at startup, attached to all events): + - `kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `other` + - `kubernetes_version`: K8s version + - `region`: Cloud region (if detectable) + - `operator_namespace`: Namespace where operator runs + - `installation_method`: `helm`, `kubectl`, `operator-sdk` + +--- + +## Data Privacy & Security + +- **No PII**: Do not collect usernames, passwords, connection strings, or IP addresses +- **Sanitize errors**: Remove sensitive data from error messages +- **Cluster names**: Use hashed cluster names if privacy required +- **Opt-out**: Provide mechanism to disable telemetry collection + +--- + +## Implementation Notes + +1. **Sampling**: Apply sampling for high-frequency metrics (e.g., reconciliation events) +2. **Batching**: Batch events in 30-second windows to reduce API calls +3. **Cardinality**: Monitor dimension cardinality to avoid explosion +4. **Retry logic**: Implement exponential backoff for telemetry submission failures +5. **Local buffering**: Buffer events locally if Application Insights is unreachable +6. 
**Health endpoint**: Expose `/metrics` endpoint for Prometheus scraping
+
+---
+
+## Revision History
+
+| Date | Version | Changes |
+|------|---------|---------|
+| 2026-01-08 | 1.0 | Initial specification |

From 0165f5c45ffb1c584f610f8215a9296650aae4b1 Mon Sep 17 00:00:00 2001
From: Ritvik Jayaswal
Date: Mon, 2 Feb 2026 11:59:17 -0500
Subject: [PATCH 2/7] fixed PII

---
 docs/designs/appinsights-metrics.md | 112 ++++++++++++++--------------
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/docs/designs/appinsights-metrics.md b/docs/designs/appinsights-metrics.md
index 28808456..286ab1c2 100644
--- a/docs/designs/appinsights-metrics.md
+++ b/docs/designs/appinsights-metrics.md
@@ -31,34 +31,31 @@ This document specifies all telemetry data points to be collected by Application
 - **Metric**: `documentdb.clusters.active.count`
 - **Description**: Total number of active DocumentDB clusters managed by the operator
 - **Dimensions**:
-  - `namespace`: Kubernetes namespace
+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
   - `cloud_provider`: `aks`, `eks`, `gke`
   - `environment`: `aks`, `eks`, `gke` (from spec.environment)
 
 ### Cluster Size Metrics
 - **Metric**: `documentdb.cluster.configuration`
 - **Properties per cluster**:
-  - `cluster_name`: Name of the DocumentDB cluster
-  - `namespace`: Kubernetes namespace
+  - `cluster_id`: Auto-generated GUID for the DocumentDB cluster (for correlation without PII)
+  - `namespace_hash`: SHA-256 hash of the Kubernetes namespace
   - `node_count`: Number of nodes (currently always 1)
   - `instances_per_node`: Number of instances per node (1-3)
   - `total_instances`: node_count × instances_per_node
-  - `storage_class`: Storage class name
-  - `pvc_size`: Persistent volume claim size (e.g., "10Gi")
+  - `pvc_size_category`: PVC size category (`small` <50Gi, `medium` 50-200Gi, `large` >200Gi)
   - `documentdb_version`: Version of DocumentDB components
-  - `postgresql_image`: Container image for PostgreSQL
-  - `gateway_image`:
Container image for Gateway sidecar ### Multi-Region Configuration - **Metric**: `documentdb.cluster.replication.enabled` - **Value**: `1` (enabled) or `0` (disabled) - **Properties**: - - `cluster_name`: Name of the DocumentDB cluster + - `cluster_id`: Auto-generated GUID for the DocumentDB cluster - `cross_cloud_networking_strategy`: `AzureFleet`, `Istio`, `None` - - `primary_cluster`: Name of the primary cluster + - `primary_cluster_id`: GUID of the primary cluster - `replica_count`: Number of clusters in replication list - `high_availability`: Boolean indicating HA replicas on primary - - `participating_clusters`: Comma-separated list of cluster names + - `participating_cluster_count`: Number of participating clusters - `environments`: Comma-separated list of environments in replication --- @@ -68,8 +65,8 @@ This document specifies all telemetry data points to be collected by Application ### Create Operations - **Event**: `ClusterCreated` - **Properties**: - - `cluster_name`: Name of the cluster - - `namespace`: Kubernetes namespace + - `cluster_id`: Auto-generated GUID for the cluster + - `namespace_hash`: SHA-256 hash of the Kubernetes namespace - `creation_duration_seconds`: Time to create cluster - `node_count`: Number of nodes - `instances_per_node`: Instances per node @@ -77,24 +74,22 @@ This document specifies all telemetry data points to be collected by Application - `cloud_provider`: Deployment environment - `tls_enabled`: Boolean for TLS configuration - `bootstrap_type`: `new` or `recovery` (if recovery, from backup) - - `sidecar_injector_plugin`: Plugin name if configured + - `sidecar_injector_plugin`: Boolean indicating if plugin is configured - `service_type`: `LoadBalancer` or `ClusterIP` ### Update Operations - **Event**: `ClusterUpdated` - **Properties**: - - `cluster_name`: Name of the cluster - - `namespace`: Kubernetes namespace + - `cluster_id`: Auto-generated GUID for the cluster + - `namespace_hash`: SHA-256 hash of the Kubernetes 
namespace - `update_type`: `scale`, `version`, `configuration`, `storage` - - `previous_value`: Previous configuration value - - `new_value`: New configuration value - `update_duration_seconds`: Time to apply update ### Delete Operations - **Event**: `ClusterDeleted` - **Properties**: - - `cluster_name`: Name of the cluster - - `namespace`: Kubernetes namespace + - `cluster_id`: Auto-generated GUID for the cluster + - `namespace_hash`: SHA-256 hash of the Kubernetes namespace - `deletion_duration_seconds`: Time to delete cluster - `cluster_age_days`: Age of cluster at deletion - `backup_count`: Number of backups associated with the cluster @@ -106,9 +101,9 @@ This document specifies all telemetry data points to be collected by Application ### Backup Operations - **Event**: `BackupCreated` - **Properties**: - - `backup_name`: Name of the backup - - `cluster_name`: Source cluster name - - `namespace`: Kubernetes namespace + - `backup_id`: Auto-generated GUID for the backup + - `cluster_id`: GUID of the source cluster + - `namespace_hash`: SHA-256 hash of the Kubernetes namespace - `backup_type`: `on-demand` or `scheduled` - `backup_method`: `VolumeSnapshot` (CNPG method) - `backup_size_bytes`: Size of the backup @@ -120,20 +115,20 @@ This document specifies all telemetry data points to be collected by Application - **Event**: `BackupDeleted` - **Properties**: - - `backup_name`: Name of the backup + - `backup_id`: GUID of the backup - `deletion_reason`: `expired`, `manual`, `cluster-deleted` - `backup_age_days`: Age of backup at deletion - **Metric**: `documentdb.backups.active.count` - **Description**: Total number of active backups -- **Dimensions**: `namespace`, `cluster_name`, `backup_type` +- **Dimensions**: `namespace_hash`, `cluster_id`, `backup_type` ### Scheduled Backup Operations - **Event**: `ScheduledBackupCreated` - **Properties**: - - `scheduled_backup_name`: Name of the scheduled backup - - `cluster_name`: Target cluster name - - `schedule`: Cron 
expression + - `scheduled_backup_id`: Auto-generated GUID for the scheduled backup + - `cluster_id`: GUID of the target cluster + - `schedule_frequency`: Frequency category (`hourly`, `daily`, `weekly`, `custom`) - `retention_days`: Retention policy - **Metric**: `documentdb.scheduled_backups.active.count` @@ -142,9 +137,9 @@ This document specifies all telemetry data points to be collected by Application ### Restore Operations - **Event**: `ClusterRestored` - **Properties**: - - `new_cluster_name`: Name of the restored cluster - - `source_backup_name`: Backup used for recovery - - `namespace`: Kubernetes namespace + - `new_cluster_id`: Auto-generated GUID for the restored cluster + - `source_backup_id`: GUID of the backup used for recovery + - `namespace_hash`: SHA-256 hash of the Kubernetes namespace - `restore_duration_seconds`: Time to restore from backup - `backup_age_hours`: Age of backup at restore time - `restore_phase`: `starting`, `running`, `completed`, `failed` @@ -156,11 +151,11 @@ This document specifies all telemetry data points to be collected by Application ### Failover Events - **Event**: `FailoverOccurred` - **Properties**: - - `cluster_name`: Name of the cluster - - `namespace`: Kubernetes namespace + - `cluster_id`: Auto-generated GUID for the cluster + - `namespace_hash`: SHA-256 hash of the Kubernetes namespace - `failover_type`: `automatic`, `manual`, `switchover` - - `old_primary`: Previous primary instance - - `new_primary`: New primary instance + - `old_primary_index`: Index of the previous primary instance (e.g., 0, 1, 2) + - `new_primary_index`: Index of the new primary instance - `failover_duration_seconds`: Time to complete failover - `downtime_seconds`: Observed downtime during failover - `replication_lag_bytes`: Replication lag before failover @@ -168,13 +163,14 @@ This document specifies all telemetry data points to be collected by Application ### Replication Health - **Metric**: `documentdb.replication.lag.bytes` -- 
**Description**: Replication lag in bytes -- **Dimensions**: `cluster_name`, `replica_cluster`, `namespace` -- **Frequency**: Every 30 seconds +- **Description**: Replication lag in bytes (aggregated over 2-hour windows) +- **Dimensions**: `cluster_id`, `replica_cluster_id`, `namespace_hash` +- **Statistics**: min, max, avg (reported as tuple) +- **Frequency**: Every 2 hours (aggregated) - **Metric**: `documentdb.replication.status` - **Value**: `1` (healthy) or `0` (unhealthy) -- **Dimensions**: `cluster_name`, `replica_cluster`, `namespace` +- **Dimensions**: `cluster_id`, `replica_cluster_id`, `namespace_hash` --- @@ -184,8 +180,8 @@ This document specifies all telemetry data points to be collected by Application - **Event**: `ReconciliationError` - **Properties**: - `resource_type`: `DocumentDB`, `Backup`, `ScheduledBackup` - - `resource_name`: Name of the resource - - `namespace`: Kubernetes namespace + - `resource_id`: Auto-generated GUID of the resource + - `namespace_hash`: SHA-256 hash of the Kubernetes namespace - `error_type`: `cluster-creation`, `backup-failure`, `restore-failure`, `volume-snapshot`, `replication-config`, `tls-cert` - `error_message`: Sanitized error message (no PII) - `error_code`: Standard error code @@ -195,18 +191,18 @@ This document specifies all telemetry data points to be collected by Application ### Volume Snapshot Errors - **Event**: `VolumeSnapshotError` - **Properties**: - - `backup_name`: Name of the backup - - `cluster_name`: Source cluster name + - `backup_id`: GUID of the backup + - `cluster_id`: GUID of the source cluster - `error_type`: `snapshot-class-missing`, `driver-unavailable`, `quota-exceeded`, `snapshot-failed` - - `csi_driver`: CSI driver name (`disk.csi.azure.com`, etc.) 
+ - `csi_driver_type`: CSI driver type (`azure-disk`, `aws-ebs`, `gce-pd`, `other`) - `cloud_provider`: Environment ### CNPG Integration Errors - **Event**: `CNPGIntegrationError` - **Properties**: - - `cluster_name`: DocumentDB cluster name + - `cluster_id`: GUID of the DocumentDB cluster - `cnpg_resource_type`: `Cluster`, `Backup`, `ScheduledBackup` - - `error_message`: Error from CNPG operator + - `error_category`: Categorized error type (no raw error messages) - `operation`: `create`, `update`, `delete`, `status-sync` --- @@ -230,8 +226,8 @@ This document specifies all telemetry data points to be collected by Application ### Plugin Usage - **Metric**: `documentdb.plugin.usage.count` - **Properties**: - - `sidecar_injector_plugin`: Plugin name (if used) - - `wal_replica_plugin`: Plugin name (if used) + - `sidecar_injector_plugin_enabled`: Boolean indicating if plugin is used + - `wal_replica_plugin_enabled`: Boolean indicating if plugin is used --- @@ -255,13 +251,13 @@ This document specifies all telemetry data points to be collected by Application ### Backup Retention Policy - **Metric**: `documentdb.backup.retention.days` - **Description**: Configured retention days per cluster -- **Dimensions**: `cluster_name`, `policy_level` (`cluster`, `backup`, `scheduled-backup`) +- **Dimensions**: `cluster_id`, `policy_level` (`cluster`, `backup`, `scheduled-backup`) ### Expired Backups - **Event**: `BackupExpired` - **Properties**: - - `backup_name`: Name of the expired backup - - `cluster_name`: Source cluster + - `backup_id`: GUID of the expired backup + - `cluster_id`: GUID of the source cluster - `retention_days`: Configured retention - `actual_age_days`: Actual age at expiration @@ -271,10 +267,10 @@ This document specifies all telemetry data points to be collected by Application ### Cluster Environment - **Properties** (collected once at startup, attached to all events): - - `kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `other` + - 
`kubernetes_distribution`: `aks`, `eks`, `gke`, `openshift`, `rancher`, `vmware-tanzu`, `other` - `kubernetes_version`: K8s version - - `region`: Cloud region (if detectable) - - `operator_namespace`: Namespace where operator runs + - `region`: Cloud region (from `topology.kubernetes.io/region` label if available) + - `operator_namespace_hash`: SHA-256 hash of the namespace where operator runs - `installation_method`: `helm`, `kubectl`, `operator-sdk` --- @@ -282,8 +278,11 @@ This document specifies all telemetry data points to be collected by Application ## Data Privacy & Security - **No PII**: Do not collect usernames, passwords, connection strings, or IP addresses -- **Sanitize errors**: Remove sensitive data from error messages -- **Cluster names**: Use hashed cluster names if privacy required +- **Resource Identifiers**: Use auto-generated GUIDs for cluster, backup, and resource identification instead of user-provided names +- **Namespace Protection**: Use SHA-256 hashed namespace values to prevent leaking organizational structure +- **Storage Class**: Do not collect storage class names (may contain PII) +- **Sanitize errors**: Remove sensitive data from error messages; use error categories instead of raw messages +- **GUID Correlation**: GUIDs are generated and stored in resource annotations for event correlation - **Opt-out**: Provide mechanism to disable telemetry collection --- @@ -295,7 +294,7 @@ This document specifies all telemetry data points to be collected by Application 3. **Cardinality**: Monitor dimension cardinality to avoid explosion 4. **Retry logic**: Implement exponential backoff for telemetry submission failures 5. **Local buffering**: Buffer events locally if Application Insights is unreachable -6. **Health endpoint**: Expose `/metrics` endpoint for Prometheus scraping +6. 
**GUID Generation**: Generate and persist GUIDs in resource annotations (`telemetry.documentdb.io/cluster-id`) at resource creation time --- @@ -304,3 +303,4 @@ This document specifies all telemetry data points to be collected by Application | Date | Version | Changes | |------|---------|---------| | 2026-01-08 | 1.0 | Initial specification | +| 2026-01-29 | 1.1 | Address PII concerns: replaced cluster/backup names with GUIDs, hashed namespaces, removed storage class and container image names, categorized errors instead of raw messages, added more kubernetes distributions | From 7e64d7fb04cf93e12bf2c26f5fc0937eee29876a Mon Sep 17 00:00:00 2001 From: Ritvik Jayaswal Date: Tue, 3 Feb 2026 17:10:47 -0500 Subject: [PATCH 3/7] Implement Application Insights telemetry integration - Add internal/telemetry package with: - types.go: Event and metric type definitions per spec - client.go: Application Insights client with batching, retry, and buffering - events.go: Event tracking helper functions - metrics.go: Metric tracking helper functions - guid.go: GUID generation for privacy-preserving resource correlation - utils.go: Utility functions for hashing, categorization - manager.go: Central telemetry manager - Integrate telemetry into controllers: - DocumentDBReconciler: Track cluster lifecycle, reconciliation metrics - BackupReconciler: Track backup events - ScheduledBackupReconciler: Track scheduled backup events - Update main.go to initialize telemetry on startup - Add google/uuid dependency for GUID generation - Add telemetry-configuration.md documentation Uses placeholder for Application Insights key via environment variables: - APPINSIGHTS_INSTRUMENTATIONKEY - APPLICATIONINSIGHTS_CONNECTION_STRING - DOCUMENTDB_TELEMETRY_ENABLED (set to false to disable) --- docs/designs/telemetry-configuration.md | 132 ++++++ operator/src/cmd/main.go | 51 +- operator/src/go.mod | 1 + .../internal/controller/backup_controller.go | 6 +- .../controller/documentdb_controller.go | 108 ++++- 
.../controller/scheduledbackup_controller.go | 6 +- operator/src/internal/telemetry/client.go | 444 ++++++++++++++++++ operator/src/internal/telemetry/events.go | 180 +++++++ operator/src/internal/telemetry/guid.go | 113 +++++ operator/src/internal/telemetry/manager.go | 241 ++++++++++ operator/src/internal/telemetry/metrics.go | 157 +++++++ operator/src/internal/telemetry/types.go | 247 ++++++++++ operator/src/internal/telemetry/utils.go | 124 +++++ 13 files changed, 1793 insertions(+), 17 deletions(-) create mode 100644 docs/designs/telemetry-configuration.md create mode 100644 operator/src/internal/telemetry/client.go create mode 100644 operator/src/internal/telemetry/events.go create mode 100644 operator/src/internal/telemetry/guid.go create mode 100644 operator/src/internal/telemetry/manager.go create mode 100644 operator/src/internal/telemetry/metrics.go create mode 100644 operator/src/internal/telemetry/types.go create mode 100644 operator/src/internal/telemetry/utils.go diff --git a/docs/designs/telemetry-configuration.md b/docs/designs/telemetry-configuration.md new file mode 100644 index 00000000..152501a2 --- /dev/null +++ b/docs/designs/telemetry-configuration.md @@ -0,0 +1,132 @@ +# Application Insights Telemetry Configuration + +This document describes how to configure Application Insights telemetry collection for the DocumentDB Kubernetes Operator. + +## Overview + +The DocumentDB Operator can send telemetry data to Azure Application Insights to help monitor operator health, track cluster lifecycle events, and diagnose issues. All telemetry is designed with privacy in mind - no personally identifiable information (PII) is collected. 
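
The privacy transformations this document relies on (SHA-256 namespace hashing and PVC size bucketing, per the metrics spec) can be sketched in Go. This is a minimal illustration only; the helper names below are hypothetical and not the operator's actual telemetry API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashNamespace returns the SHA-256 hex digest of a namespace name, so the
// raw namespace (which may reveal organizational structure) never leaves the cluster.
// Illustrative sketch; not the operator's actual implementation.
func hashNamespace(ns string) string {
	sum := sha256.Sum256([]byte(ns))
	return hex.EncodeToString(sum[:])
}

// categorizePVCSize buckets an exact PVC size (in Gi) into the coarse
// categories from the spec: small (<50Gi), medium (50-200Gi), large (>200Gi).
func categorizePVCSize(gi int) string {
	switch {
	case gi < 50:
		return "small"
	case gi <= 200:
		return "medium"
	default:
		return "large"
	}
}

func main() {
	fmt.Println(hashNamespace("documentdb-system")) // 64 hex characters
	fmt.Println(categorizePVCSize(10))              // small
	fmt.Println(categorizePVCSize(100))             // medium
}
```

Sending only the digest and the bucket keeps dimension cardinality low while still allowing events from the same namespace to be correlated.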
+
+## Configuration
+
+### Environment Variables
+
+Configure telemetry by setting these environment variables in the operator deployment:
+
+| Variable | Description | Required |
+|----------|-------------|----------|
+| `APPINSIGHTS_INSTRUMENTATIONKEY` | Application Insights instrumentation key | Yes (or connection string) |
+| `APPLICATIONINSIGHTS_CONNECTION_STRING` | Application Insights connection string (alternative to instrumentation key) | Yes (or instrumentation key) |
+| `DOCUMENTDB_TELEMETRY_ENABLED` | Set to `false` to disable telemetry collection | No (default: `true`) |
+
+### Helm Chart Configuration
+
+When installing via Helm, you can configure telemetry in your values.yaml:
+
+```yaml
+# values.yaml
+telemetry:
+  enabled: true
+  appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
+  # Or use connection string:
+  # appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
+```
+
+### Kubernetes Secret
+
+For production deployments, store the instrumentation key in a Kubernetes secret:
+
+```yaml
+apiVersion: v1
+kind: Secret
+metadata:
+  name: documentdb-operator-telemetry
+  namespace: documentdb-system
+type: Opaque
+stringData:
+  APPINSIGHTS_INSTRUMENTATIONKEY: "YOUR-INSTRUMENTATION-KEY-HERE"
+```
+
+Then reference it in the operator deployment:
+
+```yaml
+envFrom:
+  - secretRef:
+      name: documentdb-operator-telemetry
+```
+
+## Privacy & Data Collection
+
+### What We Collect
+
+The operator collects anonymous, aggregated telemetry data including:
+
+- **Operator lifecycle**: Startup events, health status, version information
+- **Cluster operations**: Create, update, delete events (with timing metrics)
+- **Backup operations**: Backup creation, completion, and expiration events
+- **Error tracking**: Categorized errors (no raw error messages with sensitive data)
+- **Performance metrics**: Reconciliation duration, API call latency
+
+### What We DON'T Collect
+
+To protect your privacy, we explicitly do NOT
collect:
+
+- Cluster names, namespace names, or any user-provided resource names
+- Connection strings, passwords, or credentials
+- IP addresses or hostnames
+- Storage class names (may contain organizational information)
+- Raw error messages (only categorized error types)
+- Container image names
+
+### Privacy Protection Mechanisms
+
+1. **GUIDs Instead of Names**: All resources are identified by auto-generated GUIDs stored in annotations (`telemetry.documentdb.io/cluster-id`)
+2. **Hashed Namespaces**: Namespace names are SHA-256 hashed before transmission
+3. **Categorized Data**: Values like PVC sizes are categorized (small/medium/large) instead of exact values
+4. **Error Sanitization**: Error messages are stripped of potential PII and truncated
+
+## Disabling Telemetry
+
+To completely disable telemetry collection:
+
+1. **Via environment variable**:
+   ```yaml
+   env:
+     - name: DOCUMENTDB_TELEMETRY_ENABLED
+       value: "false"
+   ```
+
+2. **Via Helm**:
+   ```yaml
+   telemetry:
+     enabled: false
+   ```
+
+3. **Don't provide instrumentation key**: If no `APPINSIGHTS_INSTRUMENTATIONKEY` or `APPLICATIONINSIGHTS_CONNECTION_STRING` is set, telemetry is automatically disabled.
+
+## Telemetry Events Reference
+
+See [appinsights-metrics.md](appinsights-metrics.md) for the complete specification of all telemetry events and metrics collected.
+
+## Troubleshooting
+
+### Telemetry Not Being Sent
+
+1. Verify the instrumentation key is correctly configured:
+   ```bash
+   kubectl get deployment documentdb-operator -n documentdb-system -o yaml | grep -A5 APPINSIGHTS
+   ```
+
+2. Check operator logs for telemetry initialization:
+   ```bash
+   kubectl logs -n documentdb-system -l app=documentdb-operator | grep -i telemetry
+   ```
+
+3. Verify network connectivity to Application Insights endpoint (`dc.services.visualstudio.com`)
+
+### High Cardinality Warnings
+
+If you see warnings about high cardinality dimensions, this indicates too many unique values for a dimension.
The telemetry system automatically samples high-frequency events to mitigate this. + +## Support + +For issues related to telemetry collection, please open an issue on the [GitHub repository](https://github.com/documentdb/documentdb-kubernetes-operator/issues). diff --git a/operator/src/cmd/main.go b/operator/src/cmd/main.go index e0370c12..794e5c86 100644 --- a/operator/src/cmd/main.go +++ b/operator/src/cmd/main.go @@ -4,6 +4,7 @@ package main import ( + "context" "crypto/tls" "flag" "os" @@ -29,10 +30,17 @@ import ( cnpgv1 "github.com/cloudnative-pg/cloudnative-pg/api/v1" dbpreview "github.com/documentdb/documentdb-operator/api/preview" "github.com/documentdb/documentdb-operator/internal/controller" + "github.com/documentdb/documentdb-operator/internal/telemetry" fleetv1alpha1 "go.goms.io/fleet-networking/api/v1alpha1" // +kubebuilder:scaffold:imports ) +// Version information - set via ldflags at build time +var ( + version = "dev" + helmChartVersion = "" +) + var ( scheme = runtime.NewScheme() setupLog = ctrl.Log.WithName("setup") @@ -211,29 +219,52 @@ func main() { os.Exit(1) } + // Initialize telemetry + telemetryMgr, err := telemetry.NewManager( + context.Background(), + telemetry.ManagerConfig{ + OperatorVersion: version, + HelmChartVersion: helmChartVersion, + Logger: setupLog, + }, + mgr.GetClient(), + clientset, + ) + if err != nil { + setupLog.Error(err, "unable to initialize telemetry manager") + // Continue without telemetry - it's not critical + } else { + telemetryMgr.Start() + defer telemetryMgr.Stop() + setupLog.Info("Telemetry initialized", "enabled", telemetryMgr.IsEnabled()) + } + if err = (&controller.DocumentDBReconciler{ - Client: mgr.GetClient(), - Scheme: mgr.GetScheme(), - Config: mgr.GetConfig(), - Clientset: clientset, + Client: mgr.GetClient(), + Scheme: mgr.GetScheme(), + Config: mgr.GetConfig(), + Clientset: clientset, + TelemetryMgr: telemetryMgr, }).SetupWithManager(mgr); err != nil { setupLog.Error(err, "unable to create 
controller", "controller", "DocumentDB") os.Exit(1) } if err = (&controller.BackupReconciler{ - Client: mgr.GetClient(), - Scheme: mgr.GetScheme(), - Recorder: mgr.GetEventRecorderFor("backup-controller"), + Client: mgr.GetClient(), + Scheme: mgr.GetScheme(), + Recorder: mgr.GetEventRecorderFor("backup-controller"), + TelemetryMgr: telemetryMgr, }).SetupWithManager(mgr); err != nil { setupLog.Error(err, "unable to create controller", "controller", "Backup") os.Exit(1) } if err = (&controller.ScheduledBackupReconciler{ - Client: mgr.GetClient(), - Scheme: mgr.GetScheme(), - Recorder: mgr.GetEventRecorderFor("scheduled-backup-controller"), + Client: mgr.GetClient(), + Scheme: mgr.GetScheme(), + Recorder: mgr.GetEventRecorderFor("scheduled-backup-controller"), + TelemetryMgr: telemetryMgr, }).SetupWithManager(mgr); err != nil { setupLog.Error(err, "unable to create controller", "controller", "ScheduledBackup") os.Exit(1) diff --git a/operator/src/go.mod b/operator/src/go.mod index e041db32..b7c4396d 100644 --- a/operator/src/go.mod +++ b/operator/src/go.mod @@ -9,6 +9,7 @@ require ( github.com/cloudnative-pg/cloudnative-pg v1.25.1 github.com/cloudnative-pg/machinery v0.1.0 github.com/go-logr/logr v1.4.2 + github.com/google/uuid v1.6.0 github.com/onsi/ginkgo/v2 v2.22.2 github.com/onsi/gomega v1.36.2 github.com/stretchr/testify v1.11.1 diff --git a/operator/src/internal/controller/backup_controller.go b/operator/src/internal/controller/backup_controller.go index d32707ef..748e630c 100644 --- a/operator/src/internal/controller/backup_controller.go +++ b/operator/src/internal/controller/backup_controller.go @@ -19,14 +19,16 @@ import ( "sigs.k8s.io/controller-runtime/pkg/log" dbpreview "github.com/documentdb/documentdb-operator/api/preview" + "github.com/documentdb/documentdb-operator/internal/telemetry" util "github.com/documentdb/documentdb-operator/internal/utils" ) // BackupReconciler reconciles a Backup object type BackupReconciler struct { client.Client - Scheme 
*runtime.Scheme - Recorder record.EventRecorder + Scheme *runtime.Scheme + Recorder record.EventRecorder + TelemetryMgr *telemetry.Manager } // Reconcile handles the reconciliation loop for Backup resources. diff --git a/operator/src/internal/controller/documentdb_controller.go b/operator/src/internal/controller/documentdb_controller.go index cb6d8010..3c919456 100644 --- a/operator/src/internal/controller/documentdb_controller.go +++ b/operator/src/internal/controller/documentdb_controller.go @@ -33,6 +33,7 @@ import ( dbpreview "github.com/documentdb/documentdb-operator/api/preview" cnpg "github.com/documentdb/documentdb-operator/internal/cnpg" + "github.com/documentdb/documentdb-operator/internal/telemetry" util "github.com/documentdb/documentdb-operator/internal/utils" ) @@ -44,9 +45,10 @@ const ( // DocumentDBReconciler reconciles a DocumentDB object type DocumentDBReconciler struct { client.Client - Scheme *runtime.Scheme - Config *rest.Config - Clientset kubernetes.Interface + Scheme *runtime.Scheme + Config *rest.Config + Clientset kubernetes.Interface + TelemetryMgr *telemetry.Manager } var reconcileMutex sync.Mutex @@ -55,6 +57,7 @@ var reconcileMutex sync.Mutex // +kubebuilder:rbac:groups=documentdb.io,resources=dbs/status,verbs=get;update;patch // +kubebuilder:rbac:groups=documentdb.io,resources=dbs/finalizers,verbs=update func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { + reconcileStart := time.Now() reconcileMutex.Lock() defer reconcileMutex.Unlock() @@ -73,9 +76,22 @@ func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) return ctrl.Result{}, nil } logger.Error(err, "Failed to get DocumentDB resource") + r.trackReconcileError(ctx, "DocumentDB", req.Name, req.Namespace, "get-resource", err) return ctrl.Result{}, err } + // Ensure cluster has telemetry ID + if r.TelemetryMgr != nil && r.TelemetryMgr.IsEnabled() { + if _, err := r.TelemetryMgr.GUIDs.GetOrCreateClusterID(ctx, 
documentdb); err != nil { + logger.V(1).Info("Failed to create telemetry ID for cluster", "error", err) + } + } + + // Track reconciliation at the end + defer func() { + r.trackReconcileDuration(ctx, "DocumentDB", "reconcile", time.Since(reconcileStart).Seconds(), err == nil) + }() + replicationContext, err := util.GetReplicationContext(ctx, r.Client, *documentdb) if err != nil { logger.Error(err, "Failed to determine replication context") @@ -469,3 +485,89 @@ func (r *DocumentDBReconciler) executeSQLCommand(ctx context.Context, cluster *c return stdout.String(), nil } + +// trackReconcileError tracks reconciliation errors to telemetry. +func (r *DocumentDBReconciler) trackReconcileError(ctx context.Context, resourceType, resourceName, namespace, errorType string, err error) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + r.TelemetryMgr.Events.TrackReconciliationError(telemetry.ReconciliationErrorEvent{ + ResourceType: resourceType, + ResourceID: resourceName, // Will be replaced with GUID when available + NamespaceHash: telemetry.HashNamespace(namespace), + ErrorType: errorType, + ErrorMessage: sanitizeError(err), + ResolutionStatus: "pending", + }) +} + +// trackReconcileDuration tracks reconciliation duration to telemetry. +func (r *DocumentDBReconciler) trackReconcileDuration(ctx context.Context, resourceType, operation string, durationSeconds float64, success bool) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + status := "success" + if !success { + status = "error" + } + + r.TelemetryMgr.Metrics.TrackReconciliationDuration(telemetry.ReconciliationDurationMetric{ + ResourceType: resourceType, + Operation: operation, + Status: status, + DurationSeconds: durationSeconds, + }) +} + +// TrackClusterCreated tracks when a new cluster is created. 
+func (r *DocumentDBReconciler) TrackClusterCreated(ctx context.Context, documentdb *dbpreview.DocumentDB, durationSeconds float64) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + clusterID := r.TelemetryMgr.GUIDs.GetClusterID(documentdb) + bootstrapType := "new" + if documentdb.Spec.Bootstrap != nil && documentdb.Spec.Bootstrap.Recovery != nil { + bootstrapType = "recovery" + } + + r.TelemetryMgr.Events.TrackClusterCreated(telemetry.ClusterCreatedEvent{ + ClusterID: clusterID, + NamespaceHash: telemetry.HashNamespace(documentdb.Namespace), + CreationDurationSeconds: durationSeconds, + NodeCount: documentdb.Spec.NodeCount, + InstancesPerNode: documentdb.Spec.InstancesPerNode, + StorageSize: documentdb.Spec.Resource.Storage.PvcSize, + CloudProvider: telemetry.MapCloudProviderToString(documentdb.Spec.Environment), + TLSEnabled: documentdb.Spec.TLS != nil, + BootstrapType: bootstrapType, + SidecarInjectorPlugin: documentdb.Spec.SidecarInjectorPluginName != "", + ServiceType: documentdb.Spec.ExposeViaService.ServiceType, + }) + + // Also track cluster configuration metric + r.TelemetryMgr.Metrics.TrackClusterConfiguration(telemetry.ClusterConfigurationMetric{ + ClusterID: clusterID, + NamespaceHash: telemetry.HashNamespace(documentdb.Namespace), + NodeCount: documentdb.Spec.NodeCount, + InstancesPerNode: documentdb.Spec.InstancesPerNode, + TotalInstances: documentdb.Spec.NodeCount * documentdb.Spec.InstancesPerNode, + PVCSizeCategory: telemetry.CategorizePVCSize(documentdb.Spec.Resource.Storage.PvcSize), + DocumentDBVersion: documentdb.Spec.DocumentDBVersion, + }) +} + +// sanitizeError removes potential PII from error messages. +func sanitizeError(err error) string { + if err == nil { + return "" + } + msg := err.Error() + // Truncate long messages + if len(msg) > 200 { + msg = msg[:200] + "..." 
+ } + return msg +} diff --git a/operator/src/internal/controller/scheduledbackup_controller.go b/operator/src/internal/controller/scheduledbackup_controller.go index 0086eedf..bb8cea9b 100644 --- a/operator/src/internal/controller/scheduledbackup_controller.go +++ b/operator/src/internal/controller/scheduledbackup_controller.go @@ -18,13 +18,15 @@ import ( "sigs.k8s.io/controller-runtime/pkg/log" dbpreview "github.com/documentdb/documentdb-operator/api/preview" + "github.com/documentdb/documentdb-operator/internal/telemetry" ) // ScheduledBackupReconciler reconciles a ScheduledBackup object type ScheduledBackupReconciler struct { client.Client - Scheme *runtime.Scheme - Recorder record.EventRecorder + Scheme *runtime.Scheme + Recorder record.EventRecorder + TelemetryMgr *telemetry.Manager } // Reconcile handles the reconciliation loop for ScheduledBackup resources. diff --git a/operator/src/internal/telemetry/client.go b/operator/src/internal/telemetry/client.go new file mode 100644 index 00000000..a69f0241 --- /dev/null +++ b/operator/src/internal/telemetry/client.go @@ -0,0 +1,444 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package telemetry + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "net/http" + "os" + "sync" + "time" + + "github.com/go-logr/logr" +) + +const ( + // DefaultBatchInterval is the default interval for batching telemetry events. + DefaultBatchInterval = 30 * time.Second + // DefaultMaxBatchSize is the maximum number of events to batch before sending. + DefaultMaxBatchSize = 100 + // DefaultMaxRetries is the maximum number of retries for failed telemetry submissions. + DefaultMaxRetries = 3 + // DefaultRetryBaseDelay is the base delay for exponential backoff retries. + DefaultRetryBaseDelay = 1 * time.Second + // DefaultBufferSize is the size of the local buffer for events when AppInsights is unreachable. 
+ DefaultBufferSize = 1000 + + // EnvAppInsightsKey is the environment variable for the Application Insights instrumentation key. + EnvAppInsightsKey = "APPINSIGHTS_INSTRUMENTATIONKEY" + // EnvAppInsightsConnectionString is the environment variable for the Application Insights connection string. + EnvAppInsightsConnectionString = "APPLICATIONINSIGHTS_CONNECTION_STRING" + // EnvTelemetryEnabled is the environment variable to enable/disable telemetry. + EnvTelemetryEnabled = "DOCUMENTDB_TELEMETRY_ENABLED" + + // AppInsightsTrackEndpoint is the Application Insights ingestion endpoint. + AppInsightsTrackEndpoint = "https://dc.services.visualstudio.com/v2/track" +) + +// TelemetryClient handles sending telemetry to Application Insights. +type TelemetryClient struct { + instrumentationKey string + ingestionEndpoint string + enabled bool + operatorContext *OperatorContext + logger logr.Logger + + // Batching + eventBuffer []telemetryEnvelope + bufferMutex sync.Mutex + batchInterval time.Duration + maxBatchSize int + + // Retry and buffering + maxRetries int + retryBaseDelay time.Duration + localBuffer []telemetryEnvelope + localMutex sync.Mutex + maxBufferSize int + + // HTTP client + httpClient *http.Client + + // Shutdown + stopChan chan struct{} + wg sync.WaitGroup +} + +// telemetryEnvelope wraps events for Application Insights API. +type telemetryEnvelope struct { + Name string `json:"name"` + Time string `json:"time"` + IKey string `json:"iKey"` + Tags map[string]string `json:"tags"` + Data telemetryData `json:"data"` +} + +type telemetryData struct { + BaseType string `json:"baseType"` + BaseData map[string]interface{} `json:"baseData"` +} + +// ClientOption configures the TelemetryClient. +type ClientOption func(*TelemetryClient) + +// WithBatchInterval sets the batch interval for sending telemetry. 
+func WithBatchInterval(interval time.Duration) ClientOption { + return func(c *TelemetryClient) { + c.batchInterval = interval + } +} + +// WithMaxBatchSize sets the maximum batch size. +func WithMaxBatchSize(size int) ClientOption { + return func(c *TelemetryClient) { + c.maxBatchSize = size + } +} + +// WithLogger sets the logger for the telemetry client. +func WithLogger(logger logr.Logger) ClientOption { + return func(c *TelemetryClient) { + c.logger = logger + } +} + +// NewTelemetryClient creates a new TelemetryClient. +func NewTelemetryClient(ctx *OperatorContext, opts ...ClientOption) *TelemetryClient { + client := &TelemetryClient{ + operatorContext: ctx, + enabled: true, + batchInterval: DefaultBatchInterval, + maxBatchSize: DefaultMaxBatchSize, + maxRetries: DefaultMaxRetries, + retryBaseDelay: DefaultRetryBaseDelay, + maxBufferSize: DefaultBufferSize, + eventBuffer: make([]telemetryEnvelope, 0), + localBuffer: make([]telemetryEnvelope, 0), + ingestionEndpoint: AppInsightsTrackEndpoint, + httpClient: &http.Client{ + Timeout: 30 * time.Second, + }, + stopChan: make(chan struct{}), + } + + // Apply options + for _, opt := range opts { + opt(client) + } + + // Check if telemetry is enabled + if enabled := os.Getenv(EnvTelemetryEnabled); enabled == "false" { + client.enabled = false + if client.logger.GetSink() != nil { + client.logger.Info("Telemetry collection is disabled via environment variable") + } + return client + } + + // Get instrumentation key from environment + client.instrumentationKey = os.Getenv(EnvAppInsightsKey) + if client.instrumentationKey == "" { + // Try connection string + connStr := os.Getenv(EnvAppInsightsConnectionString) + client.instrumentationKey = parseInstrumentationKeyFromConnectionString(connStr) + } + + if client.instrumentationKey == "" { + client.enabled = false + if client.logger.GetSink() != nil { + client.logger.Info("No Application Insights instrumentation key found, telemetry disabled") + } + return client + } + + 
return client +} + +// Start begins the background batch processing goroutine. +func (c *TelemetryClient) Start() { + if !c.enabled { + return + } + + c.wg.Add(1) + go c.batchProcessor() +} + +// Stop gracefully stops the telemetry client and flushes remaining events. +func (c *TelemetryClient) Stop() { + if !c.enabled { + return + } + + close(c.stopChan) + c.wg.Wait() + + // Flush any remaining events + c.flush() +} + +// IsEnabled returns whether telemetry collection is enabled. +func (c *TelemetryClient) IsEnabled() bool { + return c.enabled +} + +// TrackEvent sends a custom event to Application Insights. +func (c *TelemetryClient) TrackEvent(eventName string, properties map[string]interface{}) { + if !c.enabled { + return + } + + envelope := c.createEnvelope("Microsoft.ApplicationInsights.Event", map[string]interface{}{ + "name": eventName, + "properties": c.addContextProperties(properties), + }) + + c.bufferMutex.Lock() + c.eventBuffer = append(c.eventBuffer, envelope) + shouldFlush := len(c.eventBuffer) >= c.maxBatchSize + c.bufferMutex.Unlock() + + if shouldFlush { + go c.flush() + } +} + +// TrackMetric sends a metric to Application Insights. +func (c *TelemetryClient) TrackMetric(metricName string, value float64, properties map[string]interface{}) { + if !c.enabled { + return + } + + envelope := c.createEnvelope("Microsoft.ApplicationInsights.Metric", map[string]interface{}{ + "metrics": []map[string]interface{}{ + { + "name": metricName, + "value": value, + }, + }, + "properties": c.addContextProperties(properties), + }) + + c.bufferMutex.Lock() + c.eventBuffer = append(c.eventBuffer, envelope) + c.bufferMutex.Unlock() +} + +// TrackException sends an exception/error to Application Insights. 
+func (c *TelemetryClient) TrackException(err error, properties map[string]interface{}) {
+	if !c.enabled || err == nil {
+		return
+	}
+
+	// Sanitize error message to remove potential PII
+	sanitizedMessage := sanitizeErrorMessage(err.Error())
+
+	envelope := c.createEnvelope("Microsoft.ApplicationInsights.Exception", map[string]interface{}{
+		"exceptions": []map[string]interface{}{
+			{
+				"message": sanitizedMessage,
+			},
+		},
+		"properties": c.addContextProperties(properties),
+	})
+
+	c.bufferMutex.Lock()
+	c.eventBuffer = append(c.eventBuffer, envelope)
+	c.bufferMutex.Unlock()
+}
+
+// createEnvelope creates a telemetry envelope for Application Insights.
+func (c *TelemetryClient) createEnvelope(baseType string, baseData map[string]interface{}) telemetryEnvelope {
+	// The envelope name carries the full event type (e.g.
+	// "Microsoft.ApplicationInsights.Event"), while the inner baseType must be
+	// the data-contract name ("EventData", "MetricData", "ExceptionData")
+	// expected by the ingestion API.
+	dataType := map[string]string{
+		"Microsoft.ApplicationInsights.Event":     "EventData",
+		"Microsoft.ApplicationInsights.Metric":    "MetricData",
+		"Microsoft.ApplicationInsights.Exception": "ExceptionData",
+	}[baseType]
+	if dataType == "" {
+		dataType = baseType
+	}
+	return telemetryEnvelope{
+		Name: baseType,
+		Time: time.Now().UTC().Format(time.RFC3339Nano),
+		IKey: c.instrumentationKey,
+		Tags: map[string]string{
+			"ai.cloud.role":         "documentdb-operator",
+			"ai.cloud.roleInstance": c.operatorContext.OperatorNamespaceHash,
+			"ai.application.ver":    c.operatorContext.OperatorVersion,
+		},
+		Data: telemetryData{
+			BaseType: dataType,
+			BaseData: baseData,
+		},
+	}
+}
+
+// addContextProperties adds operator context to event properties.
+func (c *TelemetryClient) addContextProperties(properties map[string]interface{}) map[string]interface{} {
+	if properties == nil {
+		properties = make(map[string]interface{})
+	}
+
+	// Add operator context (these are added to all events as per spec)
+	properties["kubernetes_distribution"] = string(c.operatorContext.KubernetesDistribution)
+	properties["kubernetes_version"] = c.operatorContext.KubernetesVersion
+	properties["operator_version"] = c.operatorContext.OperatorVersion
+
+	if c.operatorContext.Region != "" {
+		properties["region"] = c.operatorContext.Region
+	}
+
+	return properties
+}
+
+// batchProcessor runs in the background to periodically send batched events. 
+func (c *TelemetryClient) batchProcessor() { + defer c.wg.Done() + + ticker := time.NewTicker(c.batchInterval) + defer ticker.Stop() + + for { + select { + case <-ticker.C: + c.flush() + case <-c.stopChan: + return + } + } +} + +// flush sends all buffered events to Application Insights. +func (c *TelemetryClient) flush() { + c.bufferMutex.Lock() + if len(c.eventBuffer) == 0 { + c.bufferMutex.Unlock() + // Also try to send locally buffered events + c.flushLocalBuffer() + return + } + + events := c.eventBuffer + c.eventBuffer = make([]telemetryEnvelope, 0) + c.bufferMutex.Unlock() + + // Send events with retry + if err := c.sendWithRetry(events); err != nil { + // Store in local buffer if send fails + c.localMutex.Lock() + c.localBuffer = append(c.localBuffer, events...) + // Trim buffer if it exceeds max size + if len(c.localBuffer) > c.maxBufferSize { + c.localBuffer = c.localBuffer[len(c.localBuffer)-c.maxBufferSize:] + } + c.localMutex.Unlock() + + if c.logger.GetSink() != nil { + c.logger.Error(err, "Failed to send telemetry, buffered locally", "eventCount", len(events)) + } + } +} + +// flushLocalBuffer attempts to send locally buffered events. +func (c *TelemetryClient) flushLocalBuffer() { + c.localMutex.Lock() + if len(c.localBuffer) == 0 { + c.localMutex.Unlock() + return + } + + events := c.localBuffer + c.localBuffer = make([]telemetryEnvelope, 0) + c.localMutex.Unlock() + + if err := c.sendWithRetry(events); err != nil { + // Put back in buffer + c.localMutex.Lock() + c.localBuffer = append(events, c.localBuffer...) + if len(c.localBuffer) > c.maxBufferSize { + c.localBuffer = c.localBuffer[:c.maxBufferSize] + } + c.localMutex.Unlock() + } +} + +// sendWithRetry sends events to Application Insights with exponential backoff retry. 
+func (c *TelemetryClient) sendWithRetry(events []telemetryEnvelope) error {
+	var lastErr error
+
+	for attempt := 0; attempt < c.maxRetries; attempt++ {
+		if attempt > 0 {
+			// Exponential backoff: base delay doubled on each retry.
+			delay := c.retryBaseDelay * time.Duration(1<<uint(attempt-1))
+			time.Sleep(delay)
+		}
+
+		if err := c.send(events); err != nil {
+			lastErr = err
+			continue
+		}
+
+		return nil
+	}
+
+	return lastErr
+}
+
+// send serializes a batch of envelopes and posts it to the ingestion endpoint.
+func (c *TelemetryClient) send(events []telemetryEnvelope) error {
+	payload, err := json.Marshal(events)
+	if err != nil {
+		return fmt.Errorf("failed to marshal telemetry events: %w", err)
+	}
+
+	req, err := http.NewRequestWithContext(context.Background(), http.MethodPost, c.ingestionEndpoint, bytes.NewReader(payload))
+	if err != nil {
+		return err
+	}
+	req.Header.Set("Content-Type", "application/json")
+
+	resp, err := c.httpClient.Do(req)
+	if err != nil {
+		return err
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode >= 400 {
+		return fmt.Errorf("telemetry submission failed with status: %d", resp.StatusCode)
+	}
+
+	return nil
+}
+
+// parseInstrumentationKeyFromConnectionString extracts the instrumentation key from a connection string.
+func parseInstrumentationKeyFromConnectionString(connStr string) string {
+	if connStr == "" {
+		return ""
+	}
+
+	// Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;...
+	for _, part := range bytes.Split([]byte(connStr), []byte(";")) {
+		if bytes.HasPrefix(part, []byte("InstrumentationKey=")) {
+			return string(bytes.TrimPrefix(part, []byte("InstrumentationKey=")))
+		}
+	}
+
+	return ""
+}
+
+// sanitizeErrorMessage removes potential PII from error messages.
+func sanitizeErrorMessage(msg string) string {
+	// Basic sanitization - in production, this should be more comprehensive
+	// Remove potential file paths, IP addresses, etc.
+	// For now, truncate to reasonable length
+	const maxLength = 500
+	if len(msg) > maxLength {
+		msg = msg[:maxLength] + "..."
+	}
+	return msg
+}
diff --git a/operator/src/internal/telemetry/events.go b/operator/src/internal/telemetry/events.go
new file mode 100644
index 00000000..a1f65ba3
--- /dev/null
+++ b/operator/src/internal/telemetry/events.go
@@ -0,0 +1,180 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+package telemetry
+
+import (
+	"time"
+)
+
+// EventTracker provides high-level methods for tracking telemetry events.
+type EventTracker struct {
+	client      *TelemetryClient
+	guidManager *GUIDManager
+}
+
+// NewEventTracker creates a new EventTracker. 
+func NewEventTracker(client *TelemetryClient, guidManager *GUIDManager) *EventTracker { + return &EventTracker{ + client: client, + guidManager: guidManager, + } +} + +// TrackOperatorStartup tracks the OperatorStartup event. +func (t *EventTracker) TrackOperatorStartup(event OperatorStartupEvent) { + t.client.TrackEvent("OperatorStartup", map[string]interface{}{ + "operator_version": event.OperatorVersion, + "kubernetes_version": event.KubernetesVersion, + "cloud_provider": event.CloudProvider, + "startup_timestamp": event.StartupTimestamp.Format(time.RFC3339), + "restart_count": event.RestartCount, + "helm_chart_version": event.HelmChartVersion, + }) +} + +// TrackClusterCreated tracks the ClusterCreated event. +func (t *EventTracker) TrackClusterCreated(event ClusterCreatedEvent) { + t.client.TrackEvent("ClusterCreated", map[string]interface{}{ + "cluster_id": event.ClusterID, + "namespace_hash": event.NamespaceHash, + "creation_duration_seconds": event.CreationDurationSeconds, + "node_count": event.NodeCount, + "instances_per_node": event.InstancesPerNode, + "storage_size": event.StorageSize, + "cloud_provider": event.CloudProvider, + "tls_enabled": event.TLSEnabled, + "bootstrap_type": event.BootstrapType, + "sidecar_injector_plugin": event.SidecarInjectorPlugin, + "service_type": event.ServiceType, + }) +} + +// TrackClusterUpdated tracks the ClusterUpdated event. +func (t *EventTracker) TrackClusterUpdated(event ClusterUpdatedEvent) { + t.client.TrackEvent("ClusterUpdated", map[string]interface{}{ + "cluster_id": event.ClusterID, + "namespace_hash": event.NamespaceHash, + "update_type": event.UpdateType, + "update_duration_seconds": event.UpdateDurationSeconds, + }) +} + +// TrackClusterDeleted tracks the ClusterDeleted event. 
+func (t *EventTracker) TrackClusterDeleted(event ClusterDeletedEvent) { + t.client.TrackEvent("ClusterDeleted", map[string]interface{}{ + "cluster_id": event.ClusterID, + "namespace_hash": event.NamespaceHash, + "deletion_duration_seconds": event.DeletionDurationSeconds, + "cluster_age_days": event.ClusterAgeDays, + "backup_count": event.BackupCount, + }) +} + +// TrackBackupCreated tracks the BackupCreated event. +func (t *EventTracker) TrackBackupCreated(event BackupCreatedEvent) { + t.client.TrackEvent("BackupCreated", map[string]interface{}{ + "backup_id": event.BackupID, + "cluster_id": event.ClusterID, + "namespace_hash": event.NamespaceHash, + "backup_type": event.BackupType, + "backup_method": event.BackupMethod, + "backup_size_bytes": event.BackupSizeBytes, + "backup_duration_seconds": event.BackupDurationSeconds, + "retention_days": event.RetentionDays, + "backup_phase": event.BackupPhase, + "cloud_provider": event.CloudProvider, + "is_primary_cluster": event.IsPrimaryCluster, + }) +} + +// TrackBackupDeleted tracks the BackupDeleted event. +func (t *EventTracker) TrackBackupDeleted(event BackupDeletedEvent) { + t.client.TrackEvent("BackupDeleted", map[string]interface{}{ + "backup_id": event.BackupID, + "deletion_reason": event.DeletionReason, + "backup_age_days": event.BackupAgeDays, + }) +} + +// TrackScheduledBackupCreated tracks the ScheduledBackupCreated event. +func (t *EventTracker) TrackScheduledBackupCreated(event ScheduledBackupCreatedEvent) { + t.client.TrackEvent("ScheduledBackupCreated", map[string]interface{}{ + "scheduled_backup_id": event.ScheduledBackupID, + "cluster_id": event.ClusterID, + "schedule_frequency": event.ScheduleFrequency, + "retention_days": event.RetentionDays, + }) +} + +// TrackClusterRestored tracks the ClusterRestored event. 
+func (t *EventTracker) TrackClusterRestored(event ClusterRestoredEvent) { + t.client.TrackEvent("ClusterRestored", map[string]interface{}{ + "new_cluster_id": event.NewClusterID, + "source_backup_id": event.SourceBackupID, + "namespace_hash": event.NamespaceHash, + "restore_duration_seconds": event.RestoreDurationSeconds, + "backup_age_hours": event.BackupAgeHours, + "restore_phase": event.RestorePhase, + }) +} + +// TrackFailoverOccurred tracks the FailoverOccurred event. +func (t *EventTracker) TrackFailoverOccurred(event FailoverOccurredEvent) { + t.client.TrackEvent("FailoverOccurred", map[string]interface{}{ + "cluster_id": event.ClusterID, + "namespace_hash": event.NamespaceHash, + "failover_type": event.FailoverType, + "old_primary_index": event.OldPrimaryIndex, + "new_primary_index": event.NewPrimaryIndex, + "failover_duration_seconds": event.FailoverDurationSeconds, + "downtime_seconds": event.DowntimeSeconds, + "replication_lag_bytes": event.ReplicationLagBytes, + "trigger_reason": event.TriggerReason, + }) +} + +// TrackReconciliationError tracks the ReconciliationError event. +func (t *EventTracker) TrackReconciliationError(event ReconciliationErrorEvent) { + t.client.TrackEvent("ReconciliationError", map[string]interface{}{ + "resource_type": event.ResourceType, + "resource_id": event.ResourceID, + "namespace_hash": event.NamespaceHash, + "error_type": event.ErrorType, + "error_message": event.ErrorMessage, + "error_code": event.ErrorCode, + "retry_count": event.RetryCount, + "resolution_status": event.ResolutionStatus, + }) +} + +// TrackVolumeSnapshotError tracks the VolumeSnapshotError event. 
+func (t *EventTracker) TrackVolumeSnapshotError(event VolumeSnapshotErrorEvent) { + t.client.TrackEvent("VolumeSnapshotError", map[string]interface{}{ + "backup_id": event.BackupID, + "cluster_id": event.ClusterID, + "error_type": event.ErrorType, + "csi_driver_type": event.CSIDriverType, + "cloud_provider": event.CloudProvider, + }) +} + +// TrackCNPGIntegrationError tracks the CNPGIntegrationError event. +func (t *EventTracker) TrackCNPGIntegrationError(event CNPGIntegrationErrorEvent) { + t.client.TrackEvent("CNPGIntegrationError", map[string]interface{}{ + "cluster_id": event.ClusterID, + "cnpg_resource_type": event.CNPGResourceType, + "error_category": event.ErrorCategory, + "operation": event.Operation, + }) +} + +// TrackBackupExpired tracks the BackupExpired event. +func (t *EventTracker) TrackBackupExpired(event BackupExpiredEvent) { + t.client.TrackEvent("BackupExpired", map[string]interface{}{ + "backup_id": event.BackupID, + "cluster_id": event.ClusterID, + "retention_days": event.RetentionDays, + "actual_age_days": event.ActualAgeDays, + }) +} diff --git a/operator/src/internal/telemetry/guid.go b/operator/src/internal/telemetry/guid.go new file mode 100644 index 00000000..21310a75 --- /dev/null +++ b/operator/src/internal/telemetry/guid.go @@ -0,0 +1,113 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package telemetry + +import ( + "context" + + "github.com/google/uuid" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "sigs.k8s.io/controller-runtime/pkg/client" +) + +// GUIDManager handles generation and retrieval of telemetry GUIDs. +type GUIDManager struct { + client client.Client +} + +// NewGUIDManager creates a new GUIDManager. +func NewGUIDManager(c client.Client) *GUIDManager { + return &GUIDManager{client: c} +} + +// GetOrCreateClusterID retrieves or creates a telemetry GUID for a cluster. 
+func (m *GUIDManager) GetOrCreateClusterID(ctx context.Context, obj client.Object) (string, error) { + return m.getOrCreateID(ctx, obj, ClusterIDAnnotation) +} + +// GetOrCreateBackupID retrieves or creates a telemetry GUID for a backup. +func (m *GUIDManager) GetOrCreateBackupID(ctx context.Context, obj client.Object) (string, error) { + return m.getOrCreateID(ctx, obj, BackupIDAnnotation) +} + +// GetOrCreateScheduledBackupID retrieves or creates a telemetry GUID for a scheduled backup. +func (m *GUIDManager) GetOrCreateScheduledBackupID(ctx context.Context, obj client.Object) (string, error) { + return m.getOrCreateID(ctx, obj, ScheduledBackupIDAnnotation) +} + +// GetClusterID retrieves the telemetry GUID for a cluster without creating one. +func (m *GUIDManager) GetClusterID(obj client.Object) string { + return getAnnotation(obj, ClusterIDAnnotation) +} + +// GetBackupID retrieves the telemetry GUID for a backup without creating one. +func (m *GUIDManager) GetBackupID(obj client.Object) string { + return getAnnotation(obj, BackupIDAnnotation) +} + +// getOrCreateID retrieves or creates a GUID in the specified annotation. +func (m *GUIDManager) getOrCreateID(ctx context.Context, obj client.Object, annotationKey string) (string, error) { + // Check if ID already exists + existingID := getAnnotation(obj, annotationKey) + if existingID != "" { + return existingID, nil + } + + // Generate new UUID + newID := uuid.New().String() + + // Update the object with the new annotation + annotations := obj.GetAnnotations() + if annotations == nil { + annotations = make(map[string]string) + } + annotations[annotationKey] = newID + obj.SetAnnotations(annotations) + + // Persist the annotation + if m.client != nil { + if err := m.client.Update(ctx, obj); err != nil { + return newID, err + } + } + + return newID, nil +} + +// SetClusterID sets a telemetry GUID for a cluster (without persisting). +// Useful when creating new resources. 
+func SetClusterID(obj metav1.Object) string { + return setAnnotation(obj, ClusterIDAnnotation) +} + +// SetBackupID sets a telemetry GUID for a backup (without persisting). +func SetBackupID(obj metav1.Object) string { + return setAnnotation(obj, BackupIDAnnotation) +} + +// SetScheduledBackupID sets a telemetry GUID for a scheduled backup (without persisting). +func SetScheduledBackupID(obj metav1.Object) string { + return setAnnotation(obj, ScheduledBackupIDAnnotation) +} + +// getAnnotation safely retrieves an annotation value. +func getAnnotation(obj client.Object, key string) string { + annotations := obj.GetAnnotations() + if annotations == nil { + return "" + } + return annotations[key] +} + +// setAnnotation sets a new UUID in an annotation and returns it. +func setAnnotation(obj metav1.Object, key string) string { + newID := uuid.New().String() + annotations := obj.GetAnnotations() + if annotations == nil { + annotations = make(map[string]string) + } + annotations[key] = newID + obj.SetAnnotations(annotations) + return newID +} diff --git a/operator/src/internal/telemetry/manager.go b/operator/src/internal/telemetry/manager.go new file mode 100644 index 00000000..4e2fa8a8 --- /dev/null +++ b/operator/src/internal/telemetry/manager.go @@ -0,0 +1,241 @@ +// Copyright (c) Microsoft Corporation. +// Licensed under the MIT License. + +package telemetry + +import ( + "context" + "os" + "time" + + "github.com/go-logr/logr" + metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" + "k8s.io/client-go/kubernetes" + "sigs.k8s.io/controller-runtime/pkg/client" +) + +// Manager is the main entry point for telemetry operations. +type Manager struct { + Client *TelemetryClient + Events *EventTracker + Metrics *MetricsTracker + GUIDs *GUIDManager + operatorCtx *OperatorContext + logger logr.Logger +} + +// ManagerConfig contains configuration for the telemetry manager. 
+type ManagerConfig struct {
+	OperatorVersion  string
+	HelmChartVersion string
+	Logger           logr.Logger
+}
+
+// NewManager creates a new telemetry Manager.
+func NewManager(ctx context.Context, cfg ManagerConfig, k8sClient client.Client, clientset kubernetes.Interface) (*Manager, error) {
+	// Detect operator context
+	operatorCtx, err := detectOperatorContext(ctx, cfg, clientset)
+	if err != nil {
+		cfg.Logger.Error(err, "Failed to detect operator context, using defaults")
+		operatorCtx = &OperatorContext{
+			OperatorVersion:        cfg.OperatorVersion,
+			KubernetesDistribution: DistributionOther,
+			CloudProvider:          CloudProviderUnknown,
+			StartupTimestamp:       time.Now(),
+		}
+	}
+
+	// Create telemetry client
+	telemetryClient := NewTelemetryClient(operatorCtx, WithLogger(cfg.Logger))
+
+	// Create GUID manager
+	guidManager := NewGUIDManager(k8sClient)
+
+	// Create event and metrics trackers
+	eventTracker := NewEventTracker(telemetryClient, guidManager)
+	metricsTracker := NewMetricsTracker(telemetryClient)
+
+	return &Manager{
+		Client:      telemetryClient,
+		Events:      eventTracker,
+		Metrics:     metricsTracker,
+		GUIDs:       guidManager,
+		operatorCtx: operatorCtx,
+		logger:      cfg.Logger,
+	}, nil
+}
+
+// Start begins telemetry collection.
+func (m *Manager) Start() {
+	m.Client.Start()
+
+	// Send operator startup event
+	m.Events.TrackOperatorStartup(OperatorStartupEvent{
+		OperatorVersion:   m.operatorCtx.OperatorVersion,
+		KubernetesVersion: m.operatorCtx.KubernetesVersion,
+		CloudProvider:     string(m.operatorCtx.CloudProvider),
+		StartupTimestamp:  m.operatorCtx.StartupTimestamp,
+		RestartCount:      getRestartCount(),
+		HelmChartVersion:  m.operatorCtx.HelmChartVersion,
+	})
+
+	m.logger.Info("Telemetry collection started",
+		"enabled", m.Client.IsEnabled(),
+		"operatorVersion", m.operatorCtx.OperatorVersion,
+		"k8sVersion", m.operatorCtx.KubernetesVersion,
+	)
+}
+
+// Stop gracefully stops telemetry collection.
+func (m *Manager) Stop() {
+	m.Client.Stop()
+	m.logger.Info("Telemetry collection stopped")
+}
+
+// IsEnabled returns whether telemetry is enabled.
+func (m *Manager) IsEnabled() bool {
+	return m.Client.IsEnabled()
+}
+
+// GetOperatorContext returns the detected operator context.
+func (m *Manager) GetOperatorContext() *OperatorContext {
+	return m.operatorCtx
+}
+
+// detectOperatorContext detects the deployment environment.
+func detectOperatorContext(ctx context.Context, cfg ManagerConfig, clientset kubernetes.Interface) (*OperatorContext, error) {
+	opCtx := &OperatorContext{
+		OperatorVersion:  cfg.OperatorVersion,
+		HelmChartVersion: cfg.HelmChartVersion,
+		StartupTimestamp: time.Now(),
+	}
+
+	// Get Kubernetes version
+	if clientset != nil {
+		discoveryClient := clientset.Discovery()
+		if serverVersion, err := discoveryClient.ServerVersion(); err == nil {
+			opCtx.KubernetesVersion = serverVersion.GitVersion
+			opCtx.KubernetesDistribution = DetectKubernetesDistribution(serverVersion.GitVersion)
+		}
+	}
+
+	// Detect cloud provider from environment or node labels
+	opCtx.CloudProvider = detectCloudProvider(ctx, clientset)
+
+	// Get operator namespace (hashed)
+	if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
+		opCtx.OperatorNamespaceHash = HashNamespace(ns)
+	}
+
+	// Detect region from node labels if possible
+	opCtx.Region = detectRegion(ctx, clientset)
+
+	// Detect installation method
+	opCtx.InstallationMethod = detectInstallationMethod()
+
+	return opCtx, nil
+}
+
+// detectCloudProvider attempts to detect the cloud provider.
+func detectCloudProvider(ctx context.Context, clientset kubernetes.Interface) CloudProvider {
+	if clientset == nil {
+		return CloudProviderUnknown
+	}
+
+	// Try to detect from node labels or provider ID
+	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{Limit: 1})
+	if err != nil || len(nodes.Items) == 0 {
+		return CloudProviderUnknown
+	}
+
+	node := nodes.Items[0]
+
+	// Check provider ID
+	providerID := node.Spec.ProviderID
+	switch {
+	case containsAny(providerID, "azure", "aks"):
+		return CloudProviderAKS
+	case containsAny(providerID, "aws", "eks"):
+		return CloudProviderEKS
+	case containsAny(providerID, "gce", "gke"):
+		return CloudProviderGKE
+	}
+
+	// Check node labels
+	labels := node.Labels
+	if labels != nil {
+		if _, ok := labels["kubernetes.azure.com/cluster"]; ok {
+			return CloudProviderAKS
+		}
+		if _, ok := labels["eks.amazonaws.com/nodegroup"]; ok {
+			return CloudProviderEKS
+		}
+		if _, ok := labels["cloud.google.com/gke-nodepool"]; ok {
+			return CloudProviderGKE
+		}
+	}
+
+	return CloudProviderUnknown
+}
+
+// detectRegion attempts to detect the cloud region.
+func detectRegion(ctx context.Context, clientset kubernetes.Interface) string {
+	if clientset == nil {
+		return ""
+	}
+
+	nodes, err := clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{Limit: 1})
+	if err != nil || len(nodes.Items) == 0 {
+		return ""
+	}
+
+	node := nodes.Items[0]
+	if labels := node.Labels; labels != nil {
+		// Standard Kubernetes topology label
+		if region, ok := labels["topology.kubernetes.io/region"]; ok {
+			return region
+		}
+		// Fallback to failure-domain label (deprecated but still used)
+		if region, ok := labels["failure-domain.beta.kubernetes.io/region"]; ok {
+			return region
+		}
+	}
+
+	return ""
+}
+
+// detectInstallationMethod attempts to detect how the operator was installed.
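The provider-ID heuristic used above can be illustrated without a cluster. This sketch substitutes `strings.Contains` for the patch's `containsAny` helper and uses made-up provider-ID strings shaped like the real ones (`azure://…` on AKS, `aws://…` on EKS, `gce://…` on GKE); it is an approximation, not the operator's actual detection path.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyProvider mirrors the substring heuristic applied to
// node.Spec.ProviderID in detectCloudProvider.
func classifyProvider(providerID string) string {
	switch {
	case strings.Contains(providerID, "azure") || strings.Contains(providerID, "aks"):
		return "aks"
	case strings.Contains(providerID, "aws") || strings.Contains(providerID, "eks"):
		return "eks"
	case strings.Contains(providerID, "gce") || strings.Contains(providerID, "gke"):
		return "gke"
	}
	return "unknown"
}

func main() {
	// Hypothetical provider IDs for illustration only.
	fmt.Println(classifyProvider("azure:///subscriptions/0000/virtualMachines/vm-0"))
	fmt.Println(classifyProvider("aws:///us-east-1a/i-0abc123"))
	fmt.Println(classifyProvider("kind://docker/kind/kind-control-plane"))
}
```

Because the heuristic only fires on the first matching substring, ordering the cases from most to least specific is what keeps, say, an ID containing both "aws" and "gke-like" tokens deterministic.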
+func detectInstallationMethod() string {
+	// Check for Helm-specific annotations/labels
+	if os.Getenv("HELM_RELEASE_NAME") != "" {
+		return "helm"
+	}
+
+	// Check for OLM (Operator Lifecycle Manager)
+	if os.Getenv("OPERATOR_CONDITION_NAME") != "" {
+		return "operator-sdk"
+	}
+
+	return "kubectl"
+}
+
+// getRestartCount returns the restart count (simplified implementation).
+func getRestartCount() int {
+	// In a real implementation, this would track restarts
+	// For now, return 0 as this is initial startup
+	return 0
+}
+
+// containsAny checks if s contains any of the substrings.
+func containsAny(s string, substrings ...string) bool {
+	for _, sub := range substrings {
+		if sub == "" {
+			continue
+		}
+		for i := 0; i+len(sub) <= len(s); i++ {
+			if s[i:i+len(sub)] == sub {
+				return true
+			}
+		}
+	}
+	return false
+}
diff --git a/operator/src/internal/telemetry/metrics.go b/operator/src/internal/telemetry/metrics.go
new file mode 100644
index 00000000..990c3745
--- /dev/null
+++ b/operator/src/internal/telemetry/metrics.go
@@ -0,0 +1,157 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+package telemetry
+
+// MetricsTracker provides high-level methods for tracking telemetry metrics.
+type MetricsTracker struct {
+	client *TelemetryClient
+}
+
+// NewMetricsTracker creates a new MetricsTracker.
+func NewMetricsTracker(client *TelemetryClient) *MetricsTracker {
+	return &MetricsTracker{
+		client: client,
+	}
+}
+
+// TrackOperatorHealthStatus tracks the operator health status metric.
+func (m *MetricsTracker) TrackOperatorHealthStatus(healthy bool, podName, namespaceHash string) {
+	value := 0.0
+	if healthy {
+		value = 1.0
+	}
+	m.client.TrackMetric("operator.health.status", value, map[string]interface{}{
+		"pod_name":       podName,
+		"namespace_hash": namespaceHash,
+	})
+}
+
+// TrackActiveClustersCount tracks the number of active DocumentDB clusters.
+func (m *MetricsTracker) TrackActiveClustersCount(count int, namespaceHash, cloudProvider, environment string) {
+	m.client.TrackMetric("documentdb.clusters.active.count", float64(count), map[string]interface{}{
+		"namespace_hash": namespaceHash,
+		"cloud_provider": cloudProvider,
+		"environment":    environment,
+	})
+}
+
+// TrackClusterConfiguration tracks cluster configuration metrics.
+func (m *MetricsTracker) TrackClusterConfiguration(metric ClusterConfigurationMetric) {
+	m.client.TrackMetric("documentdb.cluster.configuration", 1, map[string]interface{}{
+		"cluster_id":         metric.ClusterID,
+		"namespace_hash":     metric.NamespaceHash,
+		"node_count":         metric.NodeCount,
+		"instances_per_node": metric.InstancesPerNode,
+		"total_instances":    metric.TotalInstances,
+		"pvc_size_category":  string(metric.PVCSizeCategory),
+		"documentdb_version": metric.DocumentDBVersion,
+	})
+}
+
+// TrackReplicationEnabled tracks replication configuration metrics.
+func (m *MetricsTracker) TrackReplicationEnabled(enabled bool, metric ReplicationEnabledMetric) {
+	value := 0.0
+	if enabled {
+		value = 1.0
+	}
+	m.client.TrackMetric("documentdb.cluster.replication.enabled", value, map[string]interface{}{
+		"cluster_id":                      metric.ClusterID,
+		"cross_cloud_networking_strategy": metric.CrossCloudNetworkingStrategy,
+		"primary_cluster_id":              metric.PrimaryClusterID,
+		"replica_count":                   metric.ReplicaCount,
+		"high_availability":               metric.HighAvailability,
+		"participating_cluster_count":     metric.ParticipatingClusterCount,
+		"environments":                    metric.Environments,
+	})
+}
+
+// TrackActiveBackupsCount tracks the number of active backups.
+func (m *MetricsTracker) TrackActiveBackupsCount(count int, namespaceHash, clusterID, backupType string) {
+	m.client.TrackMetric("documentdb.backups.active.count", float64(count), map[string]interface{}{
+		"namespace_hash": namespaceHash,
+		"cluster_id":     clusterID,
+		"backup_type":    backupType,
+	})
+}
+
+// TrackScheduledBackupsCount tracks the number of active scheduled backup jobs.
+func (m *MetricsTracker) TrackScheduledBackupsCount(count int) {
+	m.client.TrackMetric("documentdb.scheduled_backups.active.count", float64(count), nil)
+}
+
+// TrackReplicationLag tracks replication lag metrics.
+func (m *MetricsTracker) TrackReplicationLag(metric ReplicationLagMetric) {
+	m.client.TrackMetric("documentdb.replication.lag.bytes", float64(metric.AvgLagBytes), map[string]interface{}{
+		"cluster_id":         metric.ClusterID,
+		"replica_cluster_id": metric.ReplicaClusterID,
+		"namespace_hash":     metric.NamespaceHash,
+		"min_lag_bytes":      metric.MinLagBytes,
+		"max_lag_bytes":      metric.MaxLagBytes,
+		"avg_lag_bytes":      metric.AvgLagBytes,
+	})
+}
+
+// TrackReplicationStatus tracks replication health status.
+func (m *MetricsTracker) TrackReplicationStatus(healthy bool, clusterID, replicaClusterID, namespaceHash string) {
+	value := 0.0
+	if healthy {
+		value = 1.0
+	}
+	m.client.TrackMetric("documentdb.replication.status", value, map[string]interface{}{
+		"cluster_id":         clusterID,
+		"replica_cluster_id": replicaClusterID,
+		"namespace_hash":     namespaceHash,
+	})
+}
+
+// TrackTLSEnabledCount tracks the number of clusters with TLS enabled.
+func (m *MetricsTracker) TrackTLSEnabledCount(count int, tlsMode string, serverEnabled, clientEnabled bool) {
+	m.client.TrackMetric("documentdb.tls.enabled.count", float64(count), map[string]interface{}{
+		"tls_mode":           tlsMode,
+		"server_tls_enabled": serverEnabled,
+		"client_tls_enabled": clientEnabled,
+	})
+}
+
+// TrackServiceExposureCount tracks service exposure methods.
+func (m *MetricsTracker) TrackServiceExposureCount(count int, serviceType, cloudProvider string) {
+	m.client.TrackMetric("documentdb.service_exposure.count", float64(count), map[string]interface{}{
+		"service_type":   serviceType,
+		"cloud_provider": cloudProvider,
+	})
+}
+
+// TrackPluginUsageCount tracks plugin usage.
+func (m *MetricsTracker) TrackPluginUsageCount(sidecarInjectorEnabled, walReplicaEnabled bool) {
+	m.client.TrackMetric("documentdb.plugin.usage.count", 1, map[string]interface{}{
+		"sidecar_injector_plugin_enabled": sidecarInjectorEnabled,
+		"wal_replica_plugin_enabled":      walReplicaEnabled,
+	})
+}
+
+// TrackReconciliationDuration tracks reconciliation performance.
+func (m *MetricsTracker) TrackReconciliationDuration(metric ReconciliationDurationMetric) {
+	m.client.TrackMetric("documentdb.reconciliation.duration.seconds", metric.DurationSeconds, map[string]interface{}{
+		"resource_type": metric.ResourceType,
+		"operation":     metric.Operation,
+		"status":        metric.Status,
+	})
+}
+
+// TrackAPICallDuration tracks Kubernetes API call latency.
+func (m *MetricsTracker) TrackAPICallDuration(durationSeconds float64, operation, resourceType, result string) {
+	m.client.TrackMetric("documentdb.api.duration.seconds", durationSeconds, map[string]interface{}{
+		"operation":     operation,
+		"resource_type": resourceType,
+		"result":        result,
+	})
+}
+
+// TrackBackupRetentionDays tracks backup retention policy.
+func (m *MetricsTracker) TrackBackupRetentionDays(retentionDays int, clusterID, policyLevel string) {
+	m.client.TrackMetric("documentdb.backup.retention.days", float64(retentionDays), map[string]interface{}{
+		"cluster_id":   clusterID,
+		"policy_level": policyLevel,
+	})
+}
diff --git a/operator/src/internal/telemetry/types.go b/operator/src/internal/telemetry/types.go
new file mode 100644
index 00000000..ac756eaf
--- /dev/null
+++ b/operator/src/internal/telemetry/types.go
@@ -0,0 +1,247 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+// Package telemetry provides Application Insights integration for the DocumentDB Kubernetes Operator.
+// It implements telemetry collection as specified in docs/designs/appinsights-metrics.md.
+package telemetry
+
+import (
+	"time"
+)
+
+// Annotation keys used for telemetry correlation.
+const (
+	// ClusterIDAnnotation is the annotation key for storing the auto-generated cluster GUID.
+	ClusterIDAnnotation = "telemetry.documentdb.io/cluster-id"
+	// BackupIDAnnotation is the annotation key for storing the auto-generated backup GUID.
+	BackupIDAnnotation = "telemetry.documentdb.io/backup-id"
+	// ScheduledBackupIDAnnotation is the annotation key for storing the auto-generated scheduled backup GUID.
+	ScheduledBackupIDAnnotation = "telemetry.documentdb.io/scheduled-backup-id"
+)
+
+// CloudProvider represents the detected cloud environment.
+type CloudProvider string
+
+const (
+	CloudProviderAKS     CloudProvider = "aks"
+	CloudProviderEKS     CloudProvider = "eks"
+	CloudProviderGKE     CloudProvider = "gke"
+	CloudProviderUnknown CloudProvider = "unknown"
+)
+
+// KubernetesDistribution represents the detected Kubernetes distribution.
+type KubernetesDistribution string
+
+const (
+	DistributionAKS         KubernetesDistribution = "aks"
+	DistributionEKS         KubernetesDistribution = "eks"
+	DistributionGKE         KubernetesDistribution = "gke"
+	DistributionOpenShift   KubernetesDistribution = "openshift"
+	DistributionRancher     KubernetesDistribution = "rancher"
+	DistributionVMwareTanzu KubernetesDistribution = "vmware-tanzu"
+	DistributionOther       KubernetesDistribution = "other"
+)
+
+// PVCSizeCategory categorizes PVC sizes without exposing exact values.
+type PVCSizeCategory string
+
+const (
+	PVCSizeSmall  PVCSizeCategory = "small"  // <50Gi
+	PVCSizeMedium PVCSizeCategory = "medium" // 50-200Gi
+	PVCSizeLarge  PVCSizeCategory = "large"  // >200Gi
+)
+
+// ScheduleFrequency categorizes backup schedule frequency.
+type ScheduleFrequency string
+
+const (
+	ScheduleFrequencyHourly ScheduleFrequency = "hourly"
+	ScheduleFrequencyDaily  ScheduleFrequency = "daily"
+	ScheduleFrequencyWeekly ScheduleFrequency = "weekly"
+	ScheduleFrequencyCustom ScheduleFrequency = "custom"
+)
+
+// OperatorContext contains deployment context collected at startup.
+type OperatorContext struct {
+	OperatorVersion        string
+	KubernetesVersion      string
+	KubernetesDistribution KubernetesDistribution
+	CloudProvider          CloudProvider
+	Region                 string
+	OperatorNamespaceHash  string
+	InstallationMethod     string
+	HelmChartVersion       string
+	StartupTimestamp       time.Time
+}
+
+// OperatorStartupEvent represents the OperatorStartup telemetry event.
+type OperatorStartupEvent struct {
+	OperatorVersion   string    `json:"operator_version"`
+	KubernetesVersion string    `json:"kubernetes_version"`
+	CloudProvider     string    `json:"cloud_provider"`
+	StartupTimestamp  time.Time `json:"startup_timestamp"`
+	RestartCount      int       `json:"restart_count"`
+	HelmChartVersion  string    `json:"helm_chart_version,omitempty"`
+}
+
+// ClusterCreatedEvent represents the ClusterCreated telemetry event.
+type ClusterCreatedEvent struct {
+	ClusterID               string  `json:"cluster_id"`
+	NamespaceHash           string  `json:"namespace_hash"`
+	CreationDurationSeconds float64 `json:"creation_duration_seconds"`
+	NodeCount               int     `json:"node_count"`
+	InstancesPerNode        int     `json:"instances_per_node"`
+	StorageSize             string  `json:"storage_size"`
+	CloudProvider           string  `json:"cloud_provider"`
+	TLSEnabled              bool    `json:"tls_enabled"`
+	BootstrapType           string  `json:"bootstrap_type"`
+	SidecarInjectorPlugin   bool    `json:"sidecar_injector_plugin"`
+	ServiceType             string  `json:"service_type"`
+}
+
+// ClusterUpdatedEvent represents the ClusterUpdated telemetry event.
+type ClusterUpdatedEvent struct {
+	ClusterID             string  `json:"cluster_id"`
+	NamespaceHash         string  `json:"namespace_hash"`
+	UpdateType            string  `json:"update_type"`
+	UpdateDurationSeconds float64 `json:"update_duration_seconds"`
+}
+
+// ClusterDeletedEvent represents the ClusterDeleted telemetry event.
+type ClusterDeletedEvent struct {
+	ClusterID               string  `json:"cluster_id"`
+	NamespaceHash           string  `json:"namespace_hash"`
+	DeletionDurationSeconds float64 `json:"deletion_duration_seconds"`
+	ClusterAgeDays          int     `json:"cluster_age_days"`
+	BackupCount             int     `json:"backup_count"`
+}
+
+// BackupCreatedEvent represents the BackupCreated telemetry event.
+type BackupCreatedEvent struct {
+	BackupID              string  `json:"backup_id"`
+	ClusterID             string  `json:"cluster_id"`
+	NamespaceHash         string  `json:"namespace_hash"`
+	BackupType            string  `json:"backup_type"` // on-demand or scheduled
+	BackupMethod          string  `json:"backup_method"`
+	BackupSizeBytes       int64   `json:"backup_size_bytes"`
+	BackupDurationSeconds float64 `json:"backup_duration_seconds"`
+	RetentionDays         int     `json:"retention_days"`
+	BackupPhase           string  `json:"backup_phase"`
+	CloudProvider         string  `json:"cloud_provider"`
+	IsPrimaryCluster      bool    `json:"is_primary_cluster"`
+}
+
+// BackupDeletedEvent represents the BackupDeleted telemetry event.
+type BackupDeletedEvent struct {
+	BackupID       string `json:"backup_id"`
+	DeletionReason string `json:"deletion_reason"` // expired, manual, cluster-deleted
+	BackupAgeDays  int    `json:"backup_age_days"`
+}
+
+// ScheduledBackupCreatedEvent represents the ScheduledBackupCreated telemetry event.
+type ScheduledBackupCreatedEvent struct {
+	ScheduledBackupID string `json:"scheduled_backup_id"`
+	ClusterID         string `json:"cluster_id"`
+	ScheduleFrequency string `json:"schedule_frequency"`
+	RetentionDays     int    `json:"retention_days"`
+}
+
+// ClusterRestoredEvent represents the ClusterRestored telemetry event.
+type ClusterRestoredEvent struct {
+	NewClusterID           string  `json:"new_cluster_id"`
+	SourceBackupID         string  `json:"source_backup_id"`
+	NamespaceHash          string  `json:"namespace_hash"`
+	RestoreDurationSeconds float64 `json:"restore_duration_seconds"`
+	BackupAgeHours         float64 `json:"backup_age_hours"`
+	RestorePhase           string  `json:"restore_phase"`
+}
+
+// FailoverOccurredEvent represents the FailoverOccurred telemetry event.
+type FailoverOccurredEvent struct {
+	ClusterID               string  `json:"cluster_id"`
+	NamespaceHash           string  `json:"namespace_hash"`
+	FailoverType            string  `json:"failover_type"` // automatic, manual, switchover
+	OldPrimaryIndex         int     `json:"old_primary_index"`
+	NewPrimaryIndex         int     `json:"new_primary_index"`
+	FailoverDurationSeconds float64 `json:"failover_duration_seconds"`
+	DowntimeSeconds         float64 `json:"downtime_seconds"`
+	ReplicationLagBytes     int64   `json:"replication_lag_bytes"`
+	TriggerReason           string  `json:"trigger_reason"`
+}
+
+// ReconciliationErrorEvent represents the ReconciliationError telemetry event.
+type ReconciliationErrorEvent struct {
+	ResourceType     string `json:"resource_type"` // DocumentDB, Backup, ScheduledBackup
+	ResourceID       string `json:"resource_id"`
+	NamespaceHash    string `json:"namespace_hash"`
+	ErrorType        string `json:"error_type"`
+	ErrorMessage     string `json:"error_message"` // Sanitized, no PII
+	ErrorCode        string `json:"error_code"`
+	RetryCount       int    `json:"retry_count"`
+	ResolutionStatus string `json:"resolution_status"` // pending, resolved, failed
+}
+
+// VolumeSnapshotErrorEvent represents the VolumeSnapshotError telemetry event.
+type VolumeSnapshotErrorEvent struct {
+	BackupID      string `json:"backup_id"`
+	ClusterID     string `json:"cluster_id"`
+	ErrorType     string `json:"error_type"`
+	CSIDriverType string `json:"csi_driver_type"`
+	CloudProvider string `json:"cloud_provider"`
+}
+
+// CNPGIntegrationErrorEvent represents the CNPGIntegrationError telemetry event.
+type CNPGIntegrationErrorEvent struct {
+	ClusterID        string `json:"cluster_id"`
+	CNPGResourceType string `json:"cnpg_resource_type"`
+	ErrorCategory    string `json:"error_category"`
+	Operation        string `json:"operation"`
+}
+
+// BackupExpiredEvent represents the BackupExpired telemetry event.
+type BackupExpiredEvent struct {
+	BackupID      string `json:"backup_id"`
+	ClusterID     string `json:"cluster_id"`
+	RetentionDays int    `json:"retention_days"`
+	ActualAgeDays int    `json:"actual_age_days"`
+}
+
+// ClusterConfigurationMetric represents cluster configuration metrics.
+type ClusterConfigurationMetric struct {
+	ClusterID         string          `json:"cluster_id"`
+	NamespaceHash     string          `json:"namespace_hash"`
+	NodeCount         int             `json:"node_count"`
+	InstancesPerNode  int             `json:"instances_per_node"`
+	TotalInstances    int             `json:"total_instances"`
+	PVCSizeCategory   PVCSizeCategory `json:"pvc_size_category"`
+	DocumentDBVersion string          `json:"documentdb_version"`
+}
+
+// ReplicationEnabledMetric represents replication configuration metrics.
+type ReplicationEnabledMetric struct {
+	ClusterID                    string `json:"cluster_id"`
+	CrossCloudNetworkingStrategy string `json:"cross_cloud_networking_strategy"`
+	PrimaryClusterID             string `json:"primary_cluster_id"`
+	ReplicaCount                 int    `json:"replica_count"`
+	HighAvailability             bool   `json:"high_availability"`
+	ParticipatingClusterCount    int    `json:"participating_cluster_count"`
+	Environments                 string `json:"environments"`
+}
+
+// ReplicationLagMetric represents replication lag metrics (aggregated).
+type ReplicationLagMetric struct {
+	ClusterID        string `json:"cluster_id"`
+	ReplicaClusterID string `json:"replica_cluster_id"`
+	NamespaceHash    string `json:"namespace_hash"`
+	MinLagBytes      int64  `json:"min_lag_bytes"`
+	MaxLagBytes      int64  `json:"max_lag_bytes"`
+	AvgLagBytes      int64  `json:"avg_lag_bytes"`
+}
+
+// ReconciliationDurationMetric represents reconciliation performance metrics.
+type ReconciliationDurationMetric struct {
+	ResourceType    string  `json:"resource_type"`
+	Operation       string  `json:"operation"`
+	Status          string  `json:"status"`
+	DurationSeconds float64 `json:"duration_seconds"`
+}
diff --git a/operator/src/internal/telemetry/utils.go b/operator/src/internal/telemetry/utils.go
new file mode 100644
index 00000000..2c95c533
--- /dev/null
+++ b/operator/src/internal/telemetry/utils.go
@@ -0,0 +1,124 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+package telemetry
+
+import (
+	"crypto/sha256"
+	"encoding/hex"
+	"strings"
+
+	"k8s.io/apimachinery/pkg/api/resource"
+)
+
+// HashNamespace creates a SHA-256 hash of a namespace name for privacy.
+func HashNamespace(namespace string) string {
+	hash := sha256.Sum256([]byte(namespace))
+	return hex.EncodeToString(hash[:])
+}
+
+// CategorizePVCSize categorizes a PVC size string into small/medium/large.
+func CategorizePVCSize(pvcSize string) PVCSizeCategory {
+	if pvcSize == "" {
+		return PVCSizeSmall
+	}
+
+	quantity, err := resource.ParseQuantity(pvcSize)
+	if err != nil {
+		return PVCSizeSmall
+	}
+
+	// Convert to Gi for comparison
+	sizeGi := quantity.Value() / (1024 * 1024 * 1024)
+
+	switch {
+	case sizeGi < 50:
+		return PVCSizeSmall
+	case sizeGi <= 200:
+		return PVCSizeMedium
+	default:
+		return PVCSizeLarge
+	}
+}
+
+// CategorizeScheduleFrequency categorizes a cron expression into frequency categories.
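The privacy guarantee behind `HashNamespace` is that telemetry carries only an opaque, stable token in place of the namespace name. The snippet below re-implements the same SHA-256 hex encoding with only the standard library, so the property can be checked in isolation; it is an illustration, not the operator's code path.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashNamespace mirrors HashNamespace: SHA-256 over the raw name,
// rendered as lowercase hex.
func hashNamespace(namespace string) string {
	sum := sha256.Sum256([]byte(namespace))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Same input always yields the same 64-character digest, so
	// dashboards can group by namespace without ever seeing its name.
	fmt.Println(hashNamespace("production"))
	fmt.Println(len(hashNamespace("production")))
}
```

The digest is deterministic, so counts and durations for one namespace still aggregate correctly across reports, while distinct namespaces map to distinct tokens.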
+func CategorizeScheduleFrequency(cronExpr string) ScheduleFrequency {
+	if cronExpr == "" {
+		return ScheduleFrequencyCustom
+	}
+
+	parts := strings.Fields(cronExpr)
+	if len(parts) < 5 {
+		return ScheduleFrequencyCustom
+	}
+
+	// Simple heuristics for common patterns
+	minute, hour, dayOfMonth, _, dayOfWeek := parts[0], parts[1], parts[2], parts[3], parts[4]
+
+	// Hourly: runs every hour (e.g., "0 * * * *")
+	if minute != "*" && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*" {
+		return ScheduleFrequencyHourly
+	}
+
+	// Daily: runs once per day (e.g., "0 2 * * *")
+	if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek == "*" {
+		return ScheduleFrequencyDaily
+	}
+
+	// Weekly: runs once per week (e.g., "0 2 * * 0")
+	if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek != "*" {
+		return ScheduleFrequencyWeekly
+	}
+
+	return ScheduleFrequencyCustom
+}
+
+// CategorizeCSIDriver categorizes a CSI driver name.
+func CategorizeCSIDriver(driverName string) string {
+	switch {
+	case strings.Contains(driverName, "azure") || strings.Contains(driverName, "disk.csi.azure.com"):
+		return "azure-disk"
+	case strings.Contains(driverName, "aws") || strings.Contains(driverName, "ebs.csi.aws.com"):
+		return "aws-ebs"
+	case strings.Contains(driverName, "gce") || strings.Contains(driverName, "pd.csi.storage.gke.io"):
+		return "gce-pd"
+	default:
+		return "other"
+	}
+}
+
+// MapCloudProviderToString maps a spec.environment string to a cloud provider label.
+func MapCloudProviderToString(env string) string {
+	switch strings.ToLower(env) {
+	case "aks":
+		return "aks"
+	case "eks":
+		return "eks"
+	case "gke":
+		return "gke"
+	default:
+		return "unknown"
+	}
+}
+
+// DetectKubernetesDistribution detects the Kubernetes distribution from version info.
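The cron heuristic above can be checked against common schedules. The sketch below mirrors the same field comparisons with plain strings instead of the `ScheduleFrequency` type; the sample expressions are illustrative, not taken from any real configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// categorize mirrors the heuristic in CategorizeScheduleFrequency:
// a fixed minute with wildcard hour/day fields reads as hourly,
// fixed minute+hour as daily, and fixed minute+hour+weekday as weekly.
func categorize(cronExpr string) string {
	parts := strings.Fields(cronExpr)
	if len(parts) < 5 {
		return "custom"
	}
	minute, hour, dayOfMonth, dayOfWeek := parts[0], parts[1], parts[2], parts[4]
	switch {
	case minute != "*" && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*":
		return "hourly"
	case minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek == "*":
		return "daily"
	case minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek != "*":
		return "weekly"
	}
	return "custom"
}

func main() {
	fmt.Println(categorize("0 * * * *")) // top of every hour
	fmt.Println(categorize("0 2 * * *")) // 02:00 every day
	fmt.Println(categorize("0 2 * * 0")) // 02:00 every Sunday
	fmt.Println(categorize("0 2 1 * *")) // 02:00 on the 1st: monthly, so custom
}
```

Note the heuristic is intentionally coarse: only the wildcard shape of the fields is inspected, so the raw schedule string never needs to leave the cluster in telemetry.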
+func DetectKubernetesDistribution(versionInfo string) KubernetesDistribution {
+	versionLower := strings.ToLower(versionInfo)
+
+	switch {
+	case strings.Contains(versionLower, "eks"):
+		return DistributionEKS
+	case strings.Contains(versionLower, "aks") || strings.Contains(versionLower, "azure"):
+		return DistributionAKS
+	case strings.Contains(versionLower, "gke"):
+		return DistributionGKE
+	case strings.Contains(versionLower, "openshift"):
+		return DistributionOpenShift
+	case strings.Contains(versionLower, "rancher") || strings.Contains(versionLower, "rke"):
+		return DistributionRancher
+	case strings.Contains(versionLower, "tanzu") || strings.Contains(versionLower, "vmware"):
+		return DistributionVMwareTanzu
+	default:
+		return DistributionOther
+	}
+}

From 453da4e8bed19e324fcc1330245c367d875e49eb Mon Sep 17 00:00:00 2001
From: Ritvik Jayaswal
Date: Thu, 12 Feb 2026 13:03:02 -0500
Subject: [PATCH 4/7] feat: Add Application Insights telemetry integration

- Integrate official Microsoft Application Insights Go SDK (v0.4.4)
- Add telemetry tracking for cluster lifecycle events (create, update)
- Add telemetry tracking for backup events (create, delete, scheduled)
- Add reconciliation duration metrics
- Add RBAC for node read permissions (cloud provider detection)
- Add Helm chart configuration for telemetry (values.yaml, deployment)
- Support connection string, instrumentation key, and existing secret
---
 .../templates/05_clusterrole.yaml             |   4 +
 .../templates/09_documentdb_operator.yaml     |  21 +
 operator/documentdb-helm-chart/values.yaml    |  12 +
 operator/src/go.mod                           |   7 +-
 operator/src/go.sum                           |  26 +-
 .../internal/controller/backup_controller.go  |  65 +++
 .../controller/documentdb_controller.go       |  25 +
 .../controller/scheduledbackup_controller.go  |  50 ++
 operator/src/internal/telemetry/client.go     | 426 +++++------------
 9 files changed, 318 insertions(+), 318 deletions(-)

diff --git a/operator/documentdb-helm-chart/templates/05_clusterrole.yaml b/operator/documentdb-helm-chart/templates/05_clusterrole.yaml
index bc8393e0..47618a2a 100644
--- a/operator/documentdb-helm-chart/templates/05_clusterrole.yaml
+++ b/operator/documentdb-helm-chart/templates/05_clusterrole.yaml
@@ -56,3 +56,7 @@ rules:
 - apiGroups: ["snapshot.storage.k8s.io"]
   resources: ["volumesnapshotclasses"]
   verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
+# Node read permissions for telemetry cloud provider detection
+- apiGroups: [""]
+  resources: ["nodes"]
+  verbs: ["get", "list"]
diff --git a/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml b/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml
index cdf0022d..22125dee 100644
--- a/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml
+++ b/operator/documentdb-helm-chart/templates/09_documentdb_operator.yaml
@@ -24,7 +24,28 @@ spec:
         env:
         - name: GATEWAY_PORT
           value: "10260"
+        - name: POD_NAMESPACE
+          valueFrom:
+            fieldRef:
+              fieldPath: metadata.namespace
 {{- if .Values.documentDbVersion | default .Chart.AppVersion }}
         - name: DOCUMENTDB_VERSION
           value: "{{ .Values.documentDbVersion | default .Chart.AppVersion }}"
 {{- end }}
+        # Telemetry configuration
+        {{- if not .Values.telemetry.enabled }}
+        - name: DOCUMENTDB_TELEMETRY_ENABLED
+          value: "false"
+        {{- else }}
+        {{- if .Values.telemetry.existingSecret }}
+        envFrom:
+        - secretRef:
+            name: {{ .Values.telemetry.existingSecret }}
+        {{- else if .Values.telemetry.connectionString }}
+        - name: APPLICATIONINSIGHTS_CONNECTION_STRING
+          value: {{ .Values.telemetry.connectionString | quote }}
+        {{- else if .Values.telemetry.instrumentationKey }}
+        - name: APPINSIGHTS_INSTRUMENTATIONKEY
+          value: {{ .Values.telemetry.instrumentationKey | quote }}
+        {{- end }}
+        {{- end }}
diff --git a/operator/documentdb-helm-chart/values.yaml b/operator/documentdb-helm-chart/values.yaml
index 65c3628b..1c70b65a 100644
--- a/operator/documentdb-helm-chart/values.yaml
+++ b/operator/documentdb-helm-chart/values.yaml
@@ -6,6 +6,18 @@ replicaCount: 1
 # Defaults to Chart.appVersion if not specified
 documentDbVersion: ""
 
+# Telemetry configuration for Application Insights
+telemetry:
+  # Enable or disable telemetry collection
+  enabled: true
+  # Application Insights instrumentation key (provide either this or connectionString)
+  instrumentationKey: ""
+  # Application Insights connection string (alternative to instrumentationKey)
+  connectionString: ""
+  # Name of existing secret containing telemetry credentials
+  # Secret should have keys: APPINSIGHTS_INSTRUMENTATIONKEY or APPLICATIONINSIGHTS_CONNECTION_STRING
+  existingSecret: ""
+
 serviceAccount:
   create: true
   automount: true
diff --git a/operator/src/go.mod b/operator/src/go.mod
index b7c4396d..195af065 100644
--- a/operator/src/go.mod
+++ b/operator/src/go.mod
@@ -10,6 +10,7 @@ require (
 	github.com/cloudnative-pg/machinery v0.1.0
 	github.com/go-logr/logr v1.4.2
 	github.com/google/uuid v1.6.0
+	github.com/microsoft/ApplicationInsights-Go v0.4.4
 	github.com/onsi/ginkgo/v2 v2.22.2
 	github.com/onsi/gomega v1.36.2
 	github.com/stretchr/testify v1.11.1
@@ -21,6 +22,11 @@ require (
 	sigs.k8s.io/controller-runtime v0.20.4
 )
 
+require (
+	code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c // indirect
+	github.com/gofrs/uuid v3.3.0+incompatible // indirect
+)
+
 require (
 	cel.dev/expr v0.19.0 // indirect
 	github.com/antlr4-go/antlr/v4 v4.13.0 // indirect
@@ -50,7 +56,6 @@ require (
 	github.com/google/go-cmp v0.6.0 // indirect
 	github.com/google/gofuzz v1.2.0 // indirect
 	github.com/google/pprof v0.0.0-20241210010833-40e02aabc2ad // indirect
-	github.com/google/uuid v1.6.0 // indirect
 	github.com/gorilla/websocket v1.5.0 // indirect
 	github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 // indirect
 	github.com/inconshreveable/mousetrap v1.1.0 // indirect
diff --git a/operator/src/go.sum b/operator/src/go.sum
index 98897027..96593408 100644
--- a/operator/src/go.sum
+++ b/operator/src/go.sum
@@ -1,5 +1,7 @@
 cel.dev/expr v0.19.0 h1:lXuo+nDhpyJSpWxpPVi5cPUwzKb+dsdOiw6IreM5yt0=
 cel.dev/expr v0.19.0/go.mod h1:MrpN08Q+lEBs+bGYdLxxHkZoUSsCp0nSKTs0nTymJgw=
+code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c h1:5eeuG0BHx1+DHeT3AP+ISKZ2ht1UjGhm581ljqYpVeQ=
+code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c/go.mod h1:QD9Lzhd/ux6eNQVUDVRJX/RKTigpewimNYBi7ivZKY8=
 github.com/antlr4-go/antlr/v4 v4.13.0 h1:lxCg3LAv+EUK6t1i0y1V6/SLeUi0eKEKdhQAlS8TVTI=
 github.com/antlr4-go/antlr/v4 v4.13.0/go.mod h1:pfChB/xh/Unjila75QW7+VU4TSnWnnk9UTnmpPaOR2g=
 github.com/armon/go-socks5 v0.0.0-20160902184237-e75332964ef5 h1:0CwZNZbxp69SHPdPJAN/hZIm0C4OItdklCFmMRWYpio=
@@ -35,6 +37,7 @@ github.com/evanphx/json-patch/v5 v5.9.11 h1:/8HVnzMq13/3x9TPvjG08wUGqBTmZBsCWzjT
 github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/XEtnUf6OZxqIQTM=
 github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
 github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U=
+github.com/fsnotify/fsnotify v1.4.7/go.mod h1:jwhsz4b93w/PPRr/qN1Yymfu8t87LnFCMoQvtojpjFo=
 github.com/fsnotify/fsnotify v1.7.0 h1:8JEhPFa5W2WU7YfeZzPNqzMP6Lwt7L2715Ggo0nosvA=
 github.com/fsnotify/fsnotify v1.7.0/go.mod h1:40Bi/Hjc2AVfZrqy+aj+yEI+/bRxZnMJyTJwOpGvigM=
 github.com/fxamacker/cbor/v2 v2.7.0 h1:iM5WgngdRBanHcxugY4JySA0nk1wZorNOpTgCMedv5E=
@@ -52,11 +55,13 @@ github.com/go-openapi/jsonreference v0.21.0 h1:Rs+Y7hSXT83Jacb7kFyjn4ijOuVGSvOdF
 github.com/go-openapi/jsonreference v0.21.0/go.mod h1:LmZmgsrTkVg9LG4EaHeY8cBDslNPMo06cago5JNLkm4=
 github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+GrE=
 github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ=
-github.com/go-task/slim-sprig v0.0.0-20230315185526-52ccab3ef572 h1:tfuBGBXKqDEevZMzYi5KSi8KkcZtzBcTgAUUtapy0OI=
 github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI=
github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= +github.com/gofrs/uuid v3.3.0+incompatible h1:8K4tyRfvU1CYPgJsveYFQMhpFd/wXNM7iK6rR7UHz84= +github.com/gofrs/uuid v3.3.0+incompatible/go.mod h1:b2aQJv3Z4Fp6yNu3cdSllBxTCLRxnplIgP/c0N/04lM= github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q= github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q= +github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek= github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps= github.com/google/btree v1.1.3 h1:CVpQJjYgC4VbzxeGVHfvZrv1ctoYCAI8vbl07Fcxlyg= @@ -79,6 +84,7 @@ github.com/gorilla/websocket v1.5.0 h1:PPwGk2jz7EePpoHN/+ClbZu8SPxiqlu12wZP/3sWm github.com/gorilla/websocket v1.5.0/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 h1:bkypFPDjIYGfCYD5mRBvpqxfYX1YCS1PXdKYWi8FsN0= github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0/go.mod h1:P+Lt/0by1T8bfcF3z737NnSbmxQAppXMRziHUxPOC8k= +github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU= github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= github.com/josharian/intern v1.0.0 h1:vlS4z54oSdjm0bgjRigI+G1HpF+tI+9rE5LLzOg8HmY= @@ -89,8 +95,11 @@ github.com/kisielk/errcheck v1.5.0/go.mod h1:pFxgyoBC7bSaBwPgfKdkLd5X25qrDl4LWUI github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck= github.com/klauspost/compress v1.17.11 h1:In6xLpyWOi1+C7tXUUWv2ot1QvBjxevKAaI6IXrJmUc= github.com/klauspost/compress v1.17.11/go.mod h1:pMDklpSncoRMuLFrf1W9Ss9KT+0rH90U12bZKk7uwG0= +github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo= 
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE= github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= +github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ= +github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI= github.com/kr/text v0.2.0 h1:5Nx0Ya0ZqY2ygV366QzturHI13Jq95ApcVaJBhpS+AY= github.com/kr/text v0.2.0/go.mod h1:eLer722TekiGuMkidMxC/pM04lWEeraHUUmBw8l2grE= github.com/kubernetes-csi/external-snapshotter/client/v8 v8.2.0 h1:Q3jQ1NkFqv5o+F8dMmHd8SfEmlcwNeo1immFApntEwE= @@ -101,6 +110,8 @@ github.com/lib/pq v1.10.9 h1:YXG7RB+JIjhP29X+OtkiDnYaXQwpS4JEWq7dtCCRUEw= github.com/lib/pq v1.10.9/go.mod h1:AlVN5x4E4T544tWzH6hKfbfQvm3HdbOxrmggDNAPY9o= github.com/mailru/easyjson v0.9.0 h1:PrnmzHw7262yW8sTBwxi1PdJA3Iw/EKBa8psRf7d9a4= github.com/mailru/easyjson v0.9.0/go.mod h1:1+xMtQp2MRNVL/V1bOzuP3aP8VNwRW55fQUto+XFtTU= +github.com/microsoft/ApplicationInsights-Go v0.4.4 h1:G4+H9WNs6ygSCe6sUyxRc2U81TI5Es90b2t/MwX5KqY= +github.com/microsoft/ApplicationInsights-Go v0.4.4/go.mod h1:fKRUseBqkw6bDiXTs3ESTiU/4YTIHsQS4W3fP2ieF4U= github.com/moby/spdystream v0.5.0 h1:7r0J1Si3QO/kjRitvSLVVFUjxMEb/YLj6S9FF62JBCU= github.com/moby/spdystream v0.5.0/go.mod h1:xBAYlnt/ay+11ShkdFKNAG7LsyK/tmNBVvVOwrfMgdI= github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q= @@ -112,8 +123,11 @@ github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ= github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f h1:y5//uYreIhSUg3J1GEMiLbxo1LJaP8RfCpH6pymGZus= github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f/go.mod h1:ZdcZmHo+o7JKHSa8/e818NopupXU1YMK5fe1lsApnBw= +github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= +github.com/onsi/ginkgo 
v1.8.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= github.com/onsi/ginkgo/v2 v2.22.2 h1:/3X8Panh8/WwhU/3Ssa6rCKqPLuAkVY2I0RoyDLySlU= github.com/onsi/ginkgo/v2 v2.22.2/go.mod h1:oeMosUL+8LtarXBHu/c0bx2D/K9zyQ6uX3cTyztHwsk= +github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY= github.com/onsi/gomega v1.36.2 h1:koNYke6TVk6ZmnyHrCXba/T/MoLBXFjeC1PtvYgw0A8= github.com/onsi/gomega v1.36.2/go.mod h1:DdwyADRjrc825LhMEkD76cHR5+pUnjhUN8GlHlRPHzY= github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4= @@ -149,10 +163,9 @@ github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UV github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU= github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4= -github.com/stretchr/testify v1.10.0 h1:Xv5erBjTwe/5IxqUQTdXv5kgmIvbHo3QQyRwhJsOfJA= -github.com/stretchr/testify v1.10.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY= github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U= +github.com/tedsuo/ifrit v0.0.0-20180802180643-bea94bb476cc/go.mod h1:eyZnKCc955uh98WQvzOm0dgAeLnf2O0Rz0LPoC5ze+0= github.com/thoas/go-funk v0.9.3 h1:7+nAEx3kn5ZJcnDm2Bh23N2yOtweO14bi//dvRtgLpw= github.com/thoas/go-funk v0.9.3/go.mod h1:+IWnUfUmFO1+WVYQWQtIJHeRRdaIyyYglZN7xzUPe4Q= github.com/x448/float16 v0.8.4 h1:qLwI1I70+NjRFUR3zs1JPUCgaCXSh3SW62uAKT1mSBM= @@ -192,6 +205,7 @@ golang.org/x/exp v0.0.0-20241004190924-225e2abe05e6 h1:1wqE9dj9NpSm04INVsJhhEUzh golang.org/x/exp v0.0.0-20241004190924-225e2abe05e6/go.mod h1:NQtJDoLvd6faHhE7m4T/1IY708gDefGGjR/iUW8yQQ8= golang.org/x/mod v0.2.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= golang.org/x/mod v0.3.0/go.mod 
h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= +golang.org/x/net v0.0.0-20180906233101-161cd47e91fd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= @@ -200,11 +214,13 @@ golang.org/x/net v0.38.0 h1:vRMAPTMaeGqVhG5QyLJHqNDwecKTomGeqbnfZyKlBI8= golang.org/x/net v0.38.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8= golang.org/x/oauth2 v0.27.0 h1:da9Vo7/tDv5RH/7nZDz1eMGS/q1Vv1N/7FCrBhI9I3M= golang.org/x/oauth2 v0.27.0/go.mod h1:onh5ek6nERTohokkhCD/y2cV4Do3fxFHFuAejCkRWT8= +golang.org/x/sync v0.0.0-20180314180146-1d60e4601c6f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.12.0 h1:MHc5BpPuC30uJk597Ri8TV3CNZcTLu6B6z4lJy+g6Jw= golang.org/x/sync v0.12.0/go.mod h1:1dzgHSNfp02xaA81J2MS99Qcpr2w7fw1gpm99rleRqA= +golang.org/x/sys v0.0.0-20180909124046-d0be0721c37e/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= @@ -239,12 +255,16 @@ google.golang.org/grpc v1.70.0/go.mod h1:ofIJqVKDXx/JiXrwr2IG4/zwdH9txy3IlF40Rmc google.golang.org/protobuf v1.36.5 
h1:tPhr+woSbjfYvY6/GPufUoYizxw1cF/yFoxJ2fmpwlM= google.golang.org/protobuf v1.36.5/go.mod h1:9fA7Ob0pmnwhb644+1+CVWFRbNajQ6iRojtC/QF5bRE= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= +gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= gopkg.in/evanphx/json-patch.v4 v4.12.0 h1:n6jtcsulIzXPJaxegRbvFNNrZDjbij7ny3gmSPG+6V4= gopkg.in/evanphx/json-patch.v4 v4.12.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M= +gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys= gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc= gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw= +gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw= +gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/operator/src/internal/controller/backup_controller.go b/operator/src/internal/controller/backup_controller.go index 748e630c..1eb7efbf 100644 --- a/operator/src/internal/controller/backup_controller.go +++ b/operator/src/internal/controller/backup_controller.go @@ -49,6 +49,8 @@ func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctr // Delete the Backup resource if it has expired if backup.Status.IsExpired() { r.Recorder.Event(backup, "Normal", "BackupExpired", "Backup has expired and will be deleted") + // Track backup deletion 
telemetry before deleting + r.trackBackupDeleted(ctx, backup, "expired") if err := r.Delete(ctx, backup); err != nil { r.Recorder.Event(backup, "Warning", "BackupDeleteFailed", "Failed to delete expired Backup: "+err.Error()) return ctrl.Result{}, err @@ -194,6 +196,9 @@ func (r *BackupReconciler) createCNPGBackup(ctx context.Context, backup *dbprevi r.Recorder.Event(backup, "Normal", "BackupInitialized", "Successfully initialized backup") + // Track backup creation telemetry + r.trackBackupCreated(ctx, backup, cluster, "on-demand") + // Requeue to check status return ctrl.Result{RequeueAfter: 5 * time.Second}, nil } @@ -265,6 +270,66 @@ func (r *BackupReconciler) SetBackupPhaseSkipped(ctx context.Context, backup *db return ctrl.Result{RequeueAfter: requeueAfter}, nil } +// trackBackupCreated tracks backup creation telemetry. +func (r *BackupReconciler) trackBackupCreated(ctx context.Context, backup *dbpreview.Backup, cluster *dbpreview.DocumentDB, backupType string) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + // Get or create backup ID + backupID, err := r.TelemetryMgr.GUIDs.GetOrCreateBackupID(ctx, backup) + if err != nil { + log.FromContext(ctx).V(1).Info("Failed to get backup telemetry ID", "error", err) + return + } + + clusterID := r.TelemetryMgr.GUIDs.GetClusterID(cluster) + + // Determine if this is from primary cluster + replicationContext, _ := util.GetReplicationContext(ctx, r.Client, *cluster) + isPrimary := replicationContext == nil || replicationContext.IsPrimary() + + retentionDays := 30 // default + if cluster.Spec.Backup != nil && cluster.Spec.Backup.RetentionDays > 0 { + retentionDays = cluster.Spec.Backup.RetentionDays + } + + r.TelemetryMgr.Events.TrackBackupCreated(telemetry.BackupCreatedEvent{ + BackupID: backupID, + ClusterID: clusterID, + NamespaceHash: telemetry.HashNamespace(backup.Namespace), + BackupType: backupType, + BackupMethod: "VolumeSnapshot", + BackupPhase: "starting", + RetentionDays: 
retentionDays, + CloudProvider: telemetry.MapCloudProviderToString(cluster.Spec.Environment), + IsPrimaryCluster: isPrimary, + }) +} + +// trackBackupDeleted tracks backup deletion telemetry. +func (r *BackupReconciler) trackBackupDeleted(ctx context.Context, backup *dbpreview.Backup, reason string) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + backupID := r.TelemetryMgr.GUIDs.GetBackupID(backup) + if backupID == "" { + return + } + + ageDays := 0 + if backup.CreationTimestamp.Time.Year() > 1 { + ageDays = int(time.Since(backup.CreationTimestamp.Time).Hours() / 24) + } + + r.TelemetryMgr.Events.TrackBackupDeleted(telemetry.BackupDeletedEvent{ + BackupID: backupID, + DeletionReason: reason, + BackupAgeDays: ageDays, + }) +} + // SetupWithManager sets up the controller with the Manager. func (r *BackupReconciler) SetupWithManager(mgr ctrl.Manager) error { // Register VolumeSnapshotClass with the scheme diff --git a/operator/src/internal/controller/documentdb_controller.go b/operator/src/internal/controller/documentdb_controller.go index 3c919456..bdea1de2 100644 --- a/operator/src/internal/controller/documentdb_controller.go +++ b/operator/src/internal/controller/documentdb_controller.go @@ -147,11 +147,14 @@ func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) if err := r.Client.Get(ctx, types.NamespacedName{Name: desiredCnpgCluster.Name, Namespace: req.Namespace}, currentCnpgCluster); err != nil { if errors.IsNotFound(err) { + clusterCreateStart := time.Now() if err := r.Client.Create(ctx, desiredCnpgCluster); err != nil { logger.Error(err, "Failed to create CNPG Cluster") return ctrl.Result{RequeueAfter: RequeueAfterShort}, nil } logger.Info("CNPG Cluster created successfully", "Cluster.Name", desiredCnpgCluster.Name, "Namespace", desiredCnpgCluster.Namespace) + // Track cluster creation telemetry + r.TrackClusterCreated(ctx, documentdb, time.Since(clusterCreateStart).Seconds()) return 
ctrl.Result{RequeueAfter: RequeueAfterLong}, nil } logger.Error(err, "Failed to get CNPG Cluster") @@ -159,11 +162,14 @@ func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) } // Check if anything has changed in the generated cnpg spec + updateStart := time.Now() err, requeueTime := r.TryUpdateCluster(ctx, currentCnpgCluster, desiredCnpgCluster, documentdb, replicationContext) if err != nil { logger.Error(err, "Failed to update CNPG Cluster") } if requeueTime > 0 { + // Track cluster update if something changed + r.trackClusterUpdated(ctx, documentdb, "configuration", time.Since(updateStart).Seconds()) return ctrl.Result{RequeueAfter: requeueTime}, nil } @@ -521,6 +527,25 @@ func (r *DocumentDBReconciler) trackReconcileDuration(ctx context.Context, resou }) } +// trackClusterUpdated tracks when a cluster is updated. +func (r *DocumentDBReconciler) trackClusterUpdated(ctx context.Context, documentdb *dbpreview.DocumentDB, updateType string, durationSeconds float64) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + clusterID := r.TelemetryMgr.GUIDs.GetClusterID(documentdb) + if clusterID == "" { + return + } + + r.TelemetryMgr.Events.TrackClusterUpdated(telemetry.ClusterUpdatedEvent{ + ClusterID: clusterID, + NamespaceHash: telemetry.HashNamespace(documentdb.Namespace), + UpdateType: updateType, + UpdateDurationSeconds: durationSeconds, + }) +} + // TrackClusterCreated tracks when a new cluster is created. 
func (r *DocumentDBReconciler) TrackClusterCreated(ctx context.Context, documentdb *dbpreview.DocumentDB, durationSeconds float64) { if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { diff --git a/operator/src/internal/controller/scheduledbackup_controller.go b/operator/src/internal/controller/scheduledbackup_controller.go index bb8cea9b..a091d226 100644 --- a/operator/src/internal/controller/scheduledbackup_controller.go +++ b/operator/src/internal/controller/scheduledbackup_controller.go @@ -84,6 +84,9 @@ func (r *ScheduledBackupReconciler) Reconcile(ctx context.Context, req ctrl.Requ return ctrl.Result{}, err } + // Track scheduled backup execution telemetry + r.trackScheduledBackupTriggered(ctx, scheduledBackup) + scheduledBackup.Status.LastScheduledTime = &metav1.Time{Time: now} // Calculate next run time @@ -141,6 +144,53 @@ func (r *ScheduledBackupReconciler) ensureOwnerReference(ctx context.Context, sc return nil } +// trackScheduledBackupCreated tracks scheduled backup creation telemetry. 
+func (r *ScheduledBackupReconciler) trackScheduledBackupCreated(ctx context.Context, scheduledBackup *dbpreview.ScheduledBackup, cluster *dbpreview.DocumentDB) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + // Get or create scheduled backup ID + scheduledBackupID, err := r.TelemetryMgr.GUIDs.GetOrCreateScheduledBackupID(ctx, scheduledBackup) + if err != nil { + log.FromContext(ctx).V(1).Info("Failed to get scheduled backup telemetry ID", "error", err) + return + } + + clusterID := r.TelemetryMgr.GUIDs.GetClusterID(cluster) + + retentionDays := 30 // default + if cluster.Spec.Backup != nil && cluster.Spec.Backup.RetentionDays > 0 { + retentionDays = cluster.Spec.Backup.RetentionDays + } + + r.TelemetryMgr.Events.TrackScheduledBackupCreated(telemetry.ScheduledBackupCreatedEvent{ + ScheduledBackupID: scheduledBackupID, + ClusterID: clusterID, + ScheduleFrequency: string(telemetry.CategorizeScheduleFrequency(scheduledBackup.Spec.Schedule)), + RetentionDays: retentionDays, + }) +} + +// trackScheduledBackupTriggered tracks when a scheduled backup actually triggers. +func (r *ScheduledBackupReconciler) trackScheduledBackupTriggered(ctx context.Context, scheduledBackup *dbpreview.ScheduledBackup) { + if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { + return + } + + // Fetch the cluster to get telemetry IDs + cluster := &dbpreview.DocumentDB{} + clusterKey := client.ObjectKey{ + Name: scheduledBackup.Spec.Cluster.Name, + Namespace: scheduledBackup.Namespace, + } + if err := r.Get(ctx, clusterKey, cluster); err != nil { + return + } + + r.trackScheduledBackupCreated(ctx, scheduledBackup, cluster) +} + // SetupWithManager sets up the controller with the Manager. 
func (r *ScheduledBackupReconciler) SetupWithManager(mgr ctrl.Manager) error { // Register field index for spec.cluster so we can query Backups by cluster name diff --git a/operator/src/internal/telemetry/client.go b/operator/src/internal/telemetry/client.go index a69f0241..d449e439 100644 --- a/operator/src/internal/telemetry/client.go +++ b/operator/src/internal/telemetry/client.go @@ -4,101 +4,35 @@ package telemetry import ( - "bytes" - "context" - "encoding/json" "fmt" - "net/http" "os" - "sync" + "strings" "time" "github.com/go-logr/logr" + "github.com/microsoft/ApplicationInsights-Go/appinsights" ) const ( - // DefaultBatchInterval is the default interval for batching telemetry events. - DefaultBatchInterval = 30 * time.Second - // DefaultMaxBatchSize is the maximum number of events to batch before sending. - DefaultMaxBatchSize = 100 - // DefaultMaxRetries is the maximum number of retries for failed telemetry submissions. - DefaultMaxRetries = 3 - // DefaultRetryBaseDelay is the base delay for exponential backoff retries. - DefaultRetryBaseDelay = 1 * time.Second - // DefaultBufferSize is the size of the local buffer for events when AppInsights is unreachable. - DefaultBufferSize = 1000 - // EnvAppInsightsKey is the environment variable for the Application Insights instrumentation key. EnvAppInsightsKey = "APPINSIGHTS_INSTRUMENTATIONKEY" // EnvAppInsightsConnectionString is the environment variable for the Application Insights connection string. EnvAppInsightsConnectionString = "APPLICATIONINSIGHTS_CONNECTION_STRING" // EnvTelemetryEnabled is the environment variable to enable/disable telemetry. EnvTelemetryEnabled = "DOCUMENTDB_TELEMETRY_ENABLED" - - // AppInsightsTrackEndpoint is the Application Insights ingestion endpoint. - AppInsightsTrackEndpoint = "https://dc.services.visualstudio.com/v2/track" ) -// TelemetryClient handles sending telemetry to Application Insights. 
+// TelemetryClient handles sending telemetry to Application Insights using the official SDK. type TelemetryClient struct { - instrumentationKey string - ingestionEndpoint string - enabled bool - operatorContext *OperatorContext - logger logr.Logger - - // Batching - eventBuffer []telemetryEnvelope - bufferMutex sync.Mutex - batchInterval time.Duration - maxBatchSize int - - // Retry and buffering - maxRetries int - retryBaseDelay time.Duration - localBuffer []telemetryEnvelope - localMutex sync.Mutex - maxBufferSize int - - // HTTP client - httpClient *http.Client - - // Shutdown - stopChan chan struct{} - wg sync.WaitGroup -} - -// telemetryEnvelope wraps events for Application Insights API. -type telemetryEnvelope struct { - Name string `json:"name"` - Time string `json:"time"` - IKey string `json:"iKey"` - Tags map[string]string `json:"tags"` - Data telemetryData `json:"data"` -} - -type telemetryData struct { - BaseType string `json:"baseType"` - BaseData map[string]interface{} `json:"baseData"` + client appinsights.TelemetryClient + enabled bool + operatorContext *OperatorContext + logger logr.Logger } // ClientOption configures the TelemetryClient. type ClientOption func(*TelemetryClient) -// WithBatchInterval sets the batch interval for sending telemetry. -func WithBatchInterval(interval time.Duration) ClientOption { - return func(c *TelemetryClient) { - c.batchInterval = interval - } -} - -// WithMaxBatchSize sets the maximum batch size. -func WithMaxBatchSize(size int) ClientOption { - return func(c *TelemetryClient) { - c.maxBatchSize = size - } -} - // WithLogger sets the logger for the telemetry client. func WithLogger(logger logr.Logger) ClientOption { return func(c *TelemetryClient) { @@ -106,79 +40,105 @@ func WithLogger(logger logr.Logger) ClientOption { } } -// NewTelemetryClient creates a new TelemetryClient. +// NewTelemetryClient creates a new TelemetryClient using the official Application Insights SDK. 
func NewTelemetryClient(ctx *OperatorContext, opts ...ClientOption) *TelemetryClient { - client := &TelemetryClient{ - operatorContext: ctx, - enabled: true, - batchInterval: DefaultBatchInterval, - maxBatchSize: DefaultMaxBatchSize, - maxRetries: DefaultMaxRetries, - retryBaseDelay: DefaultRetryBaseDelay, - maxBufferSize: DefaultBufferSize, - eventBuffer: make([]telemetryEnvelope, 0), - localBuffer: make([]telemetryEnvelope, 0), - ingestionEndpoint: AppInsightsTrackEndpoint, - httpClient: &http.Client{ - Timeout: 30 * time.Second, - }, - stopChan: make(chan struct{}), + tc := &TelemetryClient{ + operatorContext: ctx, + enabled: true, } // Apply options for _, opt := range opts { - opt(client) + opt(tc) } // Check if telemetry is enabled if enabled := os.Getenv(EnvTelemetryEnabled); enabled == "false" { - client.enabled = false - if client.logger.GetSink() != nil { - client.logger.Info("Telemetry collection is disabled via environment variable") + tc.enabled = false + if tc.logger.GetSink() != nil { + tc.logger.Info("Telemetry collection is disabled via environment variable") } - return client + return tc } // Get instrumentation key from environment - client.instrumentationKey = os.Getenv(EnvAppInsightsKey) - if client.instrumentationKey == "" { + instrumentationKey := os.Getenv(EnvAppInsightsKey) + if instrumentationKey == "" { // Try connection string connStr := os.Getenv(EnvAppInsightsConnectionString) - client.instrumentationKey = parseInstrumentationKeyFromConnectionString(connStr) + instrumentationKey = parseInstrumentationKeyFromConnectionString(connStr) } - if client.instrumentationKey == "" { - client.enabled = false - if client.logger.GetSink() != nil { - client.logger.Info("No Application Insights instrumentation key found, telemetry disabled") + if instrumentationKey == "" { + tc.enabled = false + if tc.logger.GetSink() != nil { + tc.logger.Info("No Application Insights instrumentation key found, telemetry disabled") } - return client + return tc } - 
return client -} + // Create telemetry configuration + telemetryConfig := appinsights.NewTelemetryConfiguration(instrumentationKey) -// Start begins the background batch processing goroutine. -func (c *TelemetryClient) Start() { - if !c.enabled { - return + // Configure batching - send every 30 seconds or when batch reaches 100 items + telemetryConfig.MaxBatchSize = 100 + telemetryConfig.MaxBatchInterval = 30 * time.Second + + // Check for custom endpoint from connection string + connStr := os.Getenv(EnvAppInsightsConnectionString) + if endpoint := parseIngestionEndpointFromConnectionString(connStr); endpoint != "" { + telemetryConfig.EndpointUrl = strings.TrimSuffix(endpoint, "/") + "/v2/track" + } + + // Create the client + tc.client = appinsights.NewTelemetryClientFromConfig(telemetryConfig) + + // Set common context tags + tc.client.Context().Tags.Cloud().SetRole("documentdb-operator") + tc.client.Context().Tags.Cloud().SetRoleInstance(ctx.OperatorNamespaceHash) + tc.client.Context().Tags.Application().SetVer(ctx.OperatorVersion) + + // Set common properties that will be added to all telemetry + tc.client.Context().CommonProperties["kubernetes_distribution"] = string(ctx.KubernetesDistribution) + tc.client.Context().CommonProperties["kubernetes_version"] = ctx.KubernetesVersion + tc.client.Context().CommonProperties["operator_version"] = ctx.OperatorVersion + if ctx.Region != "" { + tc.client.Context().CommonProperties["region"] = ctx.Region + } + + // Enable diagnostics logging if logger is available + if tc.logger.GetSink() != nil { + appinsights.NewDiagnosticsMessageListener(func(msg string) error { + tc.logger.V(1).Info("Application Insights diagnostic", "message", msg) + return nil + }) } - c.wg.Add(1) - go c.batchProcessor() + return tc +} + +// Start begins the telemetry client (no-op for SDK-based client as it handles this internally). 
+func (c *TelemetryClient) Start() { + // The official SDK handles background processing internally } // Stop gracefully stops the telemetry client and flushes remaining events. func (c *TelemetryClient) Stop() { - if !c.enabled { + if !c.enabled || c.client == nil { return } - close(c.stopChan) - c.wg.Wait() - - // Flush any remaining events - c.flush() + // Close the channel with a timeout for retries + select { + case <-c.client.Channel().Close(10 * time.Second): + if c.logger.GetSink() != nil { + c.logger.Info("Telemetry channel closed successfully") + } + case <-time.After(30 * time.Second): + if c.logger.GetSink() != nil { + c.logger.Info("Telemetry channel close timed out") + } + } } // IsEnabled returns whether telemetry collection is enabled. @@ -188,243 +148,81 @@ func (c *TelemetryClient) IsEnabled() bool { // TrackEvent sends a custom event to Application Insights. func (c *TelemetryClient) TrackEvent(eventName string, properties map[string]interface{}) { - if !c.enabled { + if !c.enabled || c.client == nil { return } - envelope := c.createEnvelope("Microsoft.ApplicationInsights.Event", map[string]interface{}{ - "name": eventName, - "properties": c.addContextProperties(properties), - }) - - c.bufferMutex.Lock() - c.eventBuffer = append(c.eventBuffer, envelope) - shouldFlush := len(c.eventBuffer) >= c.maxBatchSize - c.bufferMutex.Unlock() + event := appinsights.NewEventTelemetry(eventName) - if shouldFlush { - go c.flush() + // Add properties + for k, v := range properties { + event.Properties[k] = fmt.Sprintf("%v", v) } + + c.client.Track(event) } // TrackMetric sends a metric to Application Insights. 
func (c *TelemetryClient) TrackMetric(metricName string, value float64, properties map[string]interface{}) { - if !c.enabled { + if !c.enabled || c.client == nil { return } - envelope := c.createEnvelope("Microsoft.ApplicationInsights.Metric", map[string]interface{}{ - "metrics": []map[string]interface{}{ - { - "name": metricName, - "value": value, - }, - }, - "properties": c.addContextProperties(properties), - }) - - c.bufferMutex.Lock() - c.eventBuffer = append(c.eventBuffer, envelope) - c.bufferMutex.Unlock() + metric := appinsights.NewMetricTelemetry(metricName, value) + + // Add properties + for k, v := range properties { + metric.Properties[k] = fmt.Sprintf("%v", v) + } + + c.client.Track(metric) } // TrackException sends an exception/error to Application Insights. func (c *TelemetryClient) TrackException(err error, properties map[string]interface{}) { - if !c.enabled { + if !c.enabled || c.client == nil { return } // Sanitize error message to remove potential PII sanitizedMessage := sanitizeErrorMessage(err.Error()) - envelope := c.createEnvelope("Microsoft.ApplicationInsights.Exception", map[string]interface{}{ - "exceptions": []map[string]interface{}{ - { - "message": sanitizedMessage, - }, - }, - "properties": c.addContextProperties(properties), - }) - - c.bufferMutex.Lock() - c.eventBuffer = append(c.eventBuffer, envelope) - c.bufferMutex.Unlock() -} - -// createEnvelope creates a telemetry envelope for Application Insights. 
-func (c *TelemetryClient) createEnvelope(baseType string, baseData map[string]interface{}) telemetryEnvelope { - return telemetryEnvelope{ - Name: baseType, - Time: time.Now().UTC().Format(time.RFC3339Nano), - IKey: c.instrumentationKey, - Tags: map[string]string{ - "ai.cloud.role": "documentdb-operator", - "ai.cloud.roleInstance": c.operatorContext.OperatorNamespaceHash, - "ai.application.ver": c.operatorContext.OperatorVersion, - }, - Data: telemetryData{ - BaseType: baseType, - BaseData: baseData, - }, - } -} - -// addContextProperties adds operator context to event properties. -func (c *TelemetryClient) addContextProperties(properties map[string]interface{}) map[string]interface{} { - if properties == nil { - properties = make(map[string]interface{}) - } - - // Add operator context (these are added to all events as per spec) - properties["kubernetes_distribution"] = string(c.operatorContext.KubernetesDistribution) - properties["kubernetes_version"] = c.operatorContext.KubernetesVersion - properties["operator_version"] = c.operatorContext.OperatorVersion - - if c.operatorContext.Region != "" { - properties["region"] = c.operatorContext.Region - } - - return properties -} - -// batchProcessor runs in the background to periodically send batched events. -func (c *TelemetryClient) batchProcessor() { - defer c.wg.Done() - - ticker := time.NewTicker(c.batchInterval) - defer ticker.Stop() - - for { - select { - case <-ticker.C: - c.flush() - case <-c.stopChan: - return - } - } -} - -// flush sends all buffered events to Application Insights. 
-func (c *TelemetryClient) flush() { - c.bufferMutex.Lock() - if len(c.eventBuffer) == 0 { - c.bufferMutex.Unlock() - // Also try to send locally buffered events - c.flushLocalBuffer() - return - } - - events := c.eventBuffer - c.eventBuffer = make([]telemetryEnvelope, 0) - c.bufferMutex.Unlock() - - // Send events with retry - if err := c.sendWithRetry(events); err != nil { - // Store in local buffer if send fails - c.localMutex.Lock() - c.localBuffer = append(c.localBuffer, events...) - // Trim buffer if it exceeds max size - if len(c.localBuffer) > c.maxBufferSize { - c.localBuffer = c.localBuffer[len(c.localBuffer)-c.maxBufferSize:] - } - c.localMutex.Unlock() - - if c.logger.GetSink() != nil { - c.logger.Error(err, "Failed to send telemetry, buffered locally", "eventCount", len(events)) - } - } -} - -// flushLocalBuffer attempts to send locally buffered events. -func (c *TelemetryClient) flushLocalBuffer() { - c.localMutex.Lock() - if len(c.localBuffer) == 0 { - c.localMutex.Unlock() - return - } - - events := c.localBuffer - c.localBuffer = make([]telemetryEnvelope, 0) - c.localMutex.Unlock() + exception := appinsights.NewExceptionTelemetry(sanitizedMessage) - if err := c.sendWithRetry(events); err != nil { - // Put back in buffer - c.localMutex.Lock() - c.localBuffer = append(events, c.localBuffer...) - if len(c.localBuffer) > c.maxBufferSize { - c.localBuffer = c.localBuffer[:c.maxBufferSize] - } - c.localMutex.Unlock() - } -} - -// sendWithRetry sends events to Application Insights with exponential backoff retry. -func (c *TelemetryClient) sendWithRetry(events []telemetryEnvelope) error { - var lastErr error - - for attempt := 0; attempt < c.maxRetries; attempt++ { - if attempt > 0 { - delay := c.retryBaseDelay * time.Duration(1<= 400 { - return fmt.Errorf("telemetry submission failed with status: %d", resp.StatusCode) - } - - return nil + return "" } -// parseInstrumentationKeyFromConnectionString extracts the instrumentation key from a connection string. 
-func parseInstrumentationKeyFromConnectionString(connStr string) string {
+// parseIngestionEndpointFromConnectionString extracts the ingestion endpoint from a connection string.
+func parseIngestionEndpointFromConnectionString(connStr string) string {
 	if connStr == "" {
 		return ""
 	}
 
 	// Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;...
-	for _, part := range bytes.Split([]byte(connStr), []byte(";")) {
-		if bytes.HasPrefix(part, []byte("InstrumentationKey=")) {
-			return string(bytes.TrimPrefix(part, []byte("InstrumentationKey=")))
+	for _, part := range strings.Split(connStr, ";") {
+		if strings.HasPrefix(part, "IngestionEndpoint=") {
+			return strings.TrimPrefix(part, "IngestionEndpoint=")
 		}
 	}
 
From 615a3da12ab6116e9d3e5f2f419df523246925ad Mon Sep 17 00:00:00 2001
From: Ritvik Jayaswal
Date: Wed, 25 Feb 2026 18:03:33 -0500
Subject: [PATCH 5/7] Address Copilot PR review comments for telemetry
 implementation

- Fix backup type detection: check labels to distinguish scheduled vs on-demand backups
- Fix retention days: check backup-level override before using cluster default
- Fix reconciliation duration tracking: use named return values to avoid mutable err variable issue
- Fix sanitizeError: return categorized error types instead of raw messages (PII concern)
- Fix trackReconcileError: omit ResourceID to avoid PII (user-provided names)
- Fix CategorizeScheduleFrequency: correctly classify step expressions (*/5) as custom
- Fix connection string parsing: trim whitespace to handle copy-paste errors
- Fix telemetry-configuration.md: use correct Helm values field names
- Fix appinsights-metrics.md: use namespace_hash instead of namespace
- Add github-secrets-telemetry-setup.md: documentation for CI/CD telemetry setup
- Clean up go.mod/go.sum: remove duplicate entries from merge conflict resolution
---
 docs/designs/appinsights-metrics.md           |   2 +-
 .../designs/github-secrets-telemetry-setup.md | 223 ++++++++++++++++++
 docs/designs/telemetry-configuration.md       |   6 +-
 operator/src/go.mod                           |  21 +-
 operator/src/go.sum                           | 120 ++--------
 .../internal/controller/backup_controller.go  |  13 +-
 .../controller/documentdb_controller.go       |  71 ++++--
 operator/src/internal/telemetry/client.go     |   4 +
 operator/src/internal/telemetry/utils.go      |  25 +-
 9 files changed, 333 insertions(+), 152 deletions(-)
 create mode 100644 docs/designs/github-secrets-telemetry-setup.md

diff --git a/docs/designs/appinsights-metrics.md b/docs/designs/appinsights-metrics.md
index 65ddbe0d..66872bd1 100644
--- a/docs/designs/appinsights-metrics.md
+++ b/docs/designs/appinsights-metrics.md
@@ -21,7 +21,7 @@ This document specifies all telemetry data points to be collected by Application
 - **Metric**: `operator.health.status`
 - **Value**: `1` (healthy) or `0` (unhealthy)
 - **Frequency**: Every 60 seconds
-- **Dimensions**: `pod_name`, `namespace`
+- **Dimensions**: `pod_name`, `namespace_hash`
 
 ---
 
diff --git a/docs/designs/github-secrets-telemetry-setup.md b/docs/designs/github-secrets-telemetry-setup.md
new file mode 100644
index 00000000..827f6d60
--- /dev/null
+++ b/docs/designs/github-secrets-telemetry-setup.md
@@ -0,0 +1,223 @@
+# GitHub Secrets Setup for Application Insights Telemetry
+
+This document describes how to configure GitHub secrets for Application Insights telemetry collection in the DocumentDB Kubernetes Operator CI/CD pipeline.
+
+## Overview
+
+The DocumentDB Operator uses Application Insights to collect anonymous telemetry data about operator usage patterns. This helps the team understand:
+
+- How many people use the operator
+- Which cloud providers are most common (AKS, EKS, GKE)
+- Common cluster configurations
+- Error patterns and operational issues
+
+To enable telemetry in CI/CD workflows, you need to configure a GitHub secret containing the Application Insights connection string.
+
+## Prerequisites
+
+1. An Azure Application Insights resource
+2. Admin access to the GitHub repository (to create secrets)
+
+## Step 1: Create Application Insights Resource
+
+If you don't have an Application Insights resource, create one in Azure:
+
+```bash
+# Create a resource group (if needed)
+az group create --name documentdb-telemetry-rg --location eastus2
+
+# Create Application Insights resource
+az monitor app-insights component create \
+  --app documentdb-operator-telemetry \
+  --location eastus2 \
+  --resource-group documentdb-telemetry-rg \
+  --kind web \
+  --application-type web
+```
+
+## Step 2: Get the Connection String
+
+Retrieve the connection string from your Application Insights resource:
+
+```bash
+az monitor app-insights component show \
+  --app documentdb-operator-telemetry \
+  --resource-group documentdb-telemetry-rg \
+  --query connectionString \
+  --output tsv
+```
+
+The connection string will look like:
+```
+InstrumentationKey=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx;IngestionEndpoint=https://eastus2-2.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus2.livediagnostics.monitor.azure.com/;ApplicationId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
+```
+
+## Step 3: Create GitHub Secret
+
+### Via GitHub UI
+
+1. Navigate to your repository on GitHub
+2. Go to **Settings** → **Secrets and variables** → **Actions**
+3. Click **New repository secret**
+4. Set the following:
+   - **Name**: `APPINSIGHTS_CONNECTION_STRING`
+   - **Secret**: Paste the connection string from Step 2
+5. Click **Add secret**
+
+### Via GitHub CLI
+
+```bash
+# Authenticate with GitHub CLI (if not already authenticated)
+gh auth login
+
+# Set the secret
+gh secret set APPINSIGHTS_CONNECTION_STRING --body "InstrumentationKey=xxx;IngestionEndpoint=https://..."
+```
+
+## Step 4: Use the Secret in GitHub Actions
+
+Reference the secret in your GitHub Actions workflow:
+
+```yaml
+# .github/workflows/test-integration.yml
+name: Integration Tests
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Build and Deploy Operator
+        env:
+          APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.APPINSIGHTS_CONNECTION_STRING }}
+        run: |
+          # Build operator image with telemetry enabled
+          make docker-build
+
+          # Deploy with telemetry connection string
+          helm install documentdb-operator ./operator/documentdb-helm-chart \
+            --set telemetry.enabled=true \
+            --set telemetry.connectionString="${APPLICATIONINSIGHTS_CONNECTION_STRING}"
+```
+
+## Step 5: Helm Chart Configuration
+
+The operator Helm chart supports telemetry configuration via these values:
+
+```yaml
+# values.yaml
+telemetry:
+  # Enable/disable telemetry collection
+  enabled: true
+
+  # Option 1: Direct connection string (for CI/CD, injected from secrets)
+  connectionString: ""
+
+  # Option 2: Instrumentation key only
+  instrumentationKey: ""
+
+  # Option 3: Use an existing Kubernetes secret
+  existingSecret: ""
+```
+
+### CI/CD Deployment Example
+
+```bash
+# Deploy with connection string from environment variable
+helm upgrade --install documentdb-operator ./operator/documentdb-helm-chart \
+  --namespace documentdb-operator \
+  --create-namespace \
+  --set telemetry.enabled=true \
+  --set "telemetry.connectionString=${APPLICATIONINSIGHTS_CONNECTION_STRING}"
+```
+
+## Alternative: Using Kubernetes Secrets
+
+For production deployments, you may want to store the connection string in a Kubernetes secret:
+
+```yaml
+# Create a secret with the connection string
+apiVersion: v1
+kind: Secret
+metadata:
+  name: documentdb-telemetry-secret
+  namespace: documentdb-operator
+type: Opaque
+stringData:
+  APPLICATIONINSIGHTS_CONNECTION_STRING: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
+```
+
+Then reference it in Helm:
+
+```yaml
+# values.yaml
+telemetry:
+  enabled: true
+  existingSecret: "documentdb-telemetry-secret"
+```
+
+## Security Considerations
+
+1. **Secret Rotation**: Rotate the Application Insights key periodically
+2. **Scope**: Use repository-level secrets (not organization-level) for better isolation
+3. **Access Control**: Limit who can view/edit repository secrets
+4. **Environment-Specific**: Consider using different App Insights resources for dev/prod
+
+## Verifying Telemetry Collection
+
+After deployment, verify telemetry is being sent:
+
+1. Check operator logs for telemetry transmission:
+   ```bash
+   kubectl logs -n documentdb-operator deployment/documentdb-operator | grep -i "telemetry\|appinsights"
+   ```
+
+2. Query Application Insights:
+   ```kusto
+   // In Azure Portal > Application Insights > Logs
+   customEvents
+   | where timestamp > ago(1h)
+   | where name startswith "documentdb"
+   | summarize count() by name
+   ```
+
+## Troubleshooting
+
+### Telemetry Not Appearing
+
+1. Verify the connection string is correct:
+   ```bash
+   echo $APPLICATIONINSIGHTS_CONNECTION_STRING | grep "InstrumentationKey="
+   ```
+
+2. Check if the operator has the environment variable (note the variable is named `APPLICATIONINSIGHTS_CONNECTION_STRING`, which a plain `grep APPINSIGHTS` would miss):
+   ```bash
+   kubectl exec -n documentdb-operator deployment/documentdb-operator -- env | grep -i "insights"
+   ```
+
+3. Check operator logs for errors:
+   ```bash
+   kubectl logs -n documentdb-operator deployment/documentdb-operator | grep -i error
+   ```
+
+### Ingestion Delays
+
+Application Insights has a batching interval (default: 30 seconds). Events may take up to a few minutes to appear in the portal.
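The verification steps above amount to checking that the connection string splits cleanly into its `InstrumentationKey=` and `IngestionEndpoint=` fields. A minimal Go sketch of that parsing (the `parseField` helper is hypothetical, not the operator's actual code; it trims stray whitespace around each part, the failure mode copy-pasted strings tend to have):

```go
package main

import (
	"fmt"
	"strings"
)

// parseField extracts the value for one key from an Application Insights
// connection string of the form "Key1=val1;Key2=val2;...". Each part is
// whitespace-trimmed so copy-pasted strings with stray spaces still parse.
func parseField(connStr, key string) string {
	for _, part := range strings.Split(connStr, ";") {
		part = strings.TrimSpace(part)
		if strings.HasPrefix(part, key+"=") {
			return strings.TrimPrefix(part, key+"=")
		}
	}
	return ""
}

func main() {
	cs := " InstrumentationKey=abc123 ;IngestionEndpoint=https://eastus2-2.in.applicationinsights.azure.com/"
	fmt.Println(parseField(cs, "InstrumentationKey")) // abc123
	fmt.Println(parseField(cs, "IngestionEndpoint"))
}
```

Returning `""` for a missing key matches the behavior of the parse functions in the patch, so callers can treat an empty result as "field not configured".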
+
+## Data Privacy
+
+The telemetry system is designed with privacy in mind:
+
+- **No PII**: No cluster names, namespaces, IP addresses, or user-provided identifiers
+- **Hashed namespaces**: Namespace names are SHA-256 hashed
+- **GUIDs for correlation**: Auto-generated GUIDs are used instead of resource names
+- **Categorized errors**: Error messages are categorized, not raw strings
+
+See [appinsights-metrics.md](./appinsights-metrics.md) for the complete list of collected telemetry.
diff --git a/docs/designs/telemetry-configuration.md b/docs/designs/telemetry-configuration.md
index 152501a2..d6b03646 100644
--- a/docs/designs/telemetry-configuration.md
+++ b/docs/designs/telemetry-configuration.md
@@ -26,9 +26,11 @@ When installing via Helm, you can configure telemetry in your values.yaml:
 # values.yaml
 telemetry:
   enabled: true
-  appInsightsInstrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
+  instrumentationKey: "YOUR-INSTRUMENTATION-KEY-HERE"
   # Or use connection string:
-  # appInsightsConnectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
+  # connectionString: "InstrumentationKey=xxx;IngestionEndpoint=https://..."
+ # Or use an existing secret containing APPINSIGHTS_INSTRUMENTATIONKEY / APPLICATIONINSIGHTS_CONNECTION_STRING: + # existingSecret: "documentdb-operator-telemetry" ``` ### Kubernetes Secret diff --git a/operator/src/go.mod b/operator/src/go.mod index a9995f46..981c8406 100644 --- a/operator/src/go.mod +++ b/operator/src/go.mod @@ -5,14 +5,12 @@ go 1.25.7 godebug default=go1.23 require ( - github.com/google/uuid v1.6.0 - github.com/microsoft/ApplicationInsights-Go v0.4.4 - github.com/onsi/ginkgo/v2 v2.22.2 - github.com/onsi/gomega v1.36.2 github.com/cert-manager/cert-manager v1.19.3 github.com/cloudnative-pg/cloudnative-pg v1.28.1 github.com/cloudnative-pg/machinery v0.3.3 github.com/go-logr/logr v1.4.3 + github.com/google/uuid v1.6.0 + github.com/microsoft/ApplicationInsights-Go v0.4.4 github.com/onsi/ginkgo/v2 v2.28.1 github.com/onsi/gomega v1.39.1 github.com/stretchr/testify v1.11.1 @@ -30,10 +28,9 @@ require ( ) require ( - cel.dev/expr v0.19.0 // indirect - github.com/antlr4-go/antlr/v4 v4.13.0 // indirect - github.com/asaskevich/govalidator v0.0.0-20230301143203-a9d515a09cc2 // indirect + cel.dev/expr v0.24.0 // indirect github.com/Masterminds/semver/v3 v3.4.0 // indirect + github.com/antlr4-go/antlr/v4 v4.13.1 // indirect github.com/cenkalti/backoff/v5 v5.0.3 // indirect github.com/cloudnative-pg/cnpg-i v0.3.1 // indirect github.com/go-openapi/swag/cmdutils v0.25.4 // indirect @@ -57,8 +54,6 @@ require ( ) require ( - cel.dev/expr v0.24.0 // indirect - github.com/antlr4-go/antlr/v4 v4.13.1 // indirect github.com/beorn7/perks v1.0.1 // indirect github.com/blang/semver/v4 v4.0.0 // indirect github.com/cespare/xxhash/v2 v2.3.0 // indirect @@ -76,18 +71,10 @@ require ( github.com/go-openapi/swag v0.25.4 // indirect github.com/go-task/slim-sprig/v3 v3.0.0 // indirect github.com/google/btree v1.1.3 // indirect - github.com/google/cel-go v0.22.0 // indirect - github.com/google/gnostic-models v0.6.9 // indirect - github.com/google/go-cmp v0.6.0 // indirect - 
github.com/google/gofuzz v1.2.0 // indirect - github.com/google/pprof v0.0.0-20241210010833-40e02aabc2ad // indirect - github.com/gorilla/websocket v1.5.0 // indirect - github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 // indirect github.com/google/cel-go v0.26.0 // indirect github.com/google/gnostic-models v0.7.1 // indirect github.com/google/go-cmp v0.7.0 // indirect github.com/google/pprof v0.0.0-20260115054156-294ebfa9ad83 // indirect - github.com/google/uuid v1.6.0 // indirect github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674 // indirect github.com/grpc-ecosystem/grpc-gateway/v2 v2.27.1 // indirect github.com/inconshreveable/mousetrap v1.1.0 // indirect diff --git a/operator/src/go.sum b/operator/src/go.sum index 81a4e7a2..d5b110cd 100644 --- a/operator/src/go.sum +++ b/operator/src/go.sum @@ -1,11 +1,7 @@ -cel.dev/expr v0.19.0 h1:lXuo+nDhpyJSpWxpPVi5cPUwzKb+dsdOiw6IreM5yt0= -cel.dev/expr v0.19.0/go.mod h1:MrpN08Q+lEBs+bGYdLxxHkZoUSsCp0nSKTs0nTymJgw= -code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c h1:5eeuG0BHx1+DHeT3AP+ISKZ2ht1UjGhm581ljqYpVeQ= -code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c/go.mod h1:QD9Lzhd/ux6eNQVUDVRJX/RKTigpewimNYBi7ivZKY8= -github.com/antlr4-go/antlr/v4 v4.13.0 h1:lxCg3LAv+EUK6t1i0y1V6/SLeUi0eKEKdhQAlS8TVTI= -github.com/antlr4-go/antlr/v4 v4.13.0/go.mod h1:pfChB/xh/Unjila75QW7+VU4TSnWnnk9UTnmpPaOR2g= cel.dev/expr v0.24.0 h1:56OvJKSH3hDGL0ml5uSxZmz3/3Pq4tJ+fb1unVLAFcY= cel.dev/expr v0.24.0/go.mod h1:hLPLo1W4QUmuYdA72RBX06QTs6MXw941piREPl3Yfiw= +code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c h1:5eeuG0BHx1+DHeT3AP+ISKZ2ht1UjGhm581ljqYpVeQ= +code.cloudfoundry.org/clock v0.0.0-20180518195852-02e53af36e6c/go.mod h1:QD9Lzhd/ux6eNQVUDVRJX/RKTigpewimNYBi7ivZKY8= github.com/Masterminds/semver/v3 v3.4.0 h1:Zog+i5UMtVoCU8oKka5P7i9q9HgrJeGzI9SA1Xbatp0= github.com/Masterminds/semver/v3 v3.4.0/go.mod h1:4V+yj/TJE1HU9XfppCwVMZq3I84lprf4nC11bSS5beM= github.com/antlr4-go/antlr/v4 v4.13.1 
h1:SqQKkuVZ+zWkMMNkjy5FZe5mr5WURWnlpmOuzYWrPrQ= @@ -44,10 +40,6 @@ github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/X github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg= github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U= github.com/fsnotify/fsnotify v1.4.7/go.mod h1:jwhsz4b93w/PPRr/qN1Yymfu8t87LnFCMoQvtojpjFo= -github.com/fsnotify/fsnotify v1.7.0 h1:8JEhPFa5W2WU7YfeZzPNqzMP6Lwt7L2715Ggo0nosvA= -github.com/fsnotify/fsnotify v1.7.0/go.mod h1:40Bi/Hjc2AVfZrqy+aj+yEI+/bRxZnMJyTJwOpGvigM= -github.com/fxamacker/cbor/v2 v2.7.0 h1:iM5WgngdRBanHcxugY4JySA0nk1wZorNOpTgCMedv5E= -github.com/fxamacker/cbor/v2 v2.7.0/go.mod h1:pxXPTn3joSm21Gbwsv0w9OSA2y1HFR9qXEeXQVeNoDQ= github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k= github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0= github.com/fxamacker/cbor/v2 v2.9.0 h1:NpKPmjDBgUfBms6tr6JZkTHtfFGcMKsw3eGcmD/sapM= @@ -65,19 +57,6 @@ github.com/go-logr/stdr v1.2.2 h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag= github.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE= github.com/go-logr/zapr v1.3.0 h1:XGdV8XW8zdwFiwOA2Dryh1gj2KRQyOOoNmBy4EplIcQ= github.com/go-logr/zapr v1.3.0/go.mod h1:YKepepNBd1u/oyhd/yQmtjVXmm9uML4IXUgMOwR8/Gg= -github.com/go-openapi/jsonpointer v0.21.0 h1:YgdVicSA9vH5RiHs9TZW5oyafXZFc6+2Vc1rr/O9oNQ= -github.com/go-openapi/jsonpointer v0.21.0/go.mod h1:IUyH9l/+uyhIYQ/PXVA41Rexl+kOkAPDdXEYns6fzUY= -github.com/go-openapi/jsonreference v0.21.0 h1:Rs+Y7hSXT83Jacb7kFyjn4ijOuVGSvOdF2+tg1TRrwQ= -github.com/go-openapi/jsonreference v0.21.0/go.mod h1:LmZmgsrTkVg9LG4EaHeY8cBDslNPMo06cago5JNLkm4= -github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+GrE= -github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ= -github.com/go-task/slim-sprig/v3 v3.0.0 
h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI= -github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= -github.com/gofrs/uuid v3.3.0+incompatible h1:8K4tyRfvU1CYPgJsveYFQMhpFd/wXNM7iK6rR7UHz84= -github.com/gofrs/uuid v3.3.0+incompatible/go.mod h1:b2aQJv3Z4Fp6yNu3cdSllBxTCLRxnplIgP/c0N/04lM= -github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q= -github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q= -github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= github.com/go-openapi/jsonpointer v0.22.4 h1:dZtK82WlNpVLDW2jlA1YCiVJFVqkED1MegOUy9kR5T4= github.com/go-openapi/jsonpointer v0.22.4/go.mod h1:elX9+UgznpFhgBuaMQ7iu4lvvX1nvNsesQ3oxmYTw80= github.com/go-openapi/jsonreference v0.21.4 h1:24qaE2y9bx/q3uRK/qN+TDwbok1NhbSmGjjySRCHtC8= @@ -116,6 +95,9 @@ github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1v github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= github.com/goccy/go-yaml v1.18.0 h1:8W7wMFS12Pcas7KU+VVkaiCng+kG8QiFeFwzFb+rwuw= github.com/goccy/go-yaml v1.18.0/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA= +github.com/gofrs/uuid v3.3.0+incompatible h1:8K4tyRfvU1CYPgJsveYFQMhpFd/wXNM7iK6rR7UHz84= +github.com/gofrs/uuid v3.3.0+incompatible/go.mod h1:b2aQJv3Z4Fp6yNu3cdSllBxTCLRxnplIgP/c0N/04lM= +github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U= github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek= github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps= github.com/google/btree v1.1.3 h1:CVpQJjYgC4VbzxeGVHfvZrv1ctoYCAI8vbl07Fcxlyg= @@ -133,28 +115,20 @@ github.com/google/pprof v0.0.0-20260115054156-294ebfa9ad83 h1:z2ogiKUYzX5Is6zr/v github.com/google/pprof v0.0.0-20260115054156-294ebfa9ad83/go.mod h1:MxpfABSjhmINe3F1It9d+8exIHFvUqtLIRCdOGNXqiI= 
github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= -github.com/gorilla/websocket v1.5.0 h1:PPwGk2jz7EePpoHN/+ClbZu8SPxiqlu12wZP/3sWmnc= -github.com/gorilla/websocket v1.5.0/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= -github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0 h1:bkypFPDjIYGfCYD5mRBvpqxfYX1YCS1PXdKYWi8FsN0= -github.com/grpc-ecosystem/grpc-gateway/v2 v2.20.0/go.mod h1:P+Lt/0by1T8bfcF3z737NnSbmxQAppXMRziHUxPOC8k= -github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU= github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674 h1:JeSE6pjso5THxAzdVpqr6/geYxZytqFMBCOtn/ujyeo= github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674/go.mod h1:r4w70xmWCQKmi1ONH4KIaBptdivuRPyosB9RmPlGEwA= github.com/grpc-ecosystem/grpc-gateway/v2 v2.27.1 h1:X5VWvz21y3gzm9Nw/kaUeku/1+uBhcekkmy4IkffJww= github.com/grpc-ecosystem/grpc-gateway/v2 v2.27.1/go.mod h1:Zanoh4+gvIgluNqcfMVTJueD4wSS5hT7zTt4Mrutd90= +github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU= github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= github.com/joshdk/go-junit v1.0.0 h1:S86cUKIdwBHWwA6xCmFlf3RTLfVXYQfvanM5Uh+K6GE= github.com/joshdk/go-junit v1.0.0/go.mod h1:TiiV0PqkaNfFXjEiyjWM3XXrhVyCa1K4Zfga6W52ung= github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM= github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo= -github.com/kisielk/errcheck v1.5.0/go.mod h1:pFxgyoBC7bSaBwPgfKdkLd5X25qrDl4LWUI2bnpBCr8= -github.com/kisielk/gotool v1.0.0/go.mod h1:XhKaO+MFFWcvkIS/tQcRk01m1F5IRFswLeQ+oQHNcck= -github.com/klauspost/compress v1.17.11 h1:In6xLpyWOi1+C7tXUUWv2ot1QvBjxevKAaI6IXrJmUc= -github.com/klauspost/compress 
v1.17.11/go.mod h1:pMDklpSncoRMuLFrf1W9Ss9KT+0rH90U12bZKk7uwG0= -github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo= github.com/klauspost/compress v1.18.0 h1:c/Cqfb0r+Yi+JtIEq73FWXVkRonBlf0CRNYc8Zttxdo= github.com/klauspost/compress v1.18.0/go.mod h1:2Pp+KzxcywXVXMr50+X0Q/Lsb43OQHYWRCY2AiWywWQ= +github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo= github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE= github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk= github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ= @@ -165,18 +139,14 @@ github.com/kubernetes-csi/external-snapshotter/client/v8 v8.4.0 h1:bMqrb3UHgHbP+ github.com/kubernetes-csi/external-snapshotter/client/v8 v8.4.0/go.mod h1:E3vdYxHj2C2q6qo8/Da4g7P+IcwqRZyy3gJBzYybV9Y= github.com/kylelemons/godebug v1.1.0 h1:RPNrshWIDI6G2gRW9EHilWtl7Z6Sb1BR0xunSBf0SNc= github.com/kylelemons/godebug v1.1.0/go.mod h1:9/0rRGxNHcop5bhtWyNeEfOS8JIWk580+fNqagV/RAw= -github.com/lib/pq v1.10.9 h1:YXG7RB+JIjhP29X+OtkiDnYaXQwpS4JEWq7dtCCRUEw= -github.com/lib/pq v1.10.9/go.mod h1:AlVN5x4E4T544tWzH6hKfbfQvm3HdbOxrmggDNAPY9o= -github.com/mailru/easyjson v0.9.0 h1:PrnmzHw7262yW8sTBwxi1PdJA3Iw/EKBa8psRf7d9a4= -github.com/mailru/easyjson v0.9.0/go.mod h1:1+xMtQp2MRNVL/V1bOzuP3aP8VNwRW55fQUto+XFtTU= -github.com/microsoft/ApplicationInsights-Go v0.4.4 h1:G4+H9WNs6ygSCe6sUyxRc2U81TI5Es90b2t/MwX5KqY= -github.com/microsoft/ApplicationInsights-Go v0.4.4/go.mod h1:fKRUseBqkw6bDiXTs3ESTiU/4YTIHsQS4W3fP2ieF4U= github.com/lib/pq v1.11.1 h1:wuChtj2hfsGmmx3nf1m7xC2XpK6OtelS2shMY+bGMtI= github.com/lib/pq v1.11.1/go.mod h1:/p+8NSbOcwzAEI7wiMXFlgydTwcgTr3OSKMsD2BitpA= github.com/maruel/natural v1.1.1 h1:Hja7XhhmvEFhcByqDoHz9QZbkWey+COd9xWfCfn1ioo= github.com/maruel/natural v1.1.1/go.mod h1:v+Rfd79xlw1AgVBjbO0BEQmptqb5HvL/k9GRHB7ZKEg= github.com/mfridman/tparse v0.18.0 h1:wh6dzOKaIwkUGyKgOntDW4liXSo37qg5AXbIhkMV3vE= 
github.com/mfridman/tparse v0.18.0/go.mod h1:gEvqZTuCgEhPbYk/2lS3Kcxg1GmTxxU7kTC8DvP0i/A= +github.com/microsoft/ApplicationInsights-Go v0.4.4 h1:G4+H9WNs6ygSCe6sUyxRc2U81TI5Es90b2t/MwX5KqY= +github.com/microsoft/ApplicationInsights-Go v0.4.4/go.mod h1:fKRUseBqkw6bDiXTs3ESTiU/4YTIHsQS4W3fP2ieF4U= github.com/moby/spdystream v0.5.0 h1:7r0J1Si3QO/kjRitvSLVVFUjxMEb/YLj6S9FF62JBCU= github.com/moby/spdystream v0.5.0/go.mod h1:xBAYlnt/ay+11ShkdFKNAG7LsyK/tmNBVvVOwrfMgdI= github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q= @@ -191,13 +161,9 @@ github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f h1:y5//uYreIhSUg3J github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f/go.mod h1:ZdcZmHo+o7JKHSa8/e818NopupXU1YMK5fe1lsApnBw= github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= github.com/onsi/ginkgo v1.8.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= -github.com/onsi/ginkgo/v2 v2.22.2 h1:/3X8Panh8/WwhU/3Ssa6rCKqPLuAkVY2I0RoyDLySlU= -github.com/onsi/ginkgo/v2 v2.22.2/go.mod h1:oeMosUL+8LtarXBHu/c0bx2D/K9zyQ6uX3cTyztHwsk= -github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY= -github.com/onsi/gomega v1.36.2 h1:koNYke6TVk6ZmnyHrCXba/T/MoLBXFjeC1PtvYgw0A8= -github.com/onsi/gomega v1.36.2/go.mod h1:DdwyADRjrc825LhMEkD76cHR5+pUnjhUN8GlHlRPHzY= github.com/onsi/ginkgo/v2 v2.28.1 h1:S4hj+HbZp40fNKuLUQOYLDgZLwNUVn19N3Atb98NCyI= github.com/onsi/ginkgo/v2 v2.28.1/go.mod h1:CLtbVInNckU3/+gC8LzkGUb9oF+e8W8TdUsxPwvdOgE= +github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY= github.com/onsi/gomega v1.39.1 h1:1IJLAad4zjPn2PsnhH70V4DKRFlrCzGBNrNaru+Vf28= github.com/onsi/gomega v1.39.1/go.mod h1:hL6yVALoTOxeWudERyfppUcZXjMwIMLnuSfruD2lcfg= github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4= @@ -277,64 +243,6 @@ go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto= 
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE= go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0= go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y= -go.uber.org/zap v1.27.0 h1:aJMhYGrd5QSmlpLMr2MftRKl7t8J8PTZPA732ud/XR8= -go.uber.org/zap v1.27.0/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E= -golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= -golang.org/x/crypto v0.0.0-20191011191535-87dc89f01550/go.mod h1:yigFU9vqHzYiE8UmvKecakEJjdnWj3jj499lnFckfCI= -golang.org/x/crypto v0.0.0-20200622213623-75b288015ac9/go.mod h1:LzIPMQfyMNhhGPhUkYOs5KpL4U8rLKemX1yGLhDgUto= -golang.org/x/exp v0.0.0-20241004190924-225e2abe05e6 h1:1wqE9dj9NpSm04INVsJhhEUzhuDVjbcyKH91sVyPATw= -golang.org/x/exp v0.0.0-20241004190924-225e2abe05e6/go.mod h1:NQtJDoLvd6faHhE7m4T/1IY708gDefGGjR/iUW8yQQ8= -golang.org/x/mod v0.2.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= -golang.org/x/mod v0.3.0/go.mod h1:s0Qsj1ACt9ePp/hMypM3fl4fZqREWJwdYDEqhRiZZUA= -golang.org/x/net v0.0.0-20180906233101-161cd47e91fd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= -golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= -golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= -golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= -golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= -golang.org/x/net v0.38.0 h1:vRMAPTMaeGqVhG5QyLJHqNDwecKTomGeqbnfZyKlBI8= -golang.org/x/net v0.38.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8= -golang.org/x/oauth2 v0.27.0 h1:da9Vo7/tDv5RH/7nZDz1eMGS/q1Vv1N/7FCrBhI9I3M= -golang.org/x/oauth2 v0.27.0/go.mod h1:onh5ek6nERTohokkhCD/y2cV4Do3fxFHFuAejCkRWT8= -golang.org/x/sync 
v0.0.0-20180314180146-1d60e4601c6f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.0.0-20190911185100-cd5d95a43a6e/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.0.0-20201020160332-67f06af15bc9/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= -golang.org/x/sync v0.12.0 h1:MHc5BpPuC30uJk597Ri8TV3CNZcTLu6B6z4lJy+g6Jw= -golang.org/x/sync v0.12.0/go.mod h1:1dzgHSNfp02xaA81J2MS99Qcpr2w7fw1gpm99rleRqA= -golang.org/x/sys v0.0.0-20180909124046-d0be0721c37e/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= -golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= -golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= -golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= -golang.org/x/sys v0.31.0 h1:ioabZlmFYtWhL+TRYpcnNlLwhyxaM9kWTDEmfnprqik= -golang.org/x/sys v0.31.0/go.mod h1:BJP2sWEmIv4KK5OTEluFJCKSidICx8ciO85XgH3Ak8k= -golang.org/x/term v0.30.0 h1:PQ39fJZ+mfadBm0y5WlL4vlM7Sx1Hgf13sMIY2+QS9Y= -golang.org/x/term v0.30.0/go.mod h1:NYYFdzHoI5wRh/h5tDMdMqCqPJZEuNqVR5xJLd/n67g= -golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= -golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= -golang.org/x/text v0.23.0 h1:D71I7dUrlY+VX0gQShAThNGHFxZ13dGLBHQLVl1mJlY= -golang.org/x/text v0.23.0/go.mod h1:/BLNzu4aZCJ1+kcD0DNRotWKage4q2rGVAg4o22unh4= -golang.org/x/time v0.9.0 h1:EsRrnYcQiGH+5FfbgvV4AP7qEZstoyrHB0DzarOQ4ZY= -golang.org/x/time v0.9.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM= -golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= -golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod 
h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= -golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= -golang.org/x/tools v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= -golang.org/x/tools v0.28.0 h1:WuB6qZ4RPCQo5aP3WdKZS7i595EdWqWR8vqJTlwTVK8= -golang.org/x/tools v0.28.0/go.mod h1:dcIOrVd3mfQKTgrDVQHqCPMWy6lnhfhtX3hLXYVLfRw= -golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw= -gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY= -google.golang.org/genproto/googleapis/api v0.0.0-20241202173237-19429a94021a h1:OAiGFfOiA0v9MRYsSidp3ubZaBnteRUyn3xB2ZQ5G/E= -google.golang.org/genproto/googleapis/api v0.0.0-20241202173237-19429a94021a/go.mod h1:jehYqy3+AhJU9ve55aNOaSml7wUXjF9x6z2LcCfpAhY= -google.golang.org/genproto/googleapis/rpc v0.0.0-20241202173237-19429a94021a h1:hgh8P4EuoxpsuKMXX/To36nOFD7vixReXgn8lPGnt+o= -google.golang.org/genproto/googleapis/rpc v0.0.0-20241202173237-19429a94021a/go.mod h1:5uTbfoYQed2U9p3KIj2/Zzm02PYhndfdmML0qC3q3FU= -google.golang.org/grpc v1.70.0 h1:pWFv03aZoHzlRKHWicjsZytKAiYCtNS0dHbXnIdq7jQ= -google.golang.org/grpc v1.70.0/go.mod h1:ofIJqVKDXx/JiXrwr2IG4/zwdH9txy3IlF40RmcJSQw= -google.golang.org/protobuf v1.36.5 h1:tPhr+woSbjfYvY6/GPufUoYizxw1cF/yFoxJ2fmpwlM= -google.golang.org/protobuf v1.36.5/go.mod h1:9fA7Ob0pmnwhb644+1+CVWFRbNajQ6iRojtC/QF5bRE= go.uber.org/zap v1.27.1 h1:08RqriUEv8+ArZRYSTXy1LeBScaMpVSTBhCeaZYfMYc= go.uber.org/zap 
v1.27.1/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E= go.yaml.in/yaml/v2 v2.4.3 h1:6gvOSjQoTB3vt1l+CU+tSyi/HOjfOjRLJ4YwYZGwRO0= @@ -345,16 +253,20 @@ golang.org/x/exp v0.0.0-20250718183923-645b1fa84792 h1:R9PFI6EUdfVKgwKjZef7QIwGc golang.org/x/exp v0.0.0-20250718183923-645b1fa84792/go.mod h1:A+z0yzpGtvnG90cToK5n2tu8UJVP2XUATh+r+sfOOOc= golang.org/x/mod v0.32.0 h1:9F4d3PHLljb6x//jOyokMv3eX+YDeepZSEo3mFJy93c= golang.org/x/mod v0.32.0/go.mod h1:SgipZ/3h2Ci89DlEtEXWUk/HteuRin+HHhN+WbNhguU= +golang.org/x/net v0.0.0-20180906233101-161cd47e91fd/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4= golang.org/x/net v0.49.0 h1:eeHFmOGUTtaaPSGNmjBKpbng9MulQsJURQUAfUwY++o= golang.org/x/net v0.49.0/go.mod h1:/ysNB2EvaqvesRkuLAyjI1ycPZlQHM3q01F02UY/MV8= golang.org/x/oauth2 v0.34.0 h1:hqK/t4AKgbqWkdkcAeI8XLmbK+4m4G5YeQRrmiotGlw= golang.org/x/oauth2 v0.34.0/go.mod h1:lzm5WQJQwKZ3nwavOZ3IS5Aulzxi68dUSgRHujetwEA= +golang.org/x/sync v0.0.0-20180314180146-1d60e4601c6f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.19.0 h1:vV+1eWNmZ5geRlYjzm2adRgW2/mcpevXNg50YZtPCE4= golang.org/x/sync v0.19.0/go.mod h1:9KTHXmSnoGruLpwFjVSX0lNNA75CykiMECbovNTZqGI= +golang.org/x/sys v0.0.0-20180909124046-d0be0721c37e/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.40.0 h1:DBZZqJ2Rkml6QMQsZywtnjnnGvHza6BTfYFWY9kjEWQ= golang.org/x/sys v0.40.0/go.mod h1:OgkHotnGiDImocRcuBABYBEXf8A9a87e/uXjp9XT3ks= golang.org/x/term v0.39.0 h1:RclSuaJf32jOqZz74CkPA9qFuVTX7vhLlpfj/IGWlqY= golang.org/x/term v0.39.0/go.mod h1:yxzUCTP/U+FzoxfdKmLaA0RV1WgE0VY7hXBwKtY/4ww= +golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= golang.org/x/text v0.33.0 h1:B3njUFyqtHDUI5jMn1YIr5B0IE2U0qck04r6d4KPAxE= golang.org/x/text v0.33.0/go.mod h1:LuMebE6+rBincTi9+xWTY8TztLzKHc/9C1uBCG27+q8= golang.org/x/time v0.14.0 h1:MRx4UaLrDotUKUdCIqzPC48t1Y9hANFKIRpNx+Te8PI= @@ -377,11 +289,9 @@ gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod 
h1:Co6ibVJAznAaIkqp8 gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntNwaWcugrBjAiHlqqRiVk= gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= -gopkg.in/evanphx/json-patch.v4 v4.12.0 h1:n6jtcsulIzXPJaxegRbvFNNrZDjbij7ny3gmSPG+6V4= -gopkg.in/evanphx/json-patch.v4 v4.12.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M= -gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys= gopkg.in/evanphx/json-patch.v4 v4.13.0 h1:czT3CmqEaQ1aanPc5SdlgQrrEIb8w/wwCvWWnfEbYzo= gopkg.in/evanphx/json-patch.v4 v4.13.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M= +gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys= gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc= gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw= diff --git a/operator/src/internal/controller/backup_controller.go b/operator/src/internal/controller/backup_controller.go index 1eb7efbf..905c80fe 100644 --- a/operator/src/internal/controller/backup_controller.go +++ b/operator/src/internal/controller/backup_controller.go @@ -197,7 +197,14 @@ func (r *BackupReconciler) createCNPGBackup(ctx context.Context, backup *dbprevi r.Recorder.Event(backup, "Normal", "BackupInitialized", "Successfully initialized backup") // Track backup creation telemetry - r.trackBackupCreated(ctx, backup, cluster, "on-demand") + // Determine backup type from labels - scheduled backups have the "scheduledbackup" label + backupType := "on-demand" + if backup.Labels != nil { + if v, ok := backup.Labels["scheduledbackup"]; ok && v != "" { + backupType = "scheduled" + } + } + r.trackBackupCreated(ctx, backup, cluster, backupType) // 
Requeue to check status return ctrl.Result{RequeueAfter: 5 * time.Second}, nil @@ -293,6 +300,10 @@ func (r *BackupReconciler) trackBackupCreated(ctx context.Context, backup *dbpre if cluster.Spec.Backup != nil && cluster.Spec.Backup.RetentionDays > 0 { retentionDays = cluster.Spec.Backup.RetentionDays } + // Check if backup has its own retention override + if backup.Spec.RetentionDays != nil && *backup.Spec.RetentionDays > 0 { + retentionDays = *backup.Spec.RetentionDays + } r.TelemetryMgr.Events.TrackBackupCreated(telemetry.BackupCreatedEvent{ BackupID: backupID, diff --git a/operator/src/internal/controller/documentdb_controller.go b/operator/src/internal/controller/documentdb_controller.go index 720ca822..6a01f2ca 100644 --- a/operator/src/internal/controller/documentdb_controller.go +++ b/operator/src/internal/controller/documentdb_controller.go @@ -70,28 +70,34 @@ var reconcileMutex sync.Mutex // +kubebuilder:rbac:groups="",resources=events,verbs=create;patch // +kubebuilder:rbac:groups="",resources=persistentvolumeclaims,verbs=get;list;watch;create;delete // +kubebuilder:rbac:groups="",resources=persistentvolumes,verbs=get;list;watch;update;patch -func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) { +func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, retErr error) { reconcileStart := time.Now() reconcileMutex.Lock() defer reconcileMutex.Unlock() logger := log.FromContext(ctx) + // Track reconciliation duration at the end using named return value + defer func() { + r.trackReconcileDuration(ctx, "DocumentDB", "reconcile", time.Since(reconcileStart).Seconds(), retErr == nil) + }() + // Fetch the DocumentDB instance documentdb := &dbpreview.DocumentDB{} - err := r.Get(ctx, req.NamespacedName, documentdb) - if err != nil { + if err := r.Get(ctx, req.NamespacedName, documentdb); err != nil { if errors.IsNotFound(err) { // DocumentDB resource not found, handle 
cleanup logger.Info("DocumentDB resource not found. Cleaning up associated resources.") - if err := r.cleanupResources(ctx, req); err != nil { - return ctrl.Result{}, err + if cleanupErr := r.cleanupResources(ctx, req); cleanupErr != nil { + retErr = cleanupErr + return result, retErr } - return ctrl.Result{}, nil + return result, nil } logger.Error(err, "Failed to get DocumentDB resource") r.trackReconcileError(ctx, "DocumentDB", req.Name, req.Namespace, "get-resource", err) - return ctrl.Result{}, err + retErr = err + return result, retErr } // Ensure cluster has telemetry ID @@ -101,14 +107,12 @@ func (r *DocumentDBReconciler) Reconcile(ctx context.Context, req ctrl.Request) } } - // Track reconciliation at the end - defer func() { - r.trackReconcileDuration(ctx, "DocumentDB", "reconcile", time.Since(reconcileStart).Seconds(), err == nil) - }() - // Handle finalizer lifecycle (add on create, remove on delete) - if done, result, err := r.reconcileFinalizer(ctx, documentdb); done || err != nil { - return result, err + if done, res, err := r.reconcileFinalizer(ctx, documentdb); done || err != nil { + if err != nil { + retErr = err + } + return res, retErr } replicationContext, err := util.GetReplicationContext(ctx, r.Client, *documentdb) @@ -658,14 +662,17 @@ func (r *DocumentDBReconciler) executeSQLCommand(ctx context.Context, cluster *c } // trackReconcileError tracks reconciliation errors to telemetry. +// Note: ResourceID is omitted to avoid PII - errors are tracked by namespace hash and error type only. 
func (r *DocumentDBReconciler) trackReconcileError(ctx context.Context, resourceType, resourceName, namespace, errorType string, err error) { if r.TelemetryMgr == nil || !r.TelemetryMgr.IsEnabled() { return } + // Do not include resourceName as it may contain PII (user-provided names) + // Errors can be correlated by namespace_hash + error_type + timestamp r.TelemetryMgr.Events.TrackReconciliationError(telemetry.ReconciliationErrorEvent{ ResourceType: resourceType, - ResourceID: resourceName, // Will be replaced with GUID when available + ResourceID: "", // Omitted to avoid PII - use namespace_hash for correlation NamespaceHash: telemetry.HashNamespace(namespace), ErrorType: errorType, ErrorMessage: sanitizeError(err), @@ -749,17 +756,39 @@ func (r *DocumentDBReconciler) TrackClusterCreated(ctx context.Context, document }) } -// sanitizeError removes potential PII from error messages. +// sanitizeError returns a coarse, non-PII classification of the error. +// Per telemetry spec, we do not include raw error text to avoid leaking PII or sensitive data. func sanitizeError(err error) string { if err == nil { return "" } - msg := err.Error() - // Truncate long messages - if len(msg) > 200 { - msg = msg[:200] + "..." + + // Map well-known Kubernetes/API error types to generic, non-PII messages. + switch { + case errors.IsNotFound(err): + return "resource not found" + case errors.IsAlreadyExists(err): + return "resource already exists" + case errors.IsForbidden(err): + return "forbidden" + case errors.IsUnauthorized(err): + return "unauthorized" + case errors.IsConflict(err): + return "conflict" + case errors.IsTimeout(err): + return "timeout" + case errors.IsInvalid(err): + return "invalid resource" + case errors.IsServerTimeout(err): + return "server timeout" + case errors.IsServiceUnavailable(err): + return "service unavailable" + default: + // Do not include the raw error text to avoid leaking PII or sensitive data. 
+ return "unknown error" } - return msg +} + // reconcilePVRecovery handles recovery from a retained PersistentVolume. // // CNPG only supports recovery from PVC (via VolumeSnapshots.Storage with Kind: PersistentVolumeClaim), diff --git a/operator/src/internal/telemetry/client.go b/operator/src/internal/telemetry/client.go index d449e439..94a3fdb1 100644 --- a/operator/src/internal/telemetry/client.go +++ b/operator/src/internal/telemetry/client.go @@ -205,6 +205,8 @@ func parseInstrumentationKeyFromConnectionString(connStr string) string { // Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;... for _, part := range strings.Split(connStr, ";") { + // Trim whitespace to handle cases like "; InstrumentationKey=..." or copy-paste errors + part = strings.TrimSpace(part) if strings.HasPrefix(part, "InstrumentationKey=") { return strings.TrimPrefix(part, "InstrumentationKey=") } @@ -221,6 +223,8 @@ func parseIngestionEndpointFromConnectionString(connStr string) string { // Connection string format: InstrumentationKey=xxx;IngestionEndpoint=xxx;... for _, part := range strings.Split(connStr, ";") { + // Trim whitespace to handle cases like "; IngestionEndpoint=..." 
or copy-paste errors + part = strings.TrimSpace(part) if strings.HasPrefix(part, "IngestionEndpoint=") { return strings.TrimPrefix(part, "IngestionEndpoint=") } diff --git a/operator/src/internal/telemetry/utils.go b/operator/src/internal/telemetry/utils.go index 2c95c533..f86ee195 100644 --- a/operator/src/internal/telemetry/utils.go +++ b/operator/src/internal/telemetry/utils.go @@ -55,18 +55,33 @@ func CategorizeScheduleFrequency(cronExpr string) ScheduleFrequency { // Simple heuristics for common patterns minute, hour, dayOfMonth, _, dayOfWeek := parts[0], parts[1], parts[2], parts[3], parts[4] - // Hourly: runs every hour (e.g., "0 * * * *") - if minute != "*" && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*" { + // Check for step expressions (e.g., */5) or ranges (e.g., 1-5) which indicate custom schedules + isSimpleValue := func(s string) bool { + // Returns true only for single numeric values (e.g., "0", "15", "23") + if s == "*" { + return false + } + for _, c := range s { + if c < '0' || c > '9' { + return false // Contains '*', '/', ',', '-', or another special character + } + } + return len(s) > 0 + } + + // Hourly: runs every hour at a fixed minute (e.g., "0 * * * *" or "30 * * * *") + // Must be a simple numeric minute, not */5 or similar + if isSimpleValue(minute) && hour == "*" && dayOfMonth == "*" && dayOfWeek == "*" { return ScheduleFrequencyHourly } - // Daily: runs once per day (e.g., "0 2 * * *") - if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek == "*" { + // Daily: runs once per day at a fixed time (e.g., "0 2 * * *") + if isSimpleValue(minute) && isSimpleValue(hour) && dayOfMonth == "*" && dayOfWeek == "*" { return ScheduleFrequencyDaily } // Weekly: runs once per week (e.g., "0 2 * * 0") - if minute != "*" && hour != "*" && dayOfMonth == "*" && dayOfWeek != "*" { + if isSimpleValue(minute) && isSimpleValue(hour) && dayOfMonth == "*" && isSimpleValue(dayOfWeek) { return ScheduleFrequencyWeekly } From 
c5b0719d9a2e432ab041df4b4b5f5fe09bf081a7 Mon Sep 17 00:00:00 2001 From: Ritvik Jayaswal Date: Mon, 9 Mar 2026 13:55:33 -0400 Subject: [PATCH 6/7] Remove accidentally added github-secrets-telemetry-setup.md --- .../designs/github-secrets-telemetry-setup.md | 223 ------------------ 1 file changed, 223 deletions(-) delete mode 100644 docs/designs/github-secrets-telemetry-setup.md diff --git a/docs/designs/github-secrets-telemetry-setup.md b/docs/designs/github-secrets-telemetry-setup.md deleted file mode 100644 index 827f6d60..00000000 --- a/docs/designs/github-secrets-telemetry-setup.md +++ /dev/null @@ -1,223 +0,0 @@ -# GitHub Secrets Setup for Application Insights Telemetry - -This document describes how to configure GitHub secrets for Application Insights telemetry collection in the DocumentDB Kubernetes Operator CI/CD pipeline. - -## Overview - -The DocumentDB Operator uses Application Insights to collect anonymous telemetry data about operator usage patterns. This helps the team understand: - -- How many people use the operator -- Which cloud providers are most common (AKS, EKS, GKE) -- Common cluster configurations -- Error patterns and operational issues - -To enable telemetry in CI/CD workflows, you need to configure a GitHub secret containing the Application Insights connection string. - -## Prerequisites - -1. An Azure Application Insights resource -2. 
Admin access to the GitHub repository (to create secrets) - -## Step 1: Create Application Insights Resource - -If you don't have an Application Insights resource, create one in Azure: - -```bash -# Create a resource group (if needed) -az group create --name documentdb-telemetry-rg --location eastus2 - -# Create Application Insights resource -az monitor app-insights component create \ - --app documentdb-operator-telemetry \ - --location eastus2 \ - --resource-group documentdb-telemetry-rg \ - --kind web \ - --application-type web -``` - -## Step 2: Get the Connection String - -Retrieve the connection string from your Application Insights resource: - -```bash -az monitor app-insights component show \ - --app documentdb-operator-telemetry \ - --resource-group documentdb-telemetry-rg \ - --query connectionString \ - --output tsv -``` - -The connection string will look like: -``` -InstrumentationKey=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx;IngestionEndpoint=https://eastus2-2.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus2.livediagnostics.monitor.azure.com/;ApplicationId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx -``` - -## Step 3: Create GitHub Secret - -### Via GitHub UI - -1. Navigate to your repository on GitHub -2. Go to **Settings** → **Secrets and variables** → **Actions** -3. Click **New repository secret** -4. Set the following: - - **Name**: `APPINSIGHTS_CONNECTION_STRING` - - **Secret**: Paste the connection string from Step 2 -5. Click **Add secret** - -### Via GitHub CLI - -```bash -# Authenticate with GitHub CLI (if not already) -gh auth login - -# Set the secret -gh secret set APPINSIGHTS_CONNECTION_STRING --body "InstrumentationKey=xxx;IngestionEndpoint=https://..." 
-``` - -## Step 4: Use the Secret in GitHub Actions - -Reference the secret in your GitHub Actions workflow: - -```yaml -# .github/workflows/test-integration.yml -name: Integration Tests - -on: - push: - branches: [main] - pull_request: - branches: [main] - -jobs: - test: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - - name: Build and Deploy Operator - env: - APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.APPINSIGHTS_CONNECTION_STRING }} - run: | - # Build operator image with telemetry enabled - make docker-build - - # Deploy with telemetry connection string - helm install documentdb-operator ./operator/documentdb-helm-chart \ - --set telemetry.enabled=true \ - --set telemetry.connectionString="${APPLICATIONINSIGHTS_CONNECTION_STRING}" -``` - -## Step 5: Helm Chart Configuration - -The operator Helm chart supports telemetry configuration via these values: - -```yaml -# values.yaml -telemetry: - # Enable/disable telemetry collection - enabled: true - - # Option 1: Direct connection string (for CI/CD, injected from secrets) - connectionString: "" - - # Option 2: Instrumentation key only - instrumentationKey: "" - - # Option 3: Use an existing Kubernetes secret - existingSecret: "" -``` - -### CI/CD Deployment Example - -```bash -# Deploy with connection string from environment variable -helm upgrade --install documentdb-operator ./operator/documentdb-helm-chart \ - --namespace documentdb-operator \ - --create-namespace \ - --set telemetry.enabled=true \ - --set "telemetry.connectionString=${APPLICATIONINSIGHTS_CONNECTION_STRING}" -``` - -## Alternative: Using Kubernetes Secrets - -For production deployments, you may want to store the connection string in a Kubernetes secret: - -```yaml -# Create a secret with the connection string -apiVersion: v1 -kind: Secret -metadata: - name: documentdb-telemetry-secret - namespace: documentdb-operator -type: Opaque -stringData: - APPLICATIONINSIGHTS_CONNECTION_STRING: 
"InstrumentationKey=xxx;IngestionEndpoint=https://..." -``` - -Then reference it in Helm: - -```yaml -# values.yaml -telemetry: - enabled: true - existingSecret: "documentdb-telemetry-secret" -``` - -## Security Considerations - -1. **Secret Rotation**: Rotate the Application Insights key periodically -2. **Scope**: Use repository-level secrets (not organization-level) for better isolation -3. **Access Control**: Limit who can view/edit repository secrets -4. **Environment-Specific**: Consider using different App Insights resources for dev/prod - -## Verifying Telemetry Collection - -After deployment, verify telemetry is being sent: - -1. Check operator logs for telemetry transmission: - ```bash - kubectl logs -n documentdb-operator deployment/documentdb-operator | grep -i "telemetry\|appinsights" - ``` - -2. Query Application Insights: - ```kusto - // In Azure Portal > Application Insights > Logs - customEvents - | where timestamp > ago(1h) - | where name startswith "documentdb" - | summarize count() by name - ``` - -## Troubleshooting - -### Telemetry Not Appearing - -1. Verify the connection string is correct: - ```bash - echo $APPLICATIONINSIGHTS_CONNECTION_STRING | grep "InstrumentationKey=" - ``` - -2. Check if the operator has the environment variable: - ```bash - kubectl exec -n documentdb-operator deployment/documentdb-operator -- env | grep APPINSIGHTS - ``` - -3. Check operator logs for errors: - ```bash - kubectl logs -n documentdb-operator deployment/documentdb-operator | grep -i error - ``` - -### Ingestion Delays - -Application Insights has a batching interval (default: 30 seconds). Events may take up to a few minutes to appear in the portal. 
- -## Data Privacy - -The telemetry system is designed with privacy in mind: - -- **No PII**: No cluster names, namespaces, IP addresses, or user-provided identifiers -- **Hashed namespaces**: Namespace names are SHA-256 hashed -- **GUIDs for correlation**: Auto-generated GUIDs are used instead of resource names -- **Categorized errors**: Error messages are categorized, not raw strings - -See [appinsights-metrics.md](./appinsights-metrics.md) for the complete list of collected telemetry. From c3b92fe662d348dc2bc049baf040fcfe75a57858 Mon Sep 17 00:00:00 2001 From: Ritvik Jayaswal Date: Mon, 9 Mar 2026 13:58:08 -0400 Subject: [PATCH 7/7] feat: Inject App Insights connection string in release workflow - Add step to inject APPINSIGHTS_CONNECTION_STRING secret into Helm chart values.yaml - Only runs when the secret is configured in the repository - Enables automatic telemetry for all release users without exposing secrets in source code --- .github/workflows/release_images.yml | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/.github/workflows/release_images.yml b/.github/workflows/release_images.yml index 6a6f5ce3..7d8e2611 100644 --- a/.github/workflows/release_images.yml +++ b/.github/workflows/release_images.yml @@ -97,6 +97,13 @@ jobs: echo "Updated values.yaml content:" cat operator/documentdb-helm-chart/values.yaml + - name: Inject telemetry connection string + if: ${{ secrets.APPINSIGHTS_CONNECTION_STRING != '' }} + run: | + echo "Injecting Application Insights connection string for telemetry" + # Use sed to update the connectionString field in values.yaml + sed -i 's|connectionString: ""|connectionString: "${{ secrets.APPINSIGHTS_CONNECTION_STRING }}"|g' operator/documentdb-helm-chart/values.yaml + - name: Set chart version run: | echo "CHART_VERSION=${{ github.event.inputs.version }}" >> $GITHUB_ENV
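The whitespace-trimming change to the connection-string parsers in `telemetry/client.go` can be exercised in isolation. A minimal Go sketch of the same parsing logic — `parseConnectionStringField` is an illustrative generic helper, not the operator's actual API, which uses separate per-field functions:

```go
package main

import (
	"fmt"
	"strings"
)

// parseConnectionStringField extracts the value of a given key from an
// Application Insights connection string of the form
// "InstrumentationKey=xxx;IngestionEndpoint=xxx;...".
// Each ';'-separated part is trimmed of surrounding whitespace so that
// inputs such as "; InstrumentationKey=..." (copy-paste artifacts) still parse.
func parseConnectionStringField(connStr, key string) string {
	prefix := key + "="
	for _, part := range strings.Split(connStr, ";") {
		part = strings.TrimSpace(part)
		if strings.HasPrefix(part, prefix) {
			return strings.TrimPrefix(part, prefix)
		}
	}
	return ""
}

func main() {
	cs := "InstrumentationKey=abc-123; IngestionEndpoint=https://eastus2-2.in.applicationinsights.azure.com/"
	fmt.Println(parseConnectionStringField(cs, "InstrumentationKey")) // abc-123
	fmt.Println(parseConnectionStringField(cs, "IngestionEndpoint"))  // https://eastus2-2.in.applicationinsights.azure.com/
}
```

Without the `TrimSpace` call, the second lookup above would fail, since the part after the first `;` begins with a space and no longer matches the `IngestionEndpoint=` prefix.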