|
| 1 | +# AEP-8026: Allow per-VPA component configuration parameters |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Summary](#summary) |
| 5 | +- [Motivation](#motivation) |
| 6 | + - [Goals](#goals) |
| 7 | + - [Non-Goals](#non-goals) |
| 8 | +- [Proposal](#proposal) |
| 9 | + - [Parameter Descriptions](#parameter-descriptions) |
| 10 | + - [Container Policy Parameters](#container-policy-parameters) |
| 11 | + - [Update Policy Parameters](#update-policy-parameters) |
| 12 | +- [Design Details](#design-details) |
| 13 | + - [API Changes](#api-changes) |
| 14 | + - [Phase 1 (Current Proposal)](#phase-1-current-proposal) |
| 15 | + - [Future Extensions](#future-extensions) |
| 16 | + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) |
| 17 | + - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster) |
| 18 | + - [Kubernetes version compatibility](#kubernetes-version-compatibility) |
| 19 | + - [Validation via CEL and Testing](#validation-via-cel-and-testing) |
| 20 | + - [Test Plan](#test-plan) |
| 21 | +- [Implementation History](#implementation-history) |
| 22 | +- [Future Work](#future-work) |
| 23 | +- [Alternatives](#alternatives) |
| 24 | + - [Multiple VPA Deployments](#multiple-vpa-deployments) |
| 25 | + - [Environment-Specific Configuration](#environment-specific-configuration) |
| 26 | +<!-- /toc --> |
| 27 | + |
| 28 | +## Summary |
| 29 | + |
| 30 | +Currently, VPA components (recommender, updater, admission controller) are configured through global flags. This makes it challenging to support different workloads with varying resource optimization needs within the same cluster. This proposal introduces the ability to specify configuration parameters at the individual VPA object level, allowing for workload-specific optimization strategies. |
| 31 | + |
| 32 | +## Motivation |
| 33 | + |
| 34 | +Different types of workloads in a Kubernetes cluster often have different resource optimization requirements. For example: |
| 35 | +- Batch processing jobs might benefit from aggressive OOM handling and frequent adjustments |
| 36 | +- User-facing services might need more conservative growth patterns for stability |
| 37 | +- Development environments might need different settings than production |
| 38 | + |
| 39 | +Currently, supporting these different needs requires running multiple VPA component instances with different configurations, which increases operational complexity and resource usage. |
| 40 | + |
| 41 | +### Goals |
| 42 | + |
| 43 | +- Allow specification of component-specific parameters in individual VPA objects |
| 44 | +- Support different optimization strategies for different workloads in the same cluster |
| 45 | +- Maintain backward compatibility with existing global configuration |
| 46 | +- Initially support the following parameters: |
| 47 | + - oomBumpUpRatio |
| 48 | + - oomMinBumpUp |
| 49 | + - memoryAggregationInterval |
| 50 | + - memoryAggregationIntervalCount |
| 51 | + - evictAfterOOMThreshold |
| 52 | + |
| 53 | +### Non-Goals |
| 54 | + |
| 55 | +- Converting all existing VPA flags to per-object configuration |
| 56 | +- Changing the core VPA algorithm or its decision-making process |
| 57 | +- Adding new optimization strategies |
| 58 | + |
| 59 | +## Proposal |
| 60 | + |
| 61 | +The configuration will be split into two sections: container-specific recommendations under `containerPolicies` and updater configuration under `updatePolicy`. This structure is designed to be extensible, allowing for additional parameters to be added in future iterations of this enhancement. |
| 62 | + |
| 63 | +```yaml |
| 64 | +apiVersion: autoscaling.k8s.io/v1 |
| 65 | +kind: VerticalPodAutoscaler |
| 66 | +metadata: |
| 67 | +  name: oom-test-vpa |
| 68 | +spec: |
| 69 | +  targetRef: |
| 70 | +    apiVersion: apps/v1 |
| 71 | +    kind: Deployment |
| 72 | +    name: oom-test |
| 73 | +  updatePolicy: |
| 74 | +    updateMode: Auto |
| 75 | +    evictAfterOOMThreshold: "5m" |
| 76 | +  resourcePolicy: |
| 77 | +    containerPolicies: |
| 78 | +    - containerName: "*" |
| 79 | +      oomBumpUpRatio: "1.5" |
| 80 | +      oomMinBumpUp: 104857600 |
| 81 | +      memoryAggregationInterval: "12h" |
| 82 | +      memoryAggregationIntervalCount: 5 |
| 83 | +``` |
| 84 | + |
| 85 | +### Parameter Descriptions |
| 86 | + |
| 87 | +#### Container Policy Parameters |
| 89 | +* `oomBumpUpRatio` (Quantity): |
| 90 | + - Multiplier applied to memory recommendations after OOM events |
| 91 | + - Represented as a Quantity (e.g., "1.5") |
| 92 | + - Must be greater than 1 |
| 93 | + - Controls how aggressively memory is increased after container crashes |
| 94 | + |
| 95 | +* `oomMinBumpUp` (bytes): |
| 96 | + - Minimum absolute memory increase after OOM events |
| 97 | + - Ensures meaningful increases even for small containers |
| 98 | + |
| 99 | +* `memoryAggregationInterval` (duration): |
| 100 | + - Time window for aggregating memory usage data |
| 101 | + - Affects how quickly VPA responds to memory usage changes |
| 102 | + |
| 103 | +* `memoryAggregationIntervalCount` (integer): |
| 104 | + - Number of consecutive memory aggregation intervals |
| 105 | + - Used to calculate the total memory aggregation window length |
| 106 | + - Total window length = memoryAggregationInterval * memoryAggregationIntervalCount |
| 107 | + |
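The arithmetic behind the container policy parameters can be sketched as follows. This is a minimal illustration, not the recommender's code: it assumes that the ratio-based increase and the absolute minimum increase are combined by taking the larger of the two, and the helper names `bumpAfterOOM` and `aggregationWindow` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// bumpAfterOOM returns the memory target after an OOM event.
// Assumption: the larger of the ratio-based and the absolute increase wins.
func bumpAfterOOM(usedBytes int64, ratio float64, minBumpBytes int64) int64 {
	byRatio := int64(float64(usedBytes) * ratio)
	byMin := usedBytes + minBumpBytes
	if byRatio > byMin {
		return byRatio
	}
	return byMin
}

// aggregationWindow returns memoryAggregationInterval * memoryAggregationIntervalCount.
func aggregationWindow(interval time.Duration, count int64) time.Duration {
	return interval * time.Duration(count)
}

func main() {
	const mib = int64(1024 * 1024)
	// Small container: 1.5x of 64 MiB is only 96 MiB, so the 100 MiB minimum bump wins.
	fmt.Println(bumpAfterOOM(64*mib, 1.5, 100*mib) / mib) // 164
	// Large container: 1.5x of 1024 MiB exceeds 1024 MiB + 100 MiB.
	fmt.Println(bumpAfterOOM(1024*mib, 1.5, 100*mib) / mib) // 1536
	// Values from the example manifest: 12h * 5.
	fmt.Println(aggregationWindow(12*time.Hour, 5)) // 60h0m0s
}
```

With the values from the example manifest, a small container is governed by `oomMinBumpUp`, while a large one is governed by `oomBumpUpRatio`.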
| 108 | +#### Update Policy Parameters |
| 109 | +* `evictAfterOOMThreshold` (duration): |
| 110 | + - Time to wait after OOM before considering pod eviction |
| 111 | + - Helps prevent rapid eviction cycles while maintaining stability |
| 112 | + |
| 113 | +Each parameter can be configured independently, falling back to global defaults if not specified. Values should be chosen based on workload characteristics and stability requirements. |
| 114 | + |
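One way to picture the per-parameter fallback is a field-by-field merge in which an unset value defers to the global one. The sketch below is illustrative only: the types `globalConfig` and `containerOverrides`, the helpers `orDefault` and `resolve`, and the numeric values are assumptions made for this example, not part of the VPA code base.

```go
package main

import (
	"fmt"
	"time"
)

// globalConfig stands in for values set through component flags (hypothetical type).
type globalConfig struct {
	OOMBumpUpRatio            float64
	MemoryAggregationInterval time.Duration
}

// containerOverrides holds per-VPA values; nil means "not specified" (hypothetical type).
type containerOverrides struct {
	OOMBumpUpRatio            *float64
	MemoryAggregationInterval *time.Duration
}

// orDefault returns the override when it is set, otherwise the global value.
func orDefault[T any](override *T, global T) T {
	if override != nil {
		return *override
	}
	return global
}

// resolve merges per-VPA overrides onto the global configuration, one field at a time.
func resolve(o containerOverrides, g globalConfig) globalConfig {
	return globalConfig{
		OOMBumpUpRatio:            orDefault(o.OOMBumpUpRatio, g.OOMBumpUpRatio),
		MemoryAggregationInterval: orDefault(o.MemoryAggregationInterval, g.MemoryAggregationInterval),
	}
}

func main() {
	// Illustrative global values only.
	global := globalConfig{OOMBumpUpRatio: 1.2, MemoryAggregationInterval: 24 * time.Hour}
	ratio := 1.5
	// Only the ratio is overridden; the interval falls back to the global value.
	effective := resolve(containerOverrides{OOMBumpUpRatio: &ratio}, global)
	fmt.Println(effective.OOMBumpUpRatio, effective.MemoryAggregationInterval) // 1.5 24h0m0s
}
```

Using pointers for the override fields keeps "not specified" distinct from an explicit zero value.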
| 115 | +## Design Details |
| 116 | + |
| 117 | +### API Changes |
| 118 | + |
| 119 | +#### Phase 1 (Current Proposal) |
| 120 | + |
| 121 | +Extend `ContainerResourcePolicy` with: |
| 122 | +* `oomBumpUpRatio` |
| 123 | +* `oomMinBumpUp` |
| 124 | +* `memoryAggregationInterval` |
| 125 | +* `memoryAggregationIntervalCount` |
| 126 | + |
| 127 | +Extend `PodUpdatePolicy` with: |
| 128 | +* `evictAfterOOMThreshold` |
| 129 | + |
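As a rough sketch of what the Phase 1 additions could look like in Go, the snippet below lists only the new fields. The field types are assumptions: plain strings stand in for the Kubernetes Quantity and duration types so that the example needs nothing beyond the standard library, and the existing fields of both types are omitted. The helper `decodePolicy` is hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ContainerResourcePolicy shows only the proposed additions; existing fields are omitted.
// Assumption: strings stand in for Quantity and duration types.
type ContainerResourcePolicy struct {
	ContainerName                  string  `json:"containerName,omitempty"`
	OOMBumpUpRatio                 *string `json:"oomBumpUpRatio,omitempty"`
	OOMMinBumpUp                   *int64  `json:"oomMinBumpUp,omitempty"`
	MemoryAggregationInterval      *string `json:"memoryAggregationInterval,omitempty"`
	MemoryAggregationIntervalCount *int64  `json:"memoryAggregationIntervalCount,omitempty"`
}

// PodUpdatePolicy shows only the proposed addition; existing fields are omitted.
type PodUpdatePolicy struct {
	EvictAfterOOMThreshold *string `json:"evictAfterOOMThreshold,omitempty"`
}

// decodePolicy parses a JSON-encoded container policy (hypothetical helper).
func decodePolicy(raw string) (ContainerResourcePolicy, error) {
	var p ContainerResourcePolicy
	err := json.Unmarshal([]byte(raw), &p)
	return p, err
}

func main() {
	p, err := decodePolicy(`{"containerName":"*","oomBumpUpRatio":"1.5","oomMinBumpUp":104857600}`)
	if err != nil {
		panic(err)
	}
	// Fields absent from the input stay nil, so "not specified" is distinguishable from zero.
	fmt.Println(*p.OOMBumpUpRatio, *p.OOMMinBumpUp, p.MemoryAggregationInterval == nil) // 1.5 104857600 true
}
```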
| 130 | +#### Future Extensions |
| 131 | + |
| 132 | +This AEP will be updated as additional parameters are identified for per-object configuration. Potential candidates include: |
| 133 | +* `confidenceIntervalCPU` |
| 134 | +* `confidenceIntervalMemory` |
| 135 | +* `recommendationMarginFraction` |
| 136 | +* Other parameters that benefit from workload-specific tuning |
| 137 | + |
| 138 | +### Feature Enablement and Rollback |
| 139 | + |
| 140 | +#### How can this feature be enabled / disabled in a live cluster? |
| 141 | + |
| 142 | +- Feature gate name: `PerVPAConfig` |
| 143 | +- Default: Off (Alpha) |
| 144 | +- Components depending on the feature gate: |
| 145 | + - admission-controller |
| 146 | + - recommender |
| 147 | + - updater |
| 148 | + |
| 149 | +The feature gate will remain in alpha (default off) until: |
| 150 | +- All planned configuration parameters have been implemented and tested |
| 151 | +- Performance impact has been thoroughly evaluated |
| 152 | +- Documentation is complete for all parameters |
| 153 | + |
| 154 | +Disabling the `PerVPAConfig` feature gate has the following effects: |
| 155 | + |
| 156 | +- Any per-VPA configuration parameters specified in VPA objects will be ignored |
| 157 | +- Components will fall back to using their global configuration values |
| 158 | + |
| 159 | +Enabling the `PerVPAConfig` feature gate has the following effects: |
| 160 | + |
| 161 | +- VPA components will honor the per-VPA configuration parameters specified in VPA objects |
| 162 | +- Validation will be performed on the configuration parameters |
| 163 | +- Configuration parameters will override global defaults for the specific VPA object |
| 164 | + |
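The enable/disable behavior described above amounts to a single guard in front of the per-object values. The following is a minimal sketch under stated assumptions: `perVPAOverrides` and `honorOverrides` are hypothetical names, and the gate is modeled as a plain boolean rather than a real feature-gate lookup.

```go
package main

import "fmt"

// perVPAOverrides is a hypothetical holder for per-object parameters.
type perVPAOverrides struct {
	OOMMinBumpUp *int64
}

// honorOverrides returns the overrides a component should use: none while the
// PerVPAConfig gate is disabled, the object's own values once it is enabled.
func honorOverrides(gateEnabled bool, o perVPAOverrides) perVPAOverrides {
	if !gateEnabled {
		return perVPAOverrides{}
	}
	return o
}

func main() {
	bump := int64(104857600)
	o := perVPAOverrides{OOMMinBumpUp: &bump}
	fmt.Println(honorOverrides(false, o).OOMMinBumpUp == nil)  // true
	fmt.Println(honorOverrides(true, o).OOMMinBumpUp == &bump) // true
}
```

Because the stored object is not modified, toggling the gate back on makes the same values effective again.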
| 165 | +### Kubernetes version compatibility |
| 166 | + |
| 167 | +The `PerVPAConfig` feature requires VPA version 1.5.0 or higher. The feature is being introduced as alpha and will follow the standard Kubernetes feature gate graduation process: |
| 168 | +- Alpha: v1.5.0 (default off) |
| 169 | +- Beta: TBD (default on) |
| 170 | +- GA: TBD (default on) |
| 171 | + |
| 172 | +### Validation via CEL and Testing |
| 173 | + |
| 174 | +Initial validation rules (CEL): |
| 175 | +* `oomMinBumpUp` > 0 |
| 176 | +* `memoryAggregationInterval` > 0 |
| 177 | +* `evictAfterOOMThreshold` > 0 |
| 178 | +* `memoryAggregationIntervalCount` > 0 |
| 179 | + |
| 180 | +Validation via Admission Controller: |
| 181 | +Some parameters cannot be validated using Common Expression Language (CEL). These are validated within the admission controller instead. |
| 182 | + |
| 183 | +* `oomBumpUpRatio`: parsed as a Kubernetes Quantity. The value must be greater than 1. |
| 184 | + |
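The admission-side check for `oomBumpUpRatio` could look roughly like the sketch below. It is an approximation under stated assumptions: `strconv.ParseFloat` stands in for Kubernetes Quantity parsing so the example stays within the standard library, and the function name `validateOOMBumpUpRatio` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"strconv"
)

// validateOOMBumpUpRatio checks that the string form of the ratio is a finite
// number strictly greater than 1.
// Assumption: ParseFloat is a stand-in for Quantity parsing.
func validateOOMBumpUpRatio(s string) error {
	ratio, err := strconv.ParseFloat(s, 64)
	if err != nil {
		return fmt.Errorf("oomBumpUpRatio %q is not a number: %w", s, err)
	}
	// The negated comparison also rejects NaN, which compares false to everything.
	if !(ratio > 1) || math.IsInf(ratio, 0) {
		return errors.New("oomBumpUpRatio must be a finite value greater than 1")
	}
	return nil
}

func main() {
	for _, v := range []string{"1.5", "1", "abc"} {
		fmt.Printf("%q -> %v\n", v, validateOOMBumpUpRatio(v))
	}
}
```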
| 185 | +Additional validation rules will be added as new parameters are introduced. |
| 186 | +E2E tests will be included to verify: |
| 187 | +* Different configurations are properly applied and respected by VPA components |
| 188 | +* VPA behavior matches expected outcomes for different parameter combinations |
| 189 | +* Proper fallback to global configuration when parameters are not specified |
| 190 | + |
| 191 | +### Test Plan |
| 192 | + |
| 193 | +- Unit tests for new API fields and validation |
| 194 | +- Integration tests verifying different configurations are properly applied |
| 195 | +- E2E tests comparing behavior with different configurations |
| 196 | +- Upgrade tests ensuring backward compatibility |
| 197 | + |
| 198 | +## Implementation History |
| 199 | + |
| 200 | +- 2025-04-12: Initial proposal |
| 201 | +- Future: Additional parameters will be added based on user feedback and requirements |
| 202 | + |
| 203 | +## Future Work |
| 204 | + |
| 205 | +This enhancement is designed to be extensible. As the VPA evolves and users provide feedback, additional parameters may be added to the per-object configuration. Each new parameter will: |
| 206 | +1. Be documented in this AEP |
| 207 | +2. Include appropriate validation rules |
| 208 | +3. Maintain backward compatibility |
| 209 | +4. Follow the same pattern of falling back to global configuration when not specified |
| 210 | + |
| 211 | +The decision to add new parameters will be based on: |
| 212 | +- User feedback and requirements |
| 213 | +- Performance impact analysis |
| 214 | +- Implementation complexity |
| 215 | +- Maintenance considerations |
| 216 | + |
| 217 | +## Alternatives |
| 218 | + |
| 219 | +### Multiple VPA Deployments |
| 220 | + |
| 221 | +Continue with the current approach of running multiple VPA deployments with different configurations: |
| 222 | +- Pros: No API changes needed |
| 223 | +- Cons: Higher resource usage, operational complexity |
| 224 | + |
| 225 | +### Environment-Specific Configuration |
| 226 | + |
| 227 | +Use different VPA deployments per environment (dev/staging/prod): |
| 228 | +- Pros: Simpler than per-workload configuration |
| 229 | +- Cons: Less flexible, doesn't address varying needs within same environment |