Address Multi-Cluster Health Check Configuration Inconsistency #789
Conversation
… ServiceExport target groups
…check configuration
…rviceExport target groups
mikestvz
left a comment
Excellent fixes and test improvements.
path: "/health"
port: 8080
protocol: HTTP
protocolVersion: HTTP1
Is it the default behaviour to have the service on HTTP2 but the health check on HTTP1?
Good catch, I'll set this to HTTP2 instead.
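For reference, a minimal sketch of the corrected snippet, assuming the four fields above sit under a TargetGroupPolicy healthCheck block (only protocolVersion changes):

healthCheck:
  path: "/health"
  port: 8080
  protocol: HTTP
  protocolVersion: HTTP2  # matches the service's protocol version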
### Multi-Cluster Health Check Configuration

In multi-cluster deployments, you can ensure consistent health check configuration across all clusters by applying TargetGroupPolicy to ServiceExport resources. This eliminates the previous limitation where only the cluster containing the route resource would receive the correct health check configuration.
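As an illustration of the pattern described above, a hedged sketch (resource names are hypothetical; the group/version follows the controller's TargetGroupPolicy CRD as I understand it):

apiVersion: application-networking.k8s.aws/v1alpha1
kind: TargetGroupPolicy
metadata:
  name: checkout-health    # hypothetical name
spec:
  targetRef:
    group: application-networking.k8s.aws
    kind: ServiceExport    # attach to the export, not the route's Service
    name: checkout         # hypothetical ServiceExport name
  healthCheck:
    enabled: true
    path: "/health"
    port: 8080
    protocol: HTTP
    protocolVersion: HTTP2

Applying the same policy alongside the ServiceExport in each cluster gives every cluster's target group identical health check settings.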
Do we need to call out the previous limitation? Unless it was documented somewhere else, I don't think it is necessary to mention it.
cloud  pkg_aws.Cloud
log    gwlog.Logger
cloud  pkg_aws.Cloud
client client.Client
This naming is confusing. Is there another name we could give it?
Yup, good idea. I've improved this now.
prefix := model.TgNamePrefix(resTargetGroup.Spec)

// Resolve health check configuration from TargetGroupPolicy using centralized resolver
For my education: this is happening in the TG_manager and also in the TG_synthesizer. Can you elaborate on what the difference is?
Manager
This is the low-level AWS API wrapper that handles direct interactions with AWS VPC Lattice:
- CRUD operations: Create, update, delete target groups via AWS Lattice API
- Resource discovery: List and find existing target groups in AWS
- Validation: Check if target groups match expected specifications
- Health check management: Configure and update health check settings
- Target management: Register/deregister targets from target groups
Synthesizer
This is the orchestration layer that manages the lifecycle and reconciliation logic:
- Reconciliation: Ensures desired state matches actual state
- Garbage collection: Identifies and removes unused/orphaned target groups
- Policy integration: Applies TargetGroupPolicy configurations
- Stack management: Works with the controller's resource stack model
- Cleanup logic: Determines which target groups are safe to delete
The Relationship
The synthesizer uses the manager - it's a layered architecture:
TargetGroupSynthesizer (orchestration/business logic)
↓ calls
TargetGroupManager (AWS API operations)
↓ calls
AWS VPC Lattice APIs
- When you create a ServiceExport or Route, the synthesizer determines what target groups are needed
- The synthesizer calls the manager to actually create/update those target groups in AWS
- The synthesizer also handles cleanup - finding unused target groups and telling the manager to delete them
mikestvz
left a comment
LGTM
What type of PR is this?
Bug
Which issue does this PR fix:
N/A - Fixes multi-cluster health check configuration inconsistency where TargetGroupPolicy health check settings are not properly synchronized across all clusters in a multi-cluster deployment.
What does this PR do / Why do we need it:
This PR addresses a critical issue in multi-cluster deployments where TargetGroupPolicy health check configurations are only applied to the cluster containing the HTTPRoute or GRPCRoute resource, while other clusters default to basic HTTP/1 health checks with default settings (URL prefix "/" and standard parameters).
The fix enhances the target group synthesis process to ensure that health check configurations from TargetGroupPolicy resources are consistently applied across all clusters that participate in the multi-cluster service mesh, regardless of where the route resources are deployed.
Key changes:
If an issue # is not available, please add repro steps and logs from aws-gateway-controller showing the issue:
Repro steps:
Expected: All clusters should use the same TargetGroupPolicy health check configuration
Actual: Only the route cluster receives the correct configuration
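A hypothetical minimal repro sketch (the service name and the federation annotation are assumptions based on the controller's ServiceExport API):

# Applied in BOTH clusters: export the same Service to VPC Lattice.
apiVersion: application-networking.k8s.aws/v1alpha1
kind: ServiceExport
metadata:
  name: checkout
  annotations:
    application-networking.k8s.aws/federation: "amazon-vpc-lattice"

# Cluster A additionally has the HTTPRoute and a TargetGroupPolicy with
# custom health check settings. Before this fix, only cluster A's target
# group picked up the policy; cluster B's target group fell back to
# default HTTP1 health checks on path "/".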
Testing done on this change:
Automation added to e2e:
Yes - Added comprehensive end-to-end tests that verify:
Will this PR introduce any new dependencies?:
No - This PR only enhances existing components and leverages existing policy resolution mechanisms.
Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No breaking changes. The enhancement is fully backwards compatible:
Upgrade testing confirmed existing deployments work unchanged after controller upgrade.
Does this PR introduce any user-facing change?:
No user-facing API changes. The enhancement is transparent to users - it automatically synchronizes TargetGroupPolicy health check configurations across clusters without requiring any changes to existing resources or workflows.