Skip to content

Conversation

@ghost
Copy link

@ghost ghost commented Aug 8, 2025

What type of PR is this?
Bug

Which issue does this PR fix:
N/A - Fixes multi-cluster health check configuration inconsistency where TargetGroupPolicy health check settings are not properly synchronized across all clusters in a multi-cluster deployment.

What does this PR do / Why do we need it:
This PR addresses a critical issue in multi-cluster deployments where TargetGroupPolicy health check configurations are only applied to the cluster containing the HTTPRoute or GRPCRoute resource, while other clusters default to basic HTTP/1 health checks with default settings (URL prefix "/" and standard parameters).

The fix enhances the target group synthesis process to ensure that health check configurations from TargetGroupPolicy resources are consistently applied across all clusters that participate in the multi-cluster service mesh, regardless of where the route resources are deployed.

Key changes:

  • Enhanced target group manager to resolve TargetGroupPolicy for ServiceExport target groups
  • Extended policy helper to support service-based policy resolution in addition to ServiceExport resolution
  • Updated target group synthesizer to use policy-derived health check configuration
  • Added ServiceExport controller watching for TargetGroupPolicy changes
  • Implemented health check configuration resolution logic with proper fallback

If an issue # is not available please add repro steps and logs from aws-gateway-controller showing the issue:
Repro steps:

  1. Deploy a 3-cluster setup with HTTPRoute in cluster A and ServiceExports in clusters B and C
  2. Apply a TargetGroupPolicy with custom health check configuration (e.g., custom path, HTTP/2 protocol)
  3. Observe that only cluster A's target group receives the custom health check configuration
  4. Clusters B and C target groups use default HTTP/1 health checks with "/" path

Expected: All clusters should use the same TargetGroupPolicy health check configuration
Actual: Only the route cluster receives the correct configuration

Testing done on this change:

  • Unit tests added for policy resolution logic, target group manager enhancements, and health check configuration resolution
  • Integration tests added for TargetGroupPolicy application to ServiceExport target groups, policy conflict resolution, and fallback behavior
  • End-to-end tests added for health check configuration consistency across clusters
  • Backwards compatibility testing to ensure existing deployments continue to work unchanged
  • Manual testing in 3-cluster environment with various TargetGroupPolicy configurations

Automation added to e2e:
Yes - Added comprehensive end-to-end tests that verify:

  • ServiceExport target groups receive correct TargetGroupPolicy health check configuration
  • Custom health check paths, protocols, and parameters are applied correctly
  • Policy changes update existing target group configurations
  • Backwards compatibility with deployments without policies

Will this PR introduce any new dependencies?:
No - This PR only enhances existing components and leverages existing policy resolution mechanisms.

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No breaking changes. The enhancement is fully backwards compatible:

  • Existing target groups continue to function with current health check configurations
  • When no TargetGroupPolicy is present, target groups use the same default configuration as before
  • Existing TargetGroupPolicy resources continue to work exactly as before
  • No existing API contracts or resource specifications are changed

Upgrade testing confirmed existing deployments work unchanged after controller upgrade.

Does this PR introduce any user-facing change?:
No user-facing API changes. The enhancement is transparent to users - it automatically synchronizes TargetGroupPolicy health check configurations across clusters without requiring any changes to existing resources or workflows.

Fix multi-cluster health check configuration inconsistency where TargetGroupPolicy health check settings were not properly synchronized across all clusters in multi-cluster deployments. ServiceExport target groups now correctly inherit health check configuration from applicable TargetGroupPolicy resources instead of defaulting to basic HTTP/1 health checks.

@ghost ghost requested a review from mikestvz August 8, 2025 18:15
@ghost ghost self-assigned this Aug 8, 2025
@ghost ghost added the bug Something isn't working label Aug 8, 2025
@ghost ghost enabled auto-merge August 8, 2025 18:16
Copy link
Contributor

@mikestvz mikestvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent fixes and test improvements.

path: "/health"
port: 8080
protocol: HTTP
protocolVersion: HTTP1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it default behaviour having service with HTTP2 and health check with http1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I'll set this to HTTP2 instead.


### Multi-Cluster Health Check Configuration

In multi-cluster deployments, you can ensure consistent health check configuration across all clusters by applying TargetGroupPolicy to ServiceExport resources. This eliminates the previous limitation where only the cluster containing the route resource would receive the correct health check configuration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to call the previous limitation? unless it was documented somewhere else, I don't think it is necessary to call it out.

cloud pkg_aws.Cloud
log gwlog.Logger
cloud pkg_aws.Cloud
client client.Client
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this naming is confusing. Any other name we could give ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup good idea, I improved this now.


prefix := model.TgNamePrefix(resTargetGroup.Spec)

// Resolve health check configuration from TargetGroupPolicy using centralized resolver
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to my education this is happening in the TG_ manager and also in the TG_syntheziser, can you elaborate what is the difference?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manager

This is the low-level AWS API wrapper that handles direct interactions with AWS VPC Lattice:

  • CRUD operations: Create, update, delete target groups via AWS Lattice API
  • Resource discovery: List and find existing target groups in AWS
  • Validation: Check if target groups match expected specifications
  • Health check management: Configure and update health check settings
  • Target management: Register/deregister targets from target groups

Synthesizer

This is the orchestration layer that manages the lifecycle and reconciliation logic:

  • Reconciliation: Ensures desired state matches actual state
  • Garbage collection: Identifies and removes unused/orphaned target groups
  • Policy integration: Applies TargetGroupPolicy configurations
  • Stack management: Works with the controller's resource stack model
  • Cleanup logic: Determines which target groups are safe to delete

The Relationship

The synthesizer uses the manager - it's a layered architecture:

TargetGroupSynthesizer (orchestration/business logic)
        ↓ calls
TargetGroupManager (AWS API operations)
        ↓ calls  
AWS VPC Lattice APIs
  • When you create a ServiceExport or Route, the synthesizer determines what target groups are needed
  • The synthesizer calls the manager to actually create/update those target groups in AWS
  • The synthesizer also handles cleanup - finding unused target groups and telling the manager to delete them

Copy link
Contributor

@mikestvz mikestvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ghost ghost added this pull request to the merge queue Aug 12, 2025
Merged via the queue into aws:main with commit 528d716 Aug 12, 2025
2 of 3 checks passed
This was referenced Aug 14, 2025
@ghost ghost deleted the grpc_serviceexport_health branch August 19, 2025 22:36
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant