Address Multi-Cluster Health Check Configuration Inconsistency #789

ghost · 2025-08-08T18:15:31Z

What type of PR is this?
Bug

Which issue does this PR fix:
N/A - Fixes multi-cluster health check configuration inconsistency where TargetGroupPolicy health check settings are not properly synchronized across all clusters in a multi-cluster deployment.

What does this PR do / Why do we need it:
This PR addresses a critical issue in multi-cluster deployments where TargetGroupPolicy health check configurations are only applied to the cluster containing the HTTPRoute or GRPCRoute resource, while other clusters default to basic HTTP/1 health checks with default settings (URL prefix "/" and standard parameters).

The fix enhances the target group synthesis process to ensure that health check configurations from TargetGroupPolicy resources are consistently applied across all clusters that participate in the multi-cluster service mesh, regardless of where the route resources are deployed.

Key changes:

Enhanced target group manager to resolve TargetGroupPolicy for ServiceExport target groups
Extended policy helper to support service-based policy resolution in addition to ServiceExport resolution
Updated target group synthesizer to use policy-derived health check configuration
Added ServiceExport controller watching for TargetGroupPolicy changes
Implemented health check configuration resolution logic with proper fallback
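
The fallback behaviour in the last item can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the `HealthCheckConfig` and `TargetGroupPolicy` types and the `ResolveHealthCheck` function are hypothetical stand-ins, and the only defaults assumed are the ones stated in this description (HTTP/1 against "/").

```go
package main

import "fmt"

// HealthCheckConfig is a simplified, hypothetical stand-in for the health
// check settings a TargetGroupPolicy can carry.
type HealthCheckConfig struct {
	Path            string
	Port            int // left at zero when unset; the real default is not stated in this PR
	Protocol        string
	ProtocolVersion string
}

// TargetGroupPolicy is a hypothetical, trimmed-down policy type.
type TargetGroupPolicy struct {
	Name        string
	HealthCheck *HealthCheckConfig // nil when the policy sets no health check
}

// defaultHealthCheck mirrors the fallback described in the PR text:
// a basic HTTP/1 health check against "/".
func defaultHealthCheck() HealthCheckConfig {
	return HealthCheckConfig{Path: "/", Protocol: "HTTP", ProtocolVersion: "HTTP1"}
}

// ResolveHealthCheck returns the policy-derived configuration when one is
// available and falls back to the defaults otherwise.
func ResolveHealthCheck(policy *TargetGroupPolicy) HealthCheckConfig {
	if policy == nil || policy.HealthCheck == nil {
		return defaultHealthCheck()
	}
	return *policy.HealthCheck
}

func main() {
	custom := &TargetGroupPolicy{
		Name: "example-policy",
		HealthCheck: &HealthCheckConfig{
			Path: "/health", Port: 8080, Protocol: "HTTP", ProtocolVersion: "HTTP2",
		},
	}
	fmt.Printf("%+v\n", ResolveHealthCheck(custom)) // policy-derived settings
	fmt.Printf("%+v\n", ResolveHealthCheck(nil))    // fallback defaults
}
```

Returning a value rather than a pointer keeps callers from mutating the policy's own configuration by accident.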

If an issue # is not available please add repro steps and logs from aws-gateway-controller showing the issue:
Repro steps:

Deploy a 3-cluster setup with HTTPRoute in cluster A and ServiceExports in clusters B and C
Apply a TargetGroupPolicy with custom health check configuration (e.g., custom path, HTTP/2 protocol)
Observe that only cluster A's target group receives the custom health check configuration
Observe that the target groups in clusters B and C use default HTTP/1 health checks with the "/" path

Expected: All clusters should use the same TargetGroupPolicy health check configuration
Actual: Only the route cluster receives the correct configuration
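
The expected-versus-actual check above could be expressed as a small helper, sketched here under assumptions: the `HealthCheck` type and `InconsistentClusters` function are hypothetical, and the observed values are hard-coded to match the repro rather than read from any real cluster.

```go
package main

import (
	"fmt"
	"sort"
)

// HealthCheck is a hypothetical, minimal view of a target group's health check.
type HealthCheck struct {
	Path            string
	ProtocolVersion string
}

// InconsistentClusters returns the sorted names of clusters whose observed
// health check differs from the expected, policy-derived one.
func InconsistentClusters(expected HealthCheck, observed map[string]HealthCheck) []string {
	var out []string
	for cluster, hc := range observed {
		if hc != expected {
			out = append(out, cluster)
		}
	}
	sort.Strings(out) // map iteration order is random, so sort for stable output
	return out
}

func main() {
	expected := HealthCheck{Path: "/health", ProtocolVersion: "HTTP2"}
	// Values matching the repro: only cluster A picked up the policy.
	observed := map[string]HealthCheck{
		"A": expected,
		"B": {Path: "/", ProtocolVersion: "HTTP1"},
		"C": {Path: "/", ProtocolVersion: "HTTP1"},
	}
	fmt.Println(InconsistentClusters(expected, observed)) // prints [B C]
}
```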

Testing done on this change:

Unit tests added for policy resolution logic, target group manager enhancements, and health check configuration resolution
Integration tests added for TargetGroupPolicy application to ServiceExport target groups, policy conflict resolution, and fallback behavior
End-to-end tests added for health check configuration consistency across clusters
Backwards compatibility testing to ensure existing deployments continue to work unchanged
Manual testing in 3-cluster environment with various TargetGroupPolicy configurations

Automation added to e2e:
Yes - Added comprehensive end-to-end tests that verify:

ServiceExport target groups receive correct TargetGroupPolicy health check configuration
Custom health check paths, protocols, and parameters are applied correctly
Policy changes update existing target group configurations
Backwards compatibility with deployments without policies

Will this PR introduce any new dependencies?:
No - This PR only enhances existing components and leverages existing policy resolution mechanisms.

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No breaking changes. The enhancement is fully backwards compatible:

Existing target groups continue to function with current health check configurations
When no TargetGroupPolicy is present, target groups use the same default configuration as before
Existing TargetGroupPolicy resources continue to work exactly as before
No existing API contracts or resource specifications are changed

Upgrade testing confirmed existing deployments work unchanged after controller upgrade.

Does this PR introduce any user-facing change?:
No user-facing API changes. The enhancement is transparent to users - it automatically synchronizes TargetGroupPolicy health check configurations across clusters without requiring any changes to existing resources or workflows.

Fix multi-cluster health check configuration inconsistency where TargetGroupPolicy health check settings were not properly synchronized across all clusters in multi-cluster deployments. ServiceExport target groups now correctly inherit health check configuration from applicable TargetGroupPolicy resources instead of defaulting to basic HTTP/1 health checks.

mikestvz

Excellent fixes and test improvements.

Makefile

docs/api-types/target-group-policy.md

mikestvz · 2025-08-12T02:08:19Z

docs/api-types/target-group-policy.md

+        path: "/health"
+        port: 8080
+        protocol: HTTP
+        protocolVersion: HTTP1


Is it the default behaviour to have a service on HTTP2 with a health check on HTTP1?

Good catch, I'll set this to HTTP2 instead.

mikestvz · 2025-08-12T02:10:35Z

docs/guides/advanced-configurations.md


+### Multi-Cluster Health Check Configuration
+
+In multi-cluster deployments, you can ensure consistent health check configuration across all clusters by applying TargetGroupPolicy to ServiceExport resources. This eliminates the previous limitation where only the cluster containing the route resource would receive the correct health check configuration.


Do we need to call out the previous limitation? Unless it was documented somewhere else, I don't think it is necessary.

pkg/deploy/lattice/health_check_resolver.go

mikestvz · 2025-08-12T02:17:34Z

pkg/deploy/lattice/target_group_manager.go

-	cloud pkg_aws.Cloud
+	log    gwlog.Logger
+	cloud  pkg_aws.Cloud
+	client client.Client


This naming is confusing. Is there another name we could use?

Yup good idea, I improved this now.

mikestvz · 2025-08-12T02:20:01Z

pkg/deploy/lattice/target_group_synthesizer.go


 		prefix := model.TgNamePrefix(resTargetGroup.Spec)

+		// Resolve health check configuration from TargetGroupPolicy using centralized resolver


For my education: this is happening in both the TG manager and the TG synthesizer. Can you elaborate on the difference?

Manager

This is the low-level AWS API wrapper that handles direct interactions with AWS VPC Lattice:

CRUD operations: Create, update, delete target groups via AWS Lattice API

Resource discovery: List and find existing target groups in AWS

Validation: Check if target groups match expected specifications

Health check management: Configure and update health check settings

Target management: Register/deregister targets from target groups

Synthesizer

This is the orchestration layer that manages the lifecycle and reconciliation logic:

Reconciliation: Ensures desired state matches actual state

Garbage collection: Identifies and removes unused/orphaned target groups

Policy integration: Applies TargetGroupPolicy configurations

Stack management: Works with the controller's resource stack model

Cleanup logic: Determines which target groups are safe to delete

The Relationship

The synthesizer uses the manager - it's a layered architecture:

TargetGroupSynthesizer (orchestration/business logic)
    ↓ calls
TargetGroupManager (AWS API operations)
    ↓ calls
AWS VPC Lattice APIs

When you create a ServiceExport or Route, the synthesizer determines what target groups are needed

The synthesizer calls the manager to actually create/update those target groups in AWS

The synthesizer also handles cleanup - finding unused target groups and telling the manager to delete them
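
To make the layering concrete, here is a rough sketch of the relationship. It is an illustration only: the `TargetGroupManager` interface shape, the `fakeManager` in-memory implementation, and the `Reconcile` method are all hypothetical and much simpler than the real types in `pkg/deploy/lattice`. In particular, the real cleanup logic decides which target groups are safe to delete, which this sketch does not attempt.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// TargetGroupSpec is a hypothetical, minimal desired-state description.
type TargetGroupSpec struct {
	Name            string
	HealthCheckPath string
}

// TargetGroupManager is the low-level layer: one method per API operation.
type TargetGroupManager interface {
	Upsert(spec TargetGroupSpec) error
	Delete(name string) error
	List() []string
}

// fakeManager is an in-memory stand-in for the AWS-backed manager.
type fakeManager struct {
	groups map[string]TargetGroupSpec
}

func (m *fakeManager) Upsert(spec TargetGroupSpec) error {
	if spec.Name == "" {
		return errors.New("target group name is required")
	}
	m.groups[spec.Name] = spec
	return nil
}

func (m *fakeManager) Delete(name string) error {
	delete(m.groups, name)
	return nil
}

func (m *fakeManager) List() []string {
	names := make([]string, 0, len(m.groups))
	for name := range m.groups {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// Synthesizer is the orchestration layer: it decides what should exist and
// delegates every actual change to the manager.
type Synthesizer struct {
	manager TargetGroupManager
}

// Reconcile creates or updates every desired target group, then deletes any
// group the manager knows about that is no longer desired.
func (s *Synthesizer) Reconcile(desired []TargetGroupSpec) error {
	wanted := make(map[string]bool, len(desired))
	for _, spec := range desired {
		if err := s.manager.Upsert(spec); err != nil {
			return fmt.Errorf("upsert %q: %w", spec.Name, err)
		}
		wanted[spec.Name] = true
	}
	for _, name := range s.manager.List() {
		if !wanted[name] {
			if err := s.manager.Delete(name); err != nil {
				return fmt.Errorf("delete %q: %w", name, err)
			}
		}
	}
	return nil
}

func main() {
	m := &fakeManager{groups: map[string]TargetGroupSpec{"stale": {Name: "stale"}}}
	s := &Synthesizer{manager: m}
	desired := []TargetGroupSpec{
		{Name: "tg-a", HealthCheckPath: "/health"},
		{Name: "tg-b", HealthCheckPath: "/health"},
	}
	if err := s.Reconcile(desired); err != nil {
		fmt.Println("reconcile failed:", err)
		return
	}
	fmt.Println(m.List()) // prints [tg-a tg-b]
}
```

Keeping the manager behind an interface is what lets the synthesizer's reconciliation logic be unit tested without touching AWS.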

test/pkg/test/framework.go

test/suites/integration/access_log_policy_test.go

test/suites/integration/httproute_header_match_test.go

mikestvz

LGTM

Ryan Lymburner added 10 commits August 5, 2025 14:09

Task 1: Enhance target group manager to resolve TargetGroupPolicy for…

171aa55

… ServiceExport target groups

Task 2: Enhance policy helper to support service-based policy resolution

92ff5c8

Task 3: Update target group synthesizer to use policy-derived health …

afcd1a5

…check configuration

Task 4: Implement health check configuration resolution logic

0a29e1c

Task 5: Update ServiceExport controller to watch TargetGroupPolicy ch…

42944cf

…anges

Task 6: Add unit tests for policy resolution logic

b6151e3

Task 7: Add integration tests for TargetGroupPolicy application to Se…

84461af

…rviceExport target groups

Task 8: Add end-to-end tests for health check configuration consistency

106085f

Task 9: Update documentation to reflect multi-cluster health check ch…

15b0dd0

…anges

Task 10: Address propagation delay in tests

209aa8e

ghost requested a review from mikestvz August 8, 2025 18:15

ghost self-assigned this Aug 8, 2025

ghost added the bug Something isn't working label Aug 8, 2025

Merge branch 'main' into grpc_serviceexport_health

7f08fd9

ghost enabled auto-merge August 8, 2025 18:16

mikestvz reviewed Aug 12, 2025

Ryan Lymburner and others added 2 commits August 12, 2025 10:22

Address PR comments.

1a76bfb

Merge branch 'main' into grpc_serviceexport_health

b73cf68

mikestvz approved these changes Aug 12, 2025

ghost added this pull request to the merge queue Aug 12, 2025

Merged via the queue into aws:main with commit 528d716 Aug 12, 2025
2 of 3 checks passed

This was referenced Aug 14, 2025

Release v1.1.4 #791

Closed

Release v1.1.4 #792

Merged

ghost deleted the grpc_serviceexport_health branch August 19, 2025 22:36

This pull request was closed.

