
Conversation


@ntnn ntnn commented Dec 19, 2025

Summary

Add an initial implementation of a kcp-native garbage collector.

It is only enabled for the e2e tests for now; e2e-shared and e2e-sharded produced some odd errors.
Enabling it for e2e is done in a separate commit to ease bisecting.

There are still blind spots and unhandled cases (e.g. blocking owner deletion), but these need tests added first.
The PR is also already relatively sizeable, and I'd rather iterate on it going forward instead of making one massive PR.

Cross-cluster ownership is technically possible, but needs adjustments in the kube fork to allow owner references to include the cluster name.
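For illustration, a minimal sketch of what such an extended owner reference might look like; the field name and JSON tag here are assumptions, not the actual shape of the fork:

```go
package sketch

import (
	"github.com/kcp-dev/logicalcluster/v3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ClusterAwareOwnerReference is a hypothetical extension of
// metav1.OwnerReference; upstream owner references carry no cluster
// information, so the kube fork would need roughly one extra field.
type ClusterAwareOwnerReference struct {
	metav1.OwnerReference

	// ClusterName is the logical cluster the owner lives in. An
	// empty value would mean the owner is in the same logical
	// cluster as the dependent. Assumed name, not the fork's.
	ClusterName logicalcluster.Name `json:"clusterName,omitempty"`
}
```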

What Type of PR Is This?

/kind feature

Related Issue(s)

Fixes #

Release Notes

Not adding a release note yet; I'd leave that for when the GC is ready to be tested in production-like environments.

NONE

Signed-off-by: Nelo-T. Wallus <red.brush9525@fastmail.com>
Signed-off-by: Nelo-T. Wallus <n.wallus@sap.com>
@kcp-ci-bot kcp-ci-bot added the release-note-none, kind/feature, and dco-signoff: yes labels Dec 19, 2025
@kcp-ci-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ntnn. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kcp-ci-bot kcp-ci-bot added the size/XXL label Dec 19, 2025
ntnn added 5 commits December 19, 2025 12:53
@ntnn
Member Author

ntnn commented Dec 19, 2025

/test pull-kcp-test-integration

1 similar comment

)

func (gc *GarbageCollector) registerHandlers(ctx context.Context) func() {
// TODO(ntnn): Handle sharding? Could add a filter on all watches to
Contributor

Do you think this even matters? If we are dealing with a local shard informer, it will not even see logical clusters from other shards, will it?

Member Author

True; I was originally thinking of watching all logical clusters and wrote the comment based on that. If we only need cross-cluster ownership for resources in the cache, that is moot.
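For reference, a sketch of what such a watch filter could look like; cache.FilteringResourceEventHandler is real client-go API, but the annotation key and helper are made up for illustration:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// filterToShard wraps an event handler so it only sees objects from
// the given shard. With per-shard informers this is unnecessary, since
// a local informer never sees other shards' objects.
func filterToShard(shardName string, handler cache.ResourceEventHandler) cache.ResourceEventHandler {
	return cache.FilteringResourceEventHandler{
		FilterFunc: func(obj interface{}) bool {
			o, ok := obj.(metav1.Object)
			if !ok {
				return false
			}
			// Hypothetical annotation key, made up for this sketch.
			return o.GetAnnotations()["kcp.io/shard"] == shardName
		},
		Handler: handler,
	}
}
```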

@kcp-ci-bot
Contributor

@ntnn: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                        Commit   Required  Rerun command
pull-kcp-test-e2e                bf6e24b  true      /test pull-kcp-test-e2e
pull-kcp-test-e2e-multiple-runs  bf6e24b  true      /test pull-kcp-test-e2e-multiple-runs

Full PR test history


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

@mjudeikis mjudeikis left a comment

did quick read

"github.com/kcp-dev/kcp/pkg/reconciler/garbagecollector/syncmap"
)

type Graph struct {
Contributor

How much of this looks like upstream and how much is our custom stuff? It would be nice to have a mini README on top explaining how the data is stored, and if some of this is borrowed from upstream, links to the original source.

Member Author

It's pretty much all custom stuff; I originally wanted to reuse code from upstream, but the upstream code is very tightly knit. My initial attempt was to start a graph builder per logical cluster with the same queues, feeding into a single garbage collector modified for cluster-awareness, but that needed so many adjustments that it would have been a nightmare to maintain.
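To make the cluster-awareness concrete, a rough sketch of the key structural difference from upstream; the type names are illustrative, not the PR's actual code:

```go
package sketch

import (
	"github.com/kcp-dev/logicalcluster/v3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// Upstream's graph keys nodes by UID alone, which is fine inside a
// single cluster. A kcp-native graph serves many logical clusters at
// once, so every node has to be qualified with its cluster.
type nodeKey struct {
	Cluster logicalcluster.Name
	UID     types.UID
}

// node records the ownership relations of one object.
type node struct {
	owners     []metav1.OwnerReference // references this object declares
	dependents []nodeKey               // objects declaring this one as owner
}
```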

Member Author

I'll add some more docs

@gman0
Contributor

gman0 commented Dec 29, 2025

There are a couple of gc.log.* lines that are a bit too verbose, printing full (partial metadata) objects. Debugging leftovers?
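One common pattern for keeping such dumps available without the default noise, assuming gc.log is a logr.Logger, is to gate them behind a verbosity level:

```go
// Only emit the full object dump when running with a high verbosity,
// e.g. -v=4 or above.
gc.log.V(4).Info("processing object", "object", obj)
```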

@gman0
Contributor

gman0 commented Dec 29, 2025

Also, I'm getting pretty consistent failures with:

./bin/kcp start --feature-gates=WorkspaceMounts=true,CacheAPIs=true,WorkspaceAuthentication=true,KcpNativeGarbageCollector=true

and

go test -p 1 -v -run '^TestGarbageCollectorNormalCRDs$' ./test/e2e/garbagecollector/... --kcp-kubeconfig=$(pwd)/.kcp/admin.kubeconfig

The first run succeeds (probably always); the second run (probably always) fails.

// The graph add/update events are not queued into their own worker
// queue because updating the graph cannot fail and is reasonably
// fast.
handlers := cache.ResourceEventHandlerFuncs{
Contributor

We're inserting all objects in a shard into the GC graph. Can they be filtered and inserted only if they actually contain owner references?

Member Author

Well, they're references, not the objects themselves. I was thinking of making it so that a node in the graph only exists when the respective object owns other objects, but decided to leave that for later because I don't think it makes much of a difference in the initial implementation.

Memory-wise, if an object doesn't own any other objects, its node in the graph will just be an empty slice.

Performance-wise it might be better, because the underlying sync.Map performs best with few writes and many reads, so not writing empty nodes would avoid busying it when the information isn't needed. But I'd prefer to benchmark that to see how much of an impact it has, and also to compare the hashtriemap against other potential implementations (like a simple map with a lock :D), though I strongly suspect the hashtriemap will be the best option.
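A minimal sketch of the filtering idea under discussion, assuming handlers receive metav1.Object values; the helper name is made up, not the PR's actual code:

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// needsNode reports whether an object has to be inserted into the
// graph eagerly: only objects declaring owner references do. An object
// that merely *owns* others still gets a node lazily, created the
// moment a dependent names it in an owner reference.
func needsNode(o metav1.Object) bool {
	return len(o.GetOwnerReferences()) > 0
}
```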

Contributor

@gman0 gman0 Dec 30, 2025

a node in the graph only exists when the respective object owns other objects

Yes, that's what I meant (or the inverse of that) :D but fair enough, this can be left for a follow-up PR.

For the rest, I agree that benchmarks are needed. Atomics have their costs too, so it would be interesting to see what the difference is.

Contributor

@gman0 gman0 Dec 30, 2025

+ a benchmark of this GC versus the stock one.
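As a starting point for those measurements, a rough Go benchmark sketch comparing sync.Map against a mutex-guarded map under a read-heavy parallel load; comparing against the stock GC end to end would need a separate harness:

```go
package syncmap_test

import (
	"sync"
	"testing"
)

// Read-heavy access is where sync.Map is expected to shine; adding
// write-heavy variants would make the comparison more interesting.
func BenchmarkSyncMapLoad(b *testing.B) {
	var m sync.Map
	m.Store("key", 1)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			m.Load("key")
		}
	})
}

func BenchmarkRWMutexMapLoad(b *testing.B) {
	m := map[string]int{"key": 1}
	var mu sync.RWMutex
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.RLock()
			_ = m["key"]
			mu.RUnlock()
		}
	})
}
```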

@xrstf
Contributor

xrstf commented Jan 5, 2026

Can you link to some design doc or tickets to give this more context? Or, with a broad brush, describe the problems this is solving, the goals, and the assumptions you made?

