
[WIP] Implement zero-downtime NRI plugin upgrades #281

Draft

chrishenzie wants to merge 1 commit into containerd:main from chrishenzie:nri-zero-downtime-upgrade

Conversation

@chrishenzie
Member

Track duplicate NRI plugin connections during updates to prevent dropping container lifecycle events. Route requests only to the newest connection, determined by socket connection time. Shadow older plugins and skip them in event routing. Remove older plugins when they disconnect (e.g., during a DaemonSet rollout).

Tested by running make test.

Assisted-by: Antigravity

@chrishenzie chrishenzie marked this pull request as draft March 24, 2026 00:36
@chrishenzie chrishenzie marked this pull request as ready for review March 24, 2026 02:36
Track duplicate NRI plugin connections during updates to prevent
dropping container lifecycle events. Route requests only to the
newest connection, determined by socket connection time. Shadow
older plugins and skip them in event routing. Remove older plugins
when they disconnect (e.g., during a DaemonSet rollout).

Tested by running make test.

Assisted-by: Antigravity
Signed-off-by: Chris Henzie <chrishenzie@gmail.com>
@chrishenzie chrishenzie force-pushed the nri-zero-downtime-upgrade branch from 28db168 to 589e080 March 24, 2026 05:36
@klihub
Member

klihub commented Mar 24, 2026

@chrishenzie @samuelkarp Thanks for picking this up. I've only taken a brief look so far, but I like this.

I implemented something very similar a few months ago as part of the futile giga-PR #235, which I have since given up on. Commit f145341 was the one that implemented plugin update there, but it took a slightly different approach.

We had another commit there, 031837f, which implemented shutting down plugins using an updated version of the existing (but currently unimplemented in the main branch) Shutdown() protocol RPC call. When we detected a plugin update, we sent an explicit Shutdown() request to the old instance, passing it a well-known exported constant ShutdownByOtherInstance = "other plugin instance registered" as the reason string. The plugin could then examine the reason and understand to go away/disconnect of its own accord instead of us kicking its connection out after a timeout. The other minor difference was that the PR simply reset the plugin's event registration mask to 0/empty to prevent further event delivery, instead of adding a dedicated shadowed flag, but I think that's a minor detail.

I think we'll need to implement an improved version of Shutdown() with a reason string anyway, since we already need the ability to kick out plugins with an explanation (for instance, when a plugin is too slow to respond). Since I was planning to split out the Shutdown() bits and file them as a separate PR anyway, what do you think about adding something similar on top of this, if/once we have such a thing merged?
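The Shutdown()-with-reason idea described above could look roughly like this. Only the `ShutdownByOtherInstance` constant and its reason string come from the comment; the request type and handler are hypothetical:

```go
package main

import "fmt"

// ShutdownByOtherInstance mirrors the well-known reason string described
// in the discussion; the surrounding API is a hypothetical sketch, not
// the actual NRI protocol.
const ShutdownByOtherInstance = "other plugin instance registered"

// ShutdownRequest is a hypothetical stand-in for an extended Shutdown()
// RPC that carries an explanatory reason.
type ShutdownRequest struct {
	Reason string
}

// onShutdown sketches how a plugin could react to the reason: exit
// voluntarily on a handoff instead of being kicked out after a timeout.
func onShutdown(req ShutdownRequest) string {
	if req.Reason == ShutdownByOtherInstance {
		return "disconnecting: replaced by a newer instance"
	}
	return "shutting down: " + req.Reason
}

func main() {
	fmt.Println(onShutdown(ShutdownRequest{Reason: ShutdownByOtherInstance}))
	fmt.Println(onShutdown(ShutdownRequest{Reason: "too slow to respond"}))
}
```

The point of carrying a reason is that the old instance can distinguish a routine handoff from an error and choose its own exit path.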

@samuelkarp
Member

@chrishenzie and I just discussed this; I think it needs a bit of a design doc describing the semantics (especially around determining which plugin is the "newer" one in scenarios where there might be some crashlooping going on -- plugin awareness of the handoff might be important here) before we go forward with it. We don't currently have this in as a 1.0 requirement and I'm not really convinced it needs to be; the existing required plugin annotations cover the downtime case.

@klihub
Member

klihub commented Mar 25, 2026

> @chrishenzie and I just discussed this; I think it needs a bit of a design doc describing the semantics (especially around determining which plugin is the "newer" one in scenarios where there might be some crashlooping going on -- plugin awareness of the handoff might be important here) before we go forward with it. We don't currently have this in as a 1.0 requirement and I'm not really convinced it needs to be; the existing required plugin annotations cover the downtime case.

Yes, that was the other question I was about to ask (since we're adding more version information to plugin registration): how could the infra reasonably determine which instance is newer, and should we add an NRI plugin version field to registration as well? But I didn't ask it, because I realized it's not really about version: it should be perfectly fine to downgrade your plugin, in which case the downgraded instance is still the 'newer' one we want to keep. It's semantically closer to an instance generation than a plugin version that we'd need when deciding whether to kick out the existing instance or reject the registering one...
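The generation-vs-version distinction can be made concrete with a small sketch; the `instance` type and `keep` helper below are hypothetical, not proposed API:

```go
package main

import "fmt"

// instance is a hypothetical registration record. What matters for the
// handoff decision is a monotonically increasing generation assigned by
// the runtime at registration time, not the plugin's own version string.
type instance struct {
	version    string // may go *down* across an upgrade (a rollback)
	generation uint64 // assigned at registration; always increases
}

// keep returns the instance that should survive: the one registered
// later, regardless of version ordering.
func keep(a, b instance) instance {
	if a.generation > b.generation {
		return a
	}
	return b
}

func main() {
	old := instance{version: "v2.0.0", generation: 1}
	rollback := instance{version: "v1.9.0", generation: 2} // a downgrade
	fmt.Println(keep(old, rollback).version) // v1.9.0
}
```

A version comparison would keep v2.0.0 here and defeat the rollback; comparing generations keeps whichever instance registered last.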

@chrishenzie
Member Author

chrishenzie commented Mar 25, 2026

Thank you both for the feedback -- I agree this needs more design discussion on the desired behavior. I'm going to table this for now (at least move it back to draft) and revisit once we've completed the 1.0 deliverables.

Edit: I see this issue is actually related to the "consistent failure characteristics" 1.0 deliverable, so we'll want some kind of solution to this problem after all. I'll shift the discussion back to the GitHub issue and we can proceed from there.

@chrishenzie chrishenzie marked this pull request as draft March 25, 2026 16:32
@chrishenzie chrishenzie changed the title from "Implement zero-downtime NRI plugin upgrades" to "[WIP] Implement zero-downtime NRI plugin upgrades" Mar 25, 2026
@samuelkarp
Member

> But I didn't ask that in particular because then I realized/now think it's not really about version, since it should be perfectly fine that you are downgrading your plugin, in which case it should be considered the 'newer' instance we want to keep. It's semantically closer to an instance generation than a plugin version that we'd need to decide whether to kick out the existing or reject the registering one...

Yes, exactly. There are also cases where there might not be a "correct" answer, such as a cluster operator mistakenly deploying two different DaemonSets that each carry a copy of the same plugin. Which one should win then?
