
[WIP] Implement zero-downtime NRI plugin upgrades #281

Draft

chrishenzie wants to merge 1 commit into containerd:main from chrishenzie:nri-zero-downtime-upgrade

Conversation

@chrishenzie
Member

Track duplicate NRI plugin connections during updates to prevent dropping container lifecycle events. Route requests only to the newest connection, determined by socket connection time. Shadow older plugins and skip them in event routing. Remove older plugins when they disconnect (e.g., during a DaemonSet rollout).

Tested by running make test.

Assisted-by: Antigravity

@chrishenzie chrishenzie marked this pull request as draft March 24, 2026 00:36
@chrishenzie chrishenzie marked this pull request as ready for review March 24, 2026 02:36
Track duplicate NRI plugin connections during updates to prevent
dropping container lifecycle events. Route requests only to the
newest connection, determined by socket connection time. Shadow
older plugins and skip them in event routing. Remove older plugins
when they disconnect (e.g., during a DaemonSet rollout).

Tested by running make test.

Assisted-by: Antigravity
Signed-off-by: Chris Henzie <chrishenzie@gmail.com>
@chrishenzie chrishenzie force-pushed the nri-zero-downtime-upgrade branch from 28db168 to 589e080 March 24, 2026 05:36
@klihub
Member

klihub commented Mar 24, 2026

@chrishenzie @samuelkarp Thanks for picking this up. I've only taken a brief look so far, but I like this.

I implemented something very similar a few months ago as part of the futile giga-PR #235, which I have since given up on. Commit f145341 was the one that implemented plugin update there, but it took a slightly different approach.

We had another commit there, 031837f, which implemented shutting down plugins using an updated version of the existing (but currently unimplemented in the main branch) Shutdown() protocol RPC call. When we detected a plugin update, we sent an explicit Shutdown() request to the old instance, passing it a well-known exported constant ShutdownByOtherInstance = "other plugin instance registered" as the reason string. The plugin could then examine the reason and understand to go away/disconnect of its own accord instead of us kicking its connection out after a timeout. The other minor difference was that the PR simply reset the plugin's event registration mask to 0/empty to prevent further event delivery, instead of adding a dedicated shadowed flag, but I think that's a minor detail.

I think we'll need to implement an improved version of Shutdown() with a reason string anyway, since we already need the ability to kick out plugins with an explanation (for instance, when a plugin is too slow to respond). Since I was planning to split out the Shutdown() bits and file them as a separate PR anyway, what do you think about adding something similar on top of this, if/once we have such a thing merged?
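The Shutdown()-with-reason idea described above could look roughly like this. Only the `ShutdownByOtherInstance` constant and its reason string come from the comment; the request type and handler are hypothetical:

```go
package main

import "fmt"

// ShutdownByOtherInstance mirrors the well-known reason string described
// in the discussion; the surrounding API is a hypothetical sketch, not
// the actual NRI protocol.
const ShutdownByOtherInstance = "other plugin instance registered"

// ShutdownRequest is a hypothetical stand-in for an extended Shutdown()
// RPC that carries an explanatory reason.
type ShutdownRequest struct {
	Reason string
}

// onShutdown sketches how a plugin could react to the reason: exit
// voluntarily on a handoff instead of being kicked out after a timeout.
func onShutdown(req ShutdownRequest) string {
	if req.Reason == ShutdownByOtherInstance {
		return "disconnecting: replaced by a newer instance"
	}
	return "shutting down: " + req.Reason
}

func main() {
	fmt.Println(onShutdown(ShutdownRequest{Reason: ShutdownByOtherInstance}))
	fmt.Println(onShutdown(ShutdownRequest{Reason: "too slow to respond"}))
}
```

The point of carrying a reason is that the old instance can distinguish a routine handoff from an error and choose its own exit path.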

@samuelkarp
Member

@chrishenzie and I just discussed this; I think it needs a bit of a design doc describing the semantics (especially around determining which plugin is the "newer" one in scenarios where there might be some crashlooping going on -- plugin awareness of the handoff might be important here) before we go forward with it. We don't currently have this in as a 1.0 requirement and I'm not really convinced it needs to be; the existing required plugin annotations cover the downtime case.

@klihub
Member

klihub commented Mar 25, 2026

> @chrishenzie and I just discussed this; I think it needs a bit of a design doc describing the semantics (especially around determining which plugin is the "newer" one in scenarios where there might be some crashlooping going on -- plugin awareness of the handoff might be important here) before we go forward with it. We don't currently have this in as a 1.0 requirement and I'm not really convinced it needs to be; the existing required plugin annotations cover the downtime case.

Yes, that was the other question I was about to ask (since we're adding more version information to plugin registration): how could the infra reasonably determine which instance is newer, and should we add an NRI plugin version field to registration as well? But I didn't ask it, because I realized it's not really about version: it should be perfectly fine to downgrade your plugin, in which case the downgraded instance is still the 'newer' one we want to keep. It's semantically closer to an instance generation than a plugin version that we'd need when deciding whether to kick out the existing instance or reject the registering one...
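The generation-vs-version distinction can be made concrete with a small sketch; the `instance` type and `keep` helper below are hypothetical, not proposed API:

```go
package main

import "fmt"

// instance is a hypothetical registration record. What matters for the
// handoff decision is a monotonically increasing generation assigned by
// the runtime at registration time, not the plugin's own version string.
type instance struct {
	version    string // may go *down* across an upgrade (a rollback)
	generation uint64 // assigned at registration; always increases
}

// keep returns the instance that should survive: the one registered
// later, regardless of version ordering.
func keep(a, b instance) instance {
	if a.generation > b.generation {
		return a
	}
	return b
}

func main() {
	old := instance{version: "v2.0.0", generation: 1}
	rollback := instance{version: "v1.9.0", generation: 2} // a downgrade
	fmt.Println(keep(old, rollback).version) // v1.9.0
}
```

A version comparison would keep v2.0.0 here and defeat the rollback; comparing generations keeps whichever instance registered last.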

@chrishenzie
Member Author

chrishenzie commented Mar 25, 2026

Thank you both for the feedback -- I agree this needs more design discussion on the desired behavior. I'm going to table this for now (at least move it back to draft) and revisit once we've completed the 1.0 deliverables.

Edit: I see this issue is actually related to the "consistent failure characteristics" 1.0 deliverable, so we'll want some kind of solution to this problem after all. I'll shift the discussion back to the GitHub issue and we can proceed from there.

@chrishenzie chrishenzie marked this pull request as draft March 25, 2026 16:32
@chrishenzie chrishenzie changed the title from "Implement zero-downtime NRI plugin upgrades" to "[WIP] Implement zero-downtime NRI plugin upgrades" Mar 25, 2026
@samuelkarp
Member

> But I didn't ask that in particular because then I realized/now think it's not really about version, since it should be perfectly fine that you are downgrading your plugin, in which case it should be considered the 'newer' instance we want to keep. It's semantically closer to an instance generation than a plugin version that we'd need to decide whether to kick out the existing or reject the registering one...

Yes, exactly. There are also cases where there might not be a "correct" answer, such as a cluster operator mistakenly deploying two different DaemonSets that each carry a copy of the same plugin. Which one should win then?
