Nexus's sync_switch_configuration bg task periodically updates the SystemNetworkingConfig kept in the replicated bootstore (today used for "early networking" / cold boot; in the post-#10167 world, it will be used for networking configuration all the time). The data in the bootstore has a generation number; however, Nexus is not using generation numbers correctly. Each Nexus makes a local decision about whether to bump the generation number based solely on "does the config I computed match the most recent config kept in the bootstore":
// * If the config we've built from the switchport configuration
// information is different from the last config we've cached
// in the db, we update the config, cache it in the db, and
// apply it.
This allows two Nexuses to duel. An example:
1. Config in the bootstore is currently on generation 10.
2. Nexus A activates its sync_switch_configuration bg task, computes a new config, but then stalls somewhere between that and checking whether to update the bootstore.
3. Something in the configuration changes.
4. Nexus B activates its sync_switch_configuration bg task, computes the (correct) change, and writes generation 11 to the bootstore.
5. Nexus A resumes: it has computed a config that matched generation 10, but now when it compares against the bootstore, it sees a different value, so it believes it's supposed to bump the generation to 12. It does so, and overwrites the correct config (written by Nexus B in the previous step) with the old config from generation 10.
This is eventually consistent: the next time a Nexus runs, it will recompute the correct config, see that it doesn't match, and write a generation 13 that is correct.
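The interleaving above can be replayed with a small in-memory model. This is only a sketch: `Stored`, `naive_sync`, and `replay_duel` are hypothetical names invented for illustration, the config is reduced to a string, and nothing here reflects the real bootstore or Nexus APIs. It only models the "write if different" decision rule.

```rust
/// Hypothetical stand-in for the value kept in the bootstore.
#[derive(Debug, Clone, PartialEq)]
struct Stored {
    generation: u64,
    config: String,
}

/// The decision rule described above: bump the generation whenever the
/// locally computed config differs from what is currently stored.
fn naive_sync(stored: &mut Stored, computed: &str) {
    if stored.config != computed {
        stored.generation += 1;
        stored.config = computed.to_string();
    }
}

/// Replays the interleaving and returns the bootstore contents after
/// each sync attempt.
fn replay_duel() -> Vec<Stored> {
    let mut bootstore = Stored { generation: 10, config: "old".to_string() };
    let mut history = Vec::new();

    // Nexus A computes a config matching generation 10, then stalls.
    let computed_by_a = bootstore.config.clone();

    // The configuration changes; Nexus B computes it and writes generation 11.
    let computed_by_b = "new".to_string();
    naive_sync(&mut bootstore, &computed_by_b);
    history.push(bootstore.clone());

    // Nexus A resumes, sees a mismatch, and overwrites with its stale config.
    naive_sync(&mut bootstore, &computed_by_a);
    history.push(bootstore.clone());

    // A later run recomputes the correct config and repairs the bootstore.
    naive_sync(&mut bootstore, "new");
    history.push(bootstore.clone());

    history
}

fn main() {
    for state in replay_duel() {
        println!("generation {}: {}", state.generation, state.config);
    }
}
```

In this model the stale write is indistinguishable from a legitimate one, because the comparison carries no information about which input the writer computed its config from.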
I made this problem worse in #10219: sync_switch_configuration now inspects the current target blueprint to fill in the service zone NAT entries, but similarly to the ordering above, a Nexus looking at an old target can overwrite a config written by a Nexus that used a newer target.
I think tackling this will require a couple different projects:
Management of network configuration needs a nontrivial overhaul. @internet-diglett and I chatted about this briefly; "Nexus panic inside sync_switch_configuration background task" (#8579) and "Do not panic in switch synchronization task" (#8714) are closely related: as this task builds up the config, it makes many independent (non-transactional) reads of the db and assumes it's getting a coherent view. If the config changes in between some of those reads, that view is not coherent, which may produce an incorrect config, cause the task to panic, or lead to other badness. The task needs a way to (a) get a coherent view of the current config and (b) know whether that view is out of date relative to what's stored in the bootstore.
This issue covers "project 2". A tentative plan for this is:
Add a rendezvous table that blueprint execution can write containing the current set of external networking state of a blueprint.
This table will need a generation number of its own to allow execution of stale blueprints. I think this means the blueprint itself will need to grow at least a separate generation number covering "all the external networking details". (This information is implicit in the contents of all the sled configs, so maybe it's enough to just store a generation number? But it might be clearer to store an entire BlueprintExternalNetworkingConfig structure of some kind. I'll noodle on this.)
Blueprint execution will need to update this table and still update the outside-of-reconfigurator networking tables to keep them in sync with the rendezvous table.
The sync_switch_configuration task should operate exclusively on the rendezvous table, will need to store its generation number in the bootstore config, and can use that generation number to know whether the NAT entry subset of the config needs to be updated.
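One way the last step of this plan might look, as a rough sketch only: the stored config records which rendezvous-table generation its NAT entries came from, and a writer may only replace them if it read a strictly newer generation. `BootstoreConfig`, `source_generation`, `SyncOutcome`, and `sync_nat_entries` are all hypothetical names, NAT entries are reduced to strings, and the `&mut` borrow stands in for whatever atomic check-and-write the real bootstore would need.

```rust
use std::cmp::Ordering;

/// Hypothetical shape of the bootstore config under the plan above.
#[derive(Debug, Clone, PartialEq)]
struct BootstoreConfig {
    /// The bootstore's own generation number.
    generation: u64,
    /// Assumed new field: rendezvous-table generation the NAT entries came from.
    source_generation: u64,
    nat_entries: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum SyncOutcome {
    /// The caller read a newer rendezvous generation; the bootstore was updated.
    Updated,
    /// The bootstore already reflects this rendezvous generation.
    UpToDate,
    /// The caller read an older rendezvous generation; its write is refused.
    Stale,
}

/// Decide whether to write based on generations, not on config contents.
fn sync_nat_entries(
    stored: &mut BootstoreConfig,
    source_generation: u64,
    nat_entries: &[String],
) -> SyncOutcome {
    match source_generation.cmp(&stored.source_generation) {
        Ordering::Less => SyncOutcome::Stale,
        Ordering::Equal => SyncOutcome::UpToDate,
        Ordering::Greater => {
            stored.generation += 1;
            stored.source_generation = source_generation;
            stored.nat_entries = nat_entries.to_vec();
            SyncOutcome::Updated
        }
    }
}

fn main() {
    let mut stored = BootstoreConfig {
        generation: 11,
        source_generation: 5,
        nat_entries: vec!["nat-for-target-5".to_string()],
    };
    // A Nexus that read an older rendezvous generation cannot overwrite.
    let stale = sync_nat_entries(&mut stored, 4, &["nat-for-target-4".to_string()]);
    println!("{:?} -> {:?}", stale, stored);
    // A Nexus that read a newer rendezvous generation can.
    let fresh = sync_nat_entries(&mut stored, 6, &["nat-for-target-6".to_string()]);
    println!("{:?} -> {:?}", fresh, stored);
}
```

Under this rule, the stalled Nexus from the earlier example would get `Stale` back instead of bumping the bootstore generation, since the decision no longer depends on whether the two configs happen to differ.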
Source of the code comment quoted above: omicron/nexus/src/app/background/tasks/sync_switch_configuration.rs, lines 1333 to 1336 in ed5c46a.