On dogfood, currently running 19.3, if I make multiple concurrent requests for instances with more local storage than the system has in total, I can get a saga stuck.
root@oxz_switch0:~# omdb db saga show 30ba6c08-bf69-425d-bf61-a3e9844a1b14
id | current_sec | time_created | name | state
--------------------------------------+--------------------------------------+--------------------------+----------------+---------
30ba6c08-bf69-425d-bf61-a3e9844a1b14 | e52a54b0-4341-4c3f-abf9-2c757cd314e8 | 2026-05-06T13:15:56.606Z | instance-start | Running
DAG: {"end_node":25,"graph":{"edge_property":"directed","edges":[[0,1,null],[1,2,null],[2,3,null],[3,4,null],[4,5,null],[5,6,null],[5,7,null],[5,8,null],[5,9,null]
,[5,10,null],[5,11,null],[5,12,null],[5,13,null],[5,14,null],[5,15,null],[5,16,null],[5,17,null],[6,18,null],[7,18,null],[8,18,null],[9,18,null],[10,18,null],[11,1
8,null],[12,18,null],[13,18,null],[14,18,null],[15,18,null],[16,18,null],[17,18,null],[18,19,null],[19,20,null],[20,21,null],[21,22,null],[22,23,null],[24,0,null],
[23,25,null]],"node_holes":[],"nodes":[{"Action":{"action_name":"instance_start.generate_propolis_id","label":"GeneratePropolisId","name":"propolis_id"}},{"Action"
:{"action_name":"instance_start.alloc_server","label":"AllocServer","name":"sled_id"}},{"Action":{"action_name":"instance_start.alloc_propolis_ip","label":"AllocPr
opolisIp","name":"propolis_ip"}},{"Action":{"action_name":"instance_start.create_vmm_record","label":"CreateVmmRecord","name":"vmm_record"}},{"Action":{"action_nam
e":"instance_start.mark_as_starting","label":"MarkAsStarting","name":"started_record"}},{"Action":{"action_name":"instance_start.list_local_storage","label":"ListL
ocalStorage","name":"local_storage_records"}},{"Action":{"action_name":"instance_start.ensure_local_storage_0","label":"EnsureLocalStorage_0","name":"ensure_local_
storage_0"}},{"Action":{"action_name":"instance_start.ensure_local_storage_1","label":"EnsureLocalStorage_1","name":"ensure_local_storage_1"}},{"Action":{"action_n
ame":"instance_start.ensure_local_storage_2","label":"EnsureLocalStorage_2","name":"ensure_local_storage_2"}},{"Action":{"action_name":"instance_start.ensure_local
_storage_3","label":"EnsureLocalStorage_3","name":"ensure_local_storage_3"}},{"Action":{"action_name":"instance_start.ensure_local_storage_4","label":"EnsureLocalS
torage_4","name":"ensure_local_storage_4"}},{"Action":{"action_name":"instance_start.ensure_local_storage_5","label":"EnsureLocalStorage_5","name":"ensure_local_st
orage_5"}},{"Action":{"action_name":"instance_start.ensure_local_storage_6","label":"EnsureLocalStorage_6","name":"ensure_local_storage_6"}},{"Action":{"action_nam
e":"instance_start.ensure_local_storage_7","label":"EnsureLocalStorage_7","name":"ensure_local_storage_7"}},{"Action":{"action_name":"instance_start.ensure_local_s
torage_8","label":"EnsureLocalStorage_8","name":"ensure_local_storage_8"}},{"Action":{"action_name":"instance_start.ensure_local_storage_9","label":"EnsureLocalSto
rage_9","name":"ensure_local_storage_9"}},{"Action":{"action_name":"instance_start.ensure_local_storage_10","label":"EnsureLocalStorage_10","name":"ensure_local_st
orage_10"}},{"Action":{"action_name":"instance_start.ensure_local_storage_11","label":"EnsureLocalStorage_11","name":"ensure_local_storage_11"}},{"Action":{"action
_name":"instance_start.dpd_ensure","label":"DpdEnsure","name":"dpd_ensure"}},{"Action":{"action_name":"instance_start.v2p_ensure","label":"V2PEnsure","name":"v2p_e
nsure"}},{"Action":{"action_name":"instance_start.ensure_registered","label":"EnsureRegistered","name":"ensure_registered"}},{"Action":{"action_name":"instance_sta
rt.update_multicast_sled_id","label":"UpdateMulticastSledId","name":"multicast_sled_id"}},{"Action":{"action_name":"instance_start.add_virtual_resources","label":"
AddVirtualResources","name":"virtual_resources"}},{"Action":{"action_name":"instance_start.ensure_running","label":"EnsureRunning","name":"ensure_running"}},{"Star
t":{"params":{"db_instance":{"auto_restart":{"cooldown":null,"policy":null},"boot_disk_id":null,"cpu_platform":null,"dst_propolis_id":null,"hostname":"aa11","ident
ity":{"description":"aa11 test host","id":"6ee7bdcb-c816-400e-91a3-9cda6569cedf","name":"aa11","time_created":"2026-05-06T13:15:44.805808Z","time_deleted":null,"ti
me_modified":"2026-05-06T13:15:44.805808Z"},"intended_state":"Running","memory":8589934592,"migration_id":null,"ncpus":8,"nexus_state":"NoVmm","project_id":"759bea
f2-517d-4d24-bc17-1eed69bc8801","propolis_id":null,"state_generation":2,"time_last_auto_restarted":null,"time_state_updated":"2026-05-06T13:15:44.805808Z","updater
_gen":1,"updater_id":null,"user_data":[35,99,108,111,117,100,45,99,111,110,102,105,103,10,109,97,110,97,103,101,95,101,116,99,95,104,111,115,116,115,58,32,116,114,
117,101,10]},"reason":"AutoStart","serialized_authn":{"kind":{"Authenticated":[{"actor":{"SiloUser":{"silo_id":"7bd7623a-68ed-4636-8ecb-b59e3b068787","silo_user_id
":"93521315-630d-439b-b1fc-43a5e68b7c5b"}},"credential_id":"0bfa2c5b-5325-4f87-8056-8b9d948cec8f","device_token_expiration":null},{"mapped_fleet_roles":{"admin":["
admin"]}}]}}}}},"End"]},"saga_name":"instance-start","start_node":24}
event time | sub saga | node id | event type | data
------------------------ | -------- | ---------------------------------------- | ---------- | ---
2026-05-06T13:15:56.682Z | | 24: start | started |
2026-05-06T13:15:56.693Z | | 24: start | succeeded |
2026-05-06T13:15:56.714Z | | 0: instance_start.generate_propolis_id | started |
2026-05-06T13:15:56.742Z | | 0: instance_start.generate_propolis_id | succeeded | "propolis_id" => "3fd10da7-73b4-4da7-9f7e-205c6c0d0dd3"
2026-05-06T13:15:56.761Z | | 1: instance_start.alloc_server | started |
This is running with the fix in #10353.
Here is an example of such a saga: