
Add support for re-adopting physical disks#10221

Open
andrewjstone wants to merge 31 commits into main from manual-disk-adoption

Conversation

@andrewjstone
Contributor

@andrewjstone andrewjstone commented Apr 4, 2026

This change implements the determinations in RFD 663. It allows
re-adopting physical disks in the control plane after the control plane
level disk in the physical_disk table is expunged.

It does this by forcing manual adoption of disks by an operator, where
requests are placed in the physical_disk_adoption_request table.
A disk will now only be adopted or re-adopted by the disk adoption
background task if its physical vendor/model/serial information is
present in a physical_disk_adoption_request row.

The typical flow for an operator is to list uninitialized disks and then
issue an adoption request via the external API.
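
To make this concrete, here is a minimal illustrative sketch of what an adoption-request row carries. The field names follow the `physical_disk_adoption_request` schema output shown later in this PR; the struct name and Rust types are assumptions for illustration, not the actual db-model code.

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

/// Illustrative shape of a `physical_disk_adoption_request` row.
struct AdoptionRequestRow {
    id: Uuid,
    vendor: String,
    model: String,
    serial: String,
    time_created: DateTime<Utc>,
    /// A request is "live" while this is `None`; the adoption background task
    /// soft-deletes the row once the matching disk has been adopted.
    time_deleted: Option<DateTime<Utc>>,
}
```

Matching is done on the vendor/model/serial triple rather than the control plane disk ID, which is why a re-adopted disk can come back under a new ID (as shown in the testing output at the bottom of this PR).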

Comment thread nexus/src/app/sled.rs Outdated
@andrewjstone andrewjstone force-pushed the manual-disk-adoption branch from 67bdc40 to ce4b577 Compare April 7, 2026 20:35
@andrewjstone andrewjstone changed the title WIP: Manual disk adoption Add support for re-adopting physical disks Apr 7, 2026
This change implements the determinations in RFD 693. It allows
re-adopting physical disks in the control plane after the control plane
level disk in the `physical_disk` table is expunged.

It does this by forcing manual adoption of disks by an operator, where
requests are placed in the `physical_disk_adoption_request` table.
A disk will now only be adopted or re-adopted by the disk adoption
background task if its physical vendor/model/serial information is
present in a `physical_disk_adoption_request` row.

The typical flow for an operator is to list uninitialized disks and then
issue an adoption request via the external API.
@andrewjstone andrewjstone force-pushed the manual-disk-adoption branch from ce4b577 to edc851b Compare April 7, 2026 20:42
@andrewjstone andrewjstone marked this pull request as ready for review April 7, 2026 20:42
Comment thread nexus/src/external_api/http_entrypoints.rs Outdated
@smklein
Collaborator

smklein commented Apr 7, 2026

This change implements the determinations in RFD 693. It allows re-adopting physical disks in the control plane after the control plane level disk in the physical_disk table is expunged.

Nit: 663

@andrewjstone
Contributor Author

This change implements the determinations in RFD 693. It allows re-adopting physical disks in the control plane after the control plane level disk in the physical_disk table is expunged.

Nit: 663

Whoops. I even have it open in a tab. Thanks!

Contributor

@ahl ahl left a comment

To the best of my understanding, we have a couple of metaphors in here. Disks that are unknown to the control plane are "uninitialized" and appear in that list. The verb we're using is "adopt", as in "this uninitialized disk is adopted by the control plane". Is the term "adopt" intended to be the opposite of "expunge"?

I'm not clear on how the operator would use this. Presumably there's supposed to be some step (e.g. of exploration or cognition) between "list uninit disks" and "approve disk(s)", but I'm not sure what it is.

We don’t want to allow automatic disk adoption due to the risk of the insertion of malicious hardware during casual physical access. This is especially problematic before we have disk attestation support, and in the case of existing sleds with empty disk bays.

Presumably I want to make sure the hardware I just put into the U.2 bay is the same as what I'm about to adopt. How do I do that?

The API changes look fine; I'd ask you to think about nomenclature ("uninitialized", "adopt").

Comment thread nexus/external-api/src/lib.rs Outdated
query: Query<PaginationParams<EmptyScanParams, String>>,
) -> Result<
HttpResponseOk<
ResultsPage<latest::physical_disk::UninitializedPhysicalDisk>,
Contributor

If I expunge a disk, I think that changes both the policy and state properties (I'm not sure why there are both of these -- is one intent and the other status?). Does the disk immediately show up in this list?

Do disks show up in only one place or the other? Do some show up in both?

Contributor Author

Expunge happens immediately, and is a terminal enum variant. It expresses the intended (desired) state. Once it occurs we can assume it will never change, and most things just look at expunge. The decommissioned state occurs later, after other steps, once the intended state has been realized, so there is some delay between the two.
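
As a rough sketch of that intent-vs-status split (the variant names mirror the policy and state values visible in the omdb output later in this PR; this is illustrative, not the actual Nexus type definitions):

```rust
/// Operator intent. Expungement is set immediately and is terminal: once a
/// disk is expunged, the policy never changes again.
enum PhysicalDiskPolicy {
    InService,
    Expunged,
}

/// Realized status. A disk only becomes `Decommissioned` after the control
/// plane has finished acting on an `Expunged` policy, so there can be a delay
/// between the two.
enum PhysicalDiskState {
    Active,
    Decommissioned,
}
```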

Comment thread nexus/external-api/src/lib.rs Outdated
@hawkw
Member

hawkw commented Apr 7, 2026

We don’t want to allow automatic disk adoption due to the risk of the insertion of malicious hardware during casual physical access. This is especially problematic before we have disk attestation support, and in the case of existing sleds with empty disk bays.

Presumably I want to make sure the hardware I just put into the U.2 bay is the same as what I'm about to adopt. How do I do that?

Is the idea that one would do this by comparing the manufacturer/model number/serial listed by the API endpoint with those physically printed on the actual disk, and also based on foreknowledge that disks have or have not been inserted in specific locations at the current point in time?

Comment on lines +300 to +304
pub async fn physical_disk_adoptable_list(
&self,
opctx: &OpContext,
inventory_collection_id: CollectionUuid,
) -> ListResultVec<InvPhysicalDisk> {
Member

I thought about suggesting that this be paginated, but...do we expect the maximum number of rows to be limited by the physical fact that 32 sleds * 10 U.2s = 320 disks maximum?

Contributor Author

There's not really a good way to paginate this right now AFAIK.

Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs
/// A physical disk that has not yet been adopted by the control plane
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize, JsonSchema)]
pub struct UninitializedPhysicalDisk {
pub sled_id: SledUuid,
Member

Do we expect the client to hydrate this sled UUID into a sled's physical location? It seems like it would be desirable for a UI listing physical disks that need to be adopted to be able to say which sled they are in as well as the slot within that sled...

Contributor Author

Good question. I suppose we could also provide the sled cubby. That would help operators out a bit probably.

Contributor

It's not guaranteed we have this, right? Uninitialized physical disks come from sled-agent inventory, and sled-agent doesn't know its own cubby number. We could try to match up the sled serial against the SP inventory contents to identify a cubby, and that will usually work, but we'd still need to be able to represent "physical disk for sled X for cubby I Dunno Ask Again Later".

@andrewjstone
Contributor Author

We don’t want to allow automatic disk adoption due to the risk of the insertion of malicious hardware during casual physical access. This is especially problematic before we have disk attestation support, and in the case of existing sleds with empty disk bays.

Presumably I want to make sure the hardware I just put into the U.2 bay is the same as what I'm about to adopt. How do I do that?

Is the idea that one would do this by comparing the manufacturer/model number/serial listed by the API endpoint with those physically printed on the actual disk, and also based on foreknowledge that disks have or have not been inserted in specific locations at the current point in time?

Yes, basically, if they actually cared. I think the larger security issue being mitigated is that any inserted disk can only be activated by an operator, rather than being adopted automatically for use. Checking the serial is how they would know which disk they were validating against.

),
),
)
// Ensure that each inventory disk has a valid adoption request
Collaborator

Is there a precondition that "you cannot have an adoption request for an already in-service disk"?

(Having already-in-service disks show up here seems wrong, just confirming what prevents that. I think the answer is "yes, a non-deleted adoption request means you have no live disks here")

Contributor Author

Yes, that's a precondition.
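
For illustration, a small self-contained sketch of that rule as the adoption background task is described as applying it (the names and types here are assumptions, not the datastore code):

```rust
use std::collections::HashSet;

/// Vendor/model/serial triple identifying a physical disk (illustrative).
#[derive(Clone, PartialEq, Eq, Hash)]
struct DiskIdentity {
    vendor: String,
    model: String,
    serial: String,
}

/// Adopt an inventory disk only when an operator has filed a live (not
/// soft-deleted) adoption request for it and no in-service `physical_disk`
/// row already covers it.
fn should_adopt(
    inventory_disk: &DiskIdentity,
    live_adoption_requests: &HashSet<DiskIdentity>,
    in_service_disks: &HashSet<DiskIdentity>,
) -> bool {
    live_adoption_requests.contains(inventory_disk)
        && !in_service_disks.contains(inventory_disk)
}
```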

Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs Outdated
Comment thread nexus/types/versions/src/manual_disk_adoption/physical_disk.rs Outdated
Comment thread nexus/src/external_api/http_entrypoints.rs Outdated
Comment thread nexus/src/app/background/tasks/physical_disk_adoption.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs
@andrewjstone
Contributor Author

To the best of my understanding, we have a couple of metaphors in here. Disks that are unknown to the control plane are "uninitialized" and appear in that list. The verb we're using is "adopt", as in "this uninitialized disk is adopted by the control plane". Is the term "adopt" intended to be the opposite of "expunge"?

It's not necessarily the opposite of expunge. A user can insert a disk in an empty bay and it would be adopted for use. It's idle / uninitialized before that.

I'm not clear on how the operator would use this. Presumably there's supposed to be some step (e.g. of exploration or cognition) between "list uninit disks" and "approve disk(s)", but I'm not sure what it is.

We don’t want to allow automatic disk adoption due to the risk of the insertion of malicious hardware during casual physical access. This is especially problematic before we have disk attestation support, and in the case of existing sleds with empty disk bays.

Presumably I want to make sure the hardware I just put into the U.2 bay is the same as what I'm about to adopt. How do I do that?

As Eliza noted, an operator would have to look at the vendor/model/serial on the drive and compare it to what shows up in the API.

The API changes look fine; I'd ask you to think about nomenclature ("uninitialized", "adopt").

I'm open to changing this. "Adopt" has been our internal de facto term for a while. We could say `initialize` or `activate` or something else.

Comment thread nexus/db-model/src/physical_disk.rs Outdated
Comment thread nexus/db-model/src/physical_disk.rs Outdated

Comment thread nexus/types/versions/src/manual_disk_adoption/physical_disk.rs Outdated
Comment thread nexus/src/app/sled.rs Outdated
Comment thread nexus/types/versions/src/manual_disk_adoption/physical_disk.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs
Comment thread nexus/src/external_api/http_entrypoints.rs Outdated
Comment thread schema/crdb/dbinit.sql Outdated
Comment thread schema/crdb/dbinit.sql Outdated
@hawkw
Member

hawkw commented Apr 8, 2026

I'm not clear on how the operator would use this. Presumably there's supposed to be some step (e.g. of exploration or cognition) between "list uninit disks" and "approve disk(s)", but I'm not sure what it is.

We don’t want to allow automatic disk adoption due to the risk of the insertion of malicious hardware during casual physical access. This is especially problematic before we have disk attestation support, and in the case of existing sleds with empty disk bays.

Presumably I want to make sure the hardware I just put into the U.2 bay is the same as what I'm about to adopt. How do I do that?

As Eliza noted, an operator would have to look at the vendor/model/serial on the drive and compare it to what shows up in the API.

@smklein and I have been talking through disk replacement scenarios a bit from the fault management context, and I think the adoption requests will eventually be part of the service flow for disk replacements. I think in particular, we would really like it if the adoption requests could easily include the physical location of the disk as part of a "you just replaced the disk in sled 19 slot 3, okay yeah it's that one" kinda spot check.

@andrewjstone
Contributor Author

To the best of my understanding, we have a couple of metaphors in here. Disks that are unknown to the control plane are "uninitialized" and appear in that list. The verb we're using is "adopt", as in "this uninitialized disk is adopted by the control plane". Is the term "adopt" intended to be the opposite of "expunge"?

@ahl @smklein @hawkw

I just talked about this with John a bit and if you don't like the term adopt, we could always use the word import. Then instead of uninitialized we would say unimported. We could also not mix metaphors by saying adopt and unadopted.

@ahl
Contributor

ahl commented Apr 8, 2026

I'm good with whatever of those you choose. I appreciate you discussing.

@andrewjstone
Contributor Author

andrewjstone commented Apr 8, 2026

@smklein and I have been talking through disk replacement scenarios a bit from the fault management context, and I think the adoption requests will eventually be part of the service flow for disk replacements. I think in particular, we would really like it if the adoption requests could easily include the physical location of the disk as part of a "you just replaced the disk in sled 19 slot 3, okay yeah it's that one" kinda spot check.

As @jgallagher pointed out earlier today, we don't actually have this information in sled-agent inventory. What happens if the SP inventory doesn't exist for this collection? Then we can't list the sled as uninitialized, or we always need to carry around an option here. Would it make sense to tell a customer "Hey, actually I don't know what cubby this disk is in right now," or "This sled is not uninitialized," even though the customer just inserted it into the rack and expects to see it?

@andrewjstone
Contributor Author

One thing I just realized was that with the current code, we will no longer automatically adopt disks when sleds are added to a rack. I confirmed with @rmustacc that this was not what he intended. Unfortunately, I'm not immediately sure how to fit this behavior in with the current one. Disks are detected asynchronously after a sled is added and currently adopted by the background task automatically. The new behavior ensures that a disk adoption request is made by a user to allow adoption by the background task, but the adoption itself is still done asynchronously in the background task and is separate from the sled add request.

What we would want is a state attached to the sled that says "automatically adopt disks that were present when the sled was added to the rack." But we don't really have any mechanism to discover that information. I think the best we can do with the current code base is say something like "automatically adopt disks for this sled for 5 minutes" after it has been created. Given inventory delays this is also problematic: a disk that was in the sled when it was added, and that the customer expected to be adopted, may not actually get adopted. That's a terrible user experience. We could lessen the likelihood of non-adoption by increasing this window, but that lengthens the time during which a casual physical attack is possible by inserting arbitrary disks.

The only other thing I can think of doing is adding disk information to the SledAgentInfo that gets published to the nexus internal api when the sled-agent starts up. But this is a client-versioned API currently, which could be problematic if a sled needs to be added during an upgrade.

@hawkw
Member

hawkw commented Apr 9, 2026

I just talked about this with John a bit and if you don't like the term adopt, we could always use the word import. Then instead of uninitialized we would say unimported. We could also not mix metaphors by saying adopt and unadopted.

NOT TO BE ANNOYING BUT: I really don't like "import" in this context, it feels like it is too easily misconstrued as "import the data that was on this disk", which is not what one would expect to be offered but which is a somewhat conceivable thing that might occur.

@andrewjstone
Contributor Author

I just talked about this with John a bit and if you don't like the term adopt, we could always use the word import. Then instead of uninitialized we would say unimported. We could also not mix metaphors by saying adopt and unadopted.

NOT TO BE ANNOYING BUT: I really don't like "import" in this context, it feels like it is too easily misconstrued as "import the data that was on this disk", which is not what one would expect to be offered but which is a somewhat conceivable thing that might occur.

NOT ANNOYING AT ALL!!! I appreciate the feedback.

I'm not a huge fan of import either. I still like adopt. FWIW, claude does too, but it's a sycophant that hallucinated usage in both zfs and kubernetes, so do with that what you will.

I think to make the metaphors less mixed I'd switch to listing unadopted rather than uninitialized disks.

@andrewjstone
Contributor Author

Thanks for the reviews @smklein @hawkw @jgallagher

I believe I've addressed everything. Unfortunately, I discovered an issue that should probably be resolved before merge or explicitly decided not to implement: #10221 (comment)

@andrewjstone
Contributor Author

One thing I just realized was that with the current code, we will no longer automatically adopt disks when sleds are added to a rack. I confirmed with @rmustacc that this was not what he intended. Unfortunately, I'm not immediately sure how to fit this behavior in with the current one. Disks are detected asynchronously after a sled is added and currently adopted by the background task automatically. The new behavior ensures that a disk adoption request is made by a user to allow adoption by the background task, but the adoption itself is still done asynchronously in the background task and is separate from the sled add request.

What we would want is a state attached to the sled that says "automatically adopt disks that were present when the sled was added to the rack." But we don't really have any mechanism to discover that information. I think the best we can do with the current code base is say something like "automatically adopt disks for this sled for 5 minutes" after it has been created. Given inventory delays this is also problematic: a disk that was in the sled when it was added, and that the customer expected to be adopted, may not actually get adopted. That's a terrible user experience. We could lessen the likelihood of non-adoption by increasing this window, but that lengthens the time during which a casual physical attack is possible by inserting arbitrary disks.

The only other thing I can think of doing is adding disk information to the SledAgentInfo that gets published to the nexus internal api when the sled-agent starts up. But this is a client-versioned API currently, which could be problematic if a sled needs to be added during an upgrade.

There are further problems that make the implementation next to impossible to do in an ideal manner.

  1. Before the sled-agent is up, it is not on the underlay network and so we can't ask it for the disks that are currently inserted.
  2. Even if we include those disks in the client-side versioned put to nexus from sled-agent, the disks themselves are loaded asynchronously by the hardware manager. It's possible that some of them haven't made themselves known yet.

We really seem to be restricted to a time based setup, or forcing manual disk adoption at all times.

@smklein
Collaborator

smklein commented Apr 10, 2026

@andrewjstone and I chatted about this a bit offline. Recording some of our thoughts here:

  • In the short-term, it may make sense to keep the old behavior of "auto-adopt disks that haven't been part of the control plane before". We can make that old pathway create adoption requests, to unify the disk setup process. This still would allow an operator to re-add an expunged disk. The "auto-adoption" conditions could also be turned into a toggle, or turned off, at some point in the future.
  • We could read disk information from uninitialized sleds over the bootstrap network, and present that information as a part of "sled add" - basically, "sled add" could become "sled add AND create these disk adoption requests". We suspect this would not be a small task, but it's theoretically possible?

@andrewjstone
Contributor Author

@andrewjstone and I chatted about this a bit offline. Recording some of our thoughts here:

* In the short-term, it may make sense to keep the old behavior of "auto-adopt disks that haven't been part of the control plane before". We can make that old pathway create adoption requests, to unify the disk setup process. This still would allow an operator to re-add an expunged disk. The "auto-adoption" conditions could also be turned into a toggle, or turned off, at some point in the future.

* We could read disk information from uninitialized sleds over the bootstrap network, and present that information as a part of "sled add" - basically, "sled add" could become "sled add AND create these disk adoption requests". We suspect this would not be a small task, but it's theoretically possible?

Based on discussion in the update huddle last week, we decided that to move forward we would auto-adopt disks that haven't been part of the control plane before. c9618fc makes this change. Importantly, it does so by inserting new disks into the physical_disk_adoption_request table rather than by changing the method that determines which disks are adoptable. This makes it easier to remove in the future. Thanks to @smklein for the suggestion.
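
As a sketch of what that behavior amounts to, building on the illustrative `DiskIdentity`/`HashSet` types from the earlier sketch (again an assumption-laden illustration, not the code in c9618fc):

```rust
/// File an adoption request automatically only for disks the control plane
/// has never recorded at all (no `physical_disk` row, not even an expunged
/// one) and that don't already have a live request. Re-adopting an expunged
/// disk still requires an operator-created request.
fn should_auto_request_adoption(
    inventory_disk: &DiskIdentity,
    ever_known_disks: &HashSet<DiskIdentity>,
    live_adoption_requests: &HashSet<DiskIdentity>,
) -> bool {
    !ever_known_disks.contains(inventory_disk)
        && !live_adoption_requests.contains(inventory_disk)
}
```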

@andrewjstone
Contributor Author

This is ready for a re-review @jgallagher @smklein. I still need to test it on hardware, which I'll do after feedback. I'd like to test either Thursday afternoon or Friday. Thanks!

Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs Outdated
Comment thread nexus/src/external_api/http_entrypoints.rs
Comment thread nexus/db-model/src/physical_disk.rs Outdated
Comment thread nexus/src/app/background/tasks/physical_disk_adoption.rs Outdated
Comment thread nexus/src/app/background/tasks/physical_disk_adoption.rs Outdated
Comment thread nexus/src/app/background/tasks/physical_disk_adoption.rs
Contributor

This was preexisting, but since we're here, this is also an update

Suggested change
opctx.authorize(authz::Action::Modify, &authz::FLEET).await?;

Contributor Author

Done.

Comment thread nexus/types/versions/src/manual_disk_adoption/physical_disk.rs Outdated
Comment thread nexus/types/versions/src/manual_disk_adoption/physical_disk.rs Outdated
Comment thread nexus/src/external_api/http_entrypoints.rs Outdated
Comment thread nexus/src/app/background/tasks/physical_disk_adoption.rs Outdated
Comment thread nexus/db-queries/src/db/datastore/physical_disk.rs Outdated
andrewjstone and others added 12 commits April 28, 2026 19:25
When we have an inventory disk we pass that in to `new` and set the
fields inside that method. This prevents passing strings in the
wrong order. However, most calls to `PhysicalDisk::new` are in tests.
Rather than forcing the creation of an intermediate type we add a new
`from_parts` method to allow more verbose construction.
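
A hedged sketch of the constructor split that commit message describes; the struct fields, signatures, and `InvPhysicalDisk` shape here are assumptions for illustration, not the actual code:

```rust
use uuid::Uuid;

struct InvPhysicalDisk {
    vendor: String,
    serial: String,
    model: String,
}

struct PhysicalDisk {
    sled_id: Uuid,
    vendor: String,
    serial: String,
    model: String,
}

impl PhysicalDisk {
    /// Preferred constructor: take the inventory record so the
    /// vendor/serial/model strings cannot be passed in the wrong order.
    fn new(sled_id: Uuid, inv: &InvPhysicalDisk) -> Self {
        Self {
            sled_id,
            vendor: inv.vendor.clone(),
            serial: inv.serial.clone(),
            model: inv.model.clone(),
        }
    }

    /// More verbose construction from individual parts, mainly for tests.
    fn from_parts(
        sled_id: Uuid,
        vendor: String,
        serial: String,
        model: String,
    ) -> Self {
        Self { sled_id, vendor, serial, model }
    }
}
```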
}

Ok(())
// Idempotent case: an adoption request already exists for this disk.
Collaborator

I'm not a big fan of "doing a transaction, then immediately doing another operation" - it makes this function non-atomic.

Suppose you try to insert a request, and see the unique violation. Normally we'd read the value from the DB (non transactionally!), but what if a concurrent DELETE happened? We'd proceed to the SELECT below, but we wouldn't find a request.

Then the physical_disk_enable_adoption function is returning "NOT FOUND"? That's weird, right?

Couldn't we catch the unique violation and do this query within the transaction?

Contributor Author

@andrewjstone andrewjstone May 8, 2026

Thanks for the review @smklein. I agreed with you here and made this change. Unfortunately it doesn't work. Once you hit a unique constraint error the transaction gets aborted. According to claude I can try to use a savepoint, then rollback to that savepoint and do the read. But this seems like it may lead to the same weirdness:

Inside the transaction there is a uniqueness error, so we rollback. Before the rollback someone deletes the adoption request that caused the rollback. We try to read it and return a not found error.

I think I prefer leaving things as is, as it's at least less complex to read. Since only one operator should be driving, there's very little likelihood of a request being added and deleted simultaneously.

Collaborator

@smklein smklein May 8, 2026

Huh, I'm surprised by that. I would have figured that the insert_into would be failing when we hit that constraint (it's an INSERT error, right??) and so if we call .get_result_async(&conn).await?, we'd return that error, exit the function, and cause the transaction to abort.

However, if we handle that error instead by matching on the result, shouldn't we be able to convert the "error" case of that Result to Ok(request), using a synthetic request (basically, matching whatever we were about to insert)?

I don't think there's anything magical about "hitting an error" in a transaction which causes it to abort - since transactions in diesel are mapped to functions, we should be only aborting when we return from this closure with an error, right? So as long as we return Ok(request), it should avoid aborting, right?

EDIT: Re-reading this; we could also "do the read once we see that first error" within the transaction still, right? I'm realizing we don't actually want to create a synthetic request - but we should still be able to do the read afterwards, as long as we don't bail early with an error?

Collaborator

Okay, following up, I suppose you're right. The transaction aborted state exists on the postgres side!

One caveat: we could do:

  • "Optional" read, store the value
  • Then insert

That way: if we hit the insert on-conflict case, we already read the value prior to hitting a DB error

Contributor Author

I thought the same thing, but it doesn't work. Cockroach sets the internal transaction state to aborted and you can't return OK after that. Here's the patch I made:

diff --git a/nexus/db-queries/src/db/datastore/physical_disk.rs b/nexus/db-queries/src/db/datastore/physical_disk.rs
index f503e0343..d6d975a0b 100644
--- a/nexus/db-queries/src/db/datastore/physical_disk.rs
+++ b/nexus/db-queries/src/db/datastore/physical_disk.rs
@@ -177,68 +177,72 @@ impl DataStore {
                     }

                     // Insert the adoption request.
-                    let request = diesel::insert_into(
+                    let res = diesel::insert_into(
                         adoption_dsl::physical_disk_adoption_request,
                     )
                     .values((
                         adoption_dsl::id.eq(Uuid::new_v4()),
-                        adoption_dsl::vendor.eq(vendor),
-                        adoption_dsl::serial.eq(serial),
-                        adoption_dsl::model.eq(model),
+                        adoption_dsl::vendor.eq(vendor.clone()),
+                        adoption_dsl::serial.eq(serial.clone()),
+                        adoption_dsl::model.eq(model.clone()),
                         adoption_dsl::time_created.eq(Utc::now()),
                     ))
                     .returning(PhysicalDiskAdoptionRequest::as_returning())
                     .get_result_async(&conn)
-                    .await?;
-
-                    Ok(request)
+                    .await;
+
+                    let diesel_err = match res {
+                        Ok(request) => return Ok(request),
+                        Err(e) => e,
+                    };
+
+                    match diesel_err {
+                        // Check for a unique index violation for an active
+                        // adoption request with the same vendor/serial/model.
+                        //
+                        // Return the existing request in this case as the
+                        // request is idempotent.
+                        DieselError::DatabaseError(
+                            DieselErrorKind::UniqueViolation,
+                            _,
+                        ) => {
+                            // Idempotent case: an adoption request already exists for this disk.
+                            // Query it back and return it.
+                            adoption_dsl::physical_disk_adoption_request
+                                    .filter(
+                                        adoption_dsl::vendor.eq(vendor),
+                                    )
+                                    .filter(
+                                        adoption_dsl::serial.eq(serial),
+                                    )
+                                    .filter(
+                                        adoption_dsl::model.eq(model),
+                                    )
+                                    .filter(
+                                        adoption_dsl::time_deleted.is_null(),
+                                    )
+                                    .select(
+                                        PhysicalDiskAdoptionRequest::as_select(
+                                        ),
+                                    )
+                                    .first_async(&conn)
+                                    .await
+                        }
+                        e => Err(e),
+                    }
                 }
             })
             .await;

         match txn_res {
-            Ok(request) => return Ok(request),
+            Ok(request) => Ok(request),
             Err(e) => match err.take() {
                 // A called function performed its own error propagation.
-                Some(txn_error) => {
-                    return Err(txn_error.into_public_ignore_retries());
-                }
+                Some(txn_error) => Err(txn_error.into_public_ignore_retries()),
                 // The transaction setup/teardown itself encountered a diesel error.
-                None => match e {
-                    // Check for a unique index violation for an active
-                    // adoption request with the same vendor/serial/model.
-                    //
-                    // Return the existing request in this case as the
-                    // request is idempotent.
-                    DieselError::DatabaseError(
-                        DieselErrorKind::UniqueViolation,
-                        _,
-                    ) => {
-                        // Fall through to query the existing request below.
-                    }
-                    _ => {
-                        return Err(public_error_from_diesel(
-                            e,
-                            ErrorHandler::Server,
-                        ));
-                    }
-                },
+                None => Err(public_error_from_diesel(e, ErrorHandler::Server)),
             },
         }
-
-        // Idempotent case: an adoption request already exists for this disk.
-        // Query it back and return it.
-        let conn = &*self.pool_connection_authorized(opctx).await?;
-        let request = adoption_dsl::physical_disk_adoption_request
-            .filter(adoption_dsl::vendor.eq(disk_id.vendor))
-            .filter(adoption_dsl::serial.eq(disk_id.serial))
-            .filter(adoption_dsl::model.eq(disk_id.model))
-            .filter(adoption_dsl::time_deleted.is_null())
-            .select(PhysicalDiskAdoptionRequest::as_select())
-            .first_async(conn)
-            .await
-            .map_err(|e| public_error_from_diesel(e, ErrorHandler::Server))?;
-        Ok(request)
     }

     /// Stores a new physical disk in the database.

And here's the error I got back from the idempotent test:

   thread 'db::datastore::physical_disk::test::physical_disk_adoptable_list' (2) panicked at nexus/db-queries/src/db/datastore/physical_disk.rs:1590:14:
    adoption request succeeds: InternalError { internal_message: "unexpected database error: current transaction is aborted, commands ignored until end of transaction block" }

Collaborator

Or, if we still want this fast on the "no conflict case":

  self.transaction_retry_wrapper("physical_disk_enable_adoption")
      .transaction(&conn, |conn| async move {
          // Try the insert. If a row with the same unique key already
          // exists, this returns Ok(None) instead of raising a unique
          // violation — so the txn stays healthy.
          let inserted = diesel::insert_into(dsl::physical_disk_adoption)
              .values(&new_request)
              .on_conflict(dsl::physical_disk_id)   // or whatever key
              .do_nothing()
              .returning(PhysicalDiskAdoption::as_returning())
              .get_result_async(conn)
              .await
              .optional()?;

          // On conflict, fetch the existing row in the same txn —
          // atomic w.r.t. concurrent DELETEs, no NotFound race.
          let row = match inserted {
              Some(r) => r,
              None => {
                  dsl::physical_disk_adoption
                      .filter(dsl::physical_disk_id.eq(disk_id))
                      .filter(dsl::time_deleted.is_null())
                      .select(PhysicalDiskAdoption::as_select())
                      .get_result_async(conn)
                      .await?
              }
          };
          Ok(row)
      })
      .await

Contributor Author

Thanks Sean. I'll give this strategy a try. I'm also curious what CRDB will do if we use .optional().

Contributor Author

I have now tried multiple ways to do what you suggested. Trying to use on_conflict((adoption_dsl::vendor, adoption_dsl::model, adoption_dsl::serial)) doesn't work because it doesn't include the WHERE clause in the index, and diesel doesn't support adding where clauses in on_conflict. I get the following error:

 thread 'db::datastore::physical_disk::test::physical_disk_adoptable_list' (2) panicked at nexus/db-queries/src/db/datastore/physical_disk.rs:1565:14:
    adoption request succeeds: InternalError { internal_message: "unexpected database error: there is no unique or exclusion constraint matching the ON CONFLICT specification" }

I then tried to use: .on_conflict(on_constraint("physical_disk_adoption_request_by_physical_id"))

but that also doesn't work with partial indexes. I ended up with the following error:

 thread 'db::datastore::physical_disk::test::physical_disk_unadopted_list' (2) panicked at nexus/db-queries/src/db/datastore/physical_disk.rs:1368:14:
    adoption request succeeds: InternalError { internal_message: "unexpected database error: unique constraint \"physical_disk_adoption_request_by_physical_id\" for table \"physical_disk_adoption_request\" is partial, so it cannot be used as an arbiter via the ON CONSTRAINT syntax" }

I'm pretty much out of ideas here besides trying to break out to raw SQL, which I don't think is necessarily worth it and I'm not sure will actually help.

Contributor Author

@jgallagher suggested doing the lookup first and then if it's not there doing the insert. It's slightly more expensive, but that should fix this. I'm going to give it a try.

Contributor Author

@jgallagher suggested doing the lookup first and then if it's not there doing the insert. It's slightly more expensive, but that should fix this. I'm going to give it a try.

This worked!
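
For reference, a stand-alone sketch of the read-then-insert shape that was settled on, modeled over an in-memory map rather than the real diesel/CockroachDB datastore code so the idempotency logic is visible without the schema plumbing (all names here are illustrative):

```rust
use std::collections::HashMap;
use uuid::Uuid;

#[derive(Clone)]
struct AdoptionRequest {
    id: Uuid,
    vendor: String,
    model: String,
    serial: String,
}

/// Idempotent "enable adoption": look for a live request first and return it
/// if present; otherwise insert a new one. In the real code both steps run
/// inside a single retryable transaction, so a concurrent delete cannot leave
/// the caller with a spurious NotFound.
fn enable_adoption(
    live_requests: &mut HashMap<(String, String, String), AdoptionRequest>,
    vendor: &str,
    model: &str,
    serial: &str,
) -> AdoptionRequest {
    let key = (vendor.to_string(), model.to_string(), serial.to_string());
    if let Some(existing) = live_requests.get(&key) {
        return existing.clone();
    }
    let request = AdoptionRequest {
        id: Uuid::new_v4(),
        vendor: vendor.to_string(),
        model: model.to_string(),
        serial: serial.to_string(),
    };
    live_requests.insert(key, request.clone());
    request
}
```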

Collaborator

@smklein smklein left a comment

LGTM modulo one last comment about transaction atomicity expectations

@andrewjstone
Contributor Author

Expunging and re-adding a disk succeeded.

Show disks in omdb before expunge

root@oxz_switch1:~# omdb db physical-disks
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd79:27cc:fa5e:103::3]:32221,[fd79:27cc:fa5e:102::3]:32221,[fd79:27cc:fa5e:104::3]:32221,[fd79:27cc:fa5e:101::3]:32221,[fd79:27cc:fa5e:104::4]:32221/omicron?sslmode=disable
note: database schema version matches expected (257.0.0)
note: listing all in-service disks (use -F to filter, e.g. -F in-service)
 ID                                    SERIAL    VENDOR  MODEL            SLED_ID                               POLICY      STATE
 03c736bc-fbe0-4a8c-9fc3-112c8cf885a8  A079DDCC  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 1d52414e-b0c5-4b88-bfe7-3a31fb3de8d1  A079E665  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 218b60a3-8842-46c6-aa8c-da8accddbdc0  A079E72E  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 21fd82c1-1b73-485e-ba25-7470afb02107  A084A74B  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 24163503-8933-4dbb-981f-e989a434797a  A084A5D7  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 314779bd-9f6e-48a3-af1b-44befea7a80d  A079E6CA  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 49e28d52-4d4a-4ba1-b4b9-06a2c762f174  A084A6DA  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 49fe2ff8-1c5b-449c-ad36-59db76ccf7bd  A084A6F8  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 4c15ec42-47d6-41d9-ac5c-c4d4f72281d7  A084A71C  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 5ef6323d-cf93-470f-a1af-e08ff3d8499e  A079E23A  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 5f77e327-5ee7-4ffb-b4d2-64e43a5deace  A084A7CB  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 5fc428cd-b8dd-4c5d-a73d-651511a99b4d  A079E567  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 6034f67e-9ab3-469b-a6c8-fd68a351c25b  A079E3D2  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 6e7d07fb-3c0e-4a65-af67-c37e4ab77926  A079E406  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 742519b0-6c73-4702-b639-4ef5c1e55059  A079DDE7  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 74943820-2472-4499-9fde-79cbf9bb47c6  A079E236  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 7504b7e2-e504-4731-9f62-9ef17b65c0b3  A079DDE6  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 7e457538-8ed2-4456-82a0-8a75b1a3909d  A079E6D4  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 8697afeb-8b55-46d4-be1b-19dc4b70913f  A079DE55  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 8bb29cfd-c542-4e9a-b262-4dd6c057fd8e  A084A788  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 914e9465-4ff5-4f0e-a47b-98644c9ea777  A079DF50  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 9351f49f-5b78-414f-8133-df2494971b90  A079DE0A  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 94b29db0-4561-4feb-9dd5-c3133fd503eb  A084A797  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 9bbd4071-e5e2-4342-b5eb-30e833bdb02a  A084A72C  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 b16a79c8-f12d-4539-bc0b-2053febd46da  A079E791  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 b7edc168-aed6-4838-b17e-a214829f0a87  A079E4A3  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 be20e6ac-fa80-4fa6-9e5a-c9fb07f44528  A079DF08  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 be493bbe-4b03-4bb9-a1be-52a2e5dd8609  A084A6CD  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 bfc27d86-f33a-4e0e-b7c1-3f6a981ade66  A079DFC2  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 c1a963e2-7b7c-4be6-bcae-028b55f23587  A079E56C  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 c45f6395-e3a9-4354-9ed2-ded65dc3dac4  A079DF28  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 ce9254ee-1941-4a78-bef0-5011ee996236  A084A789  1b96    WUS4C6432DSP3X3  0ca11342-b5ae-4332-98ae-c6bfcef65b1a  in service  active
 d5f6f296-dac8-485a-bd5b-d4db1a413e87  A079E3AD  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 dbe442e5-1479-4e5c-bcec-5c3763fbe57f  A079DF02  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active
 e6591ba3-1899-4756-8e24-c08d52ae244b  A079DFE3  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active
 f126346a-8777-4e5b-9feb-36252dca8033  A079E323  1b96    WUS4C6432DSP3X3  e9bc8867-7d1c-4e22-98d8-4155946379ac  in service  active
 fc1cb549-8968-47a5-801d-45356039f059  A084A780  1b96    WUS4C6432DSP3X3  48a3f2ac-3b35-46f6-850f-e5fc9c080c93  in service  active

Show unadopted disks via API before expunge

➜ oxide.rs git:(main) ✗ target/debug/oxide --profile recovery3 api '/v1/system/hardware/disks-unadopted'
{
"items": [],
"next_page": null
}

Expunge first disk in the list via omdb

root@oxz_switch1:~# omdb nexus sleds expunge-disk 03c736bc-fbe0-4a8c-9fc3-112c8cf885a8
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd79:27cc:fa5e:104::5]:12232
Error: This command is potentially destructive. Pass the `-w` / `--destructive` flag to allow it.
root@oxz_switch1:~# omdb nexus sleds expunge-disk 03c736bc-fbe0-4a8c-9fc3-112c8cf885a8 -w
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd79:27cc:fa5e:102::4]:12232
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd79:27cc:fa5e:103::3]:32221,[fd79:27cc:fa5e:102::3]:32221,[fd79:27cc:fa5e:104::3]:32221,[fd79:27cc:fa5e:101::3]:32221,[fd79:27cc:fa5e:104::4]:32221/omicron?sslmode=disable
note: database schema version matches expected (257.0.0)
WARNING: physical disk 03c736bc-fbe0-4a8c-9fc3-112c8cf885a8 is PRESENT in the most recent inventory collection (spotted at 2026-05-08 23:36:45.176376 UTC). Although expunging a running disk is supported, it is safer to expunge a disk from a system where it has been removed. Are you sure you want to proceed anyway?
y/N〉y
WARNING: This operation will PERMANENTLY and IRRECOVABLY mark physical disk 03c736bc-fbe0-4a8c-9fc3-112c8cf885a8 (A079DDCC) expunged. To proceed, type the physical disk's serial number.
disk serial number〉A079DDCC
expunged disk 03c736bc-fbe0-4a8c-9fc3-112c8cf885a8

Show unadopted disks after the expunge

➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery3 api '/v1/system/hardware/disks-unadopted'
{
  "items": [
    {
      "disk_id": {
        "model": "WUS4C6432DSP3X3",
        "serial": "A079DDCC",
        "vendor": "1b96"
      },
      "sled_id": "15b02624-172e-4983-b615-5113c9ba5b4f",
      "slot": 8,
      "variant": "u2"
    }
  ],
  "next_page": null
}

Request an adoption of the disk via API

➜  oxide.rs git:(main) ✗ cat add-disk.json
{"vendor": "1b96", "model": "WUS4C6432DSP3X3", "serial": "A079DDCC"}

➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery3 api '/v1/system/hardware/disk-adoption-request' --method PUT --input add-disk.json
{
  "disk_id": {
    "model": "WUS4C6432DSP3X3",
    "serial": "A079DDCC",
    "vendor": "1b96"
  },
  "id": "ca9fa9c2-540f-4ced-ad76-b4a78fe1663b",
  "time_created": "2026-05-08T23:47:47.962554Z"
}

Wait for disk to be adopted and list unadopted again via API

➜  oxide.rs git:(main) ✗ target/debug/oxide --profile recovery3 api '/v1/system/hardware/disks-unadopted'
{
  "items": [],
  "next_page": null
}

Show that the disk exists with a different ID in omdb:

 99358bcd-3758-469d-b655-58a39a1aa24e  A079DDCC  1b96    WUS4C6432DSP3X3  15b02624-172e-4983-b615-5113c9ba5b4f  in service  active

Show that there is a deleted adoption request in the DB

root@[fd79:27cc:fa5e:103::3]:32221/omicron> select * from physical_disk_adoption_request;
                   id                  | vendor |      model      |  serial  |         time_created          |         time_deleted
---------------------------------------+--------+-----------------+----------+-------------------------------+-------------------------------
  ca9fa9c2-540f-4ced-ad76-b4a78fe1663b | 1b96   | WUS4C6432DSP3X3 | A079DDCC | 2026-05-08 23:47:47.962554+00 | 2026-05-08 23:47:56.57633+00
(1 row)

Show that there are two rows for the given physical disk, with one active.

root@[fd79:27cc:fa5e:103::3]:32221/omicron> select * from physical_disk where serial = 'A079DDCC';
                   id                  |         time_created          |         time_modified         | time_deleted | rcgen | vendor |  serial  |      model      | variant |               sled_id                | disk_policy |   disk_state
---------------------------------------+-------------------------------+-------------------------------+--------------+-------+--------+----------+-----------------+---------+--------------------------------------+-------------+-----------------
  03c736bc-fbe0-4a8c-9fc3-112c8cf885a8 | 2026-05-08 23:20:12.122451+00 | 2026-05-08 23:38:25.224545+00 | NULL         |     1 | 1b96   | A079DDCC | WUS4C6432DSP3X3 | u2      | 15b02624-172e-4983-b615-5113c9ba5b4f | expunged    | decommissioned
  99358bcd-3758-469d-b655-58a39a1aa24e | 2026-05-08 23:47:56.567818+00 | 2026-05-08 23:47:56.567818+00 | NULL         |     1 | 1b96   | A079DDCC | WUS4C6432DSP3X3 | u2      | 15b02624-172e-4983-b615-5113c9ba5b4f | in_service  | active
(2 rows)

I still need to test the expunge sled path, which I will do after dinner.

@andrewjstone
Contributor Author

I forgot to delete the zpool before adding back the disk, and as expected we see:

00:56:21.703Z WARN SledAgent (ConfigReconcilerTask): Disk adoption failed
    disk_identity = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    error = Other error starting disk management: Observed Zpool with unexpected UUID (saw: ae4cf6c6-0cef-44d7-9812-baf47b469cd3, expected: 5ff1c86e-e07f-4d63-9bba-efa8833e9a06)
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:703

Then I removed the zpool via

zpool destroy oxp_ae4cf6c6-0cef-44d7-9812-baf47b469cd3

And, yay, successfully reconciled!

01:00:56.204Z INFO SledAgent (ConfigReconcilerTask): Disk already has a GPT
    file = sled-hardware/src/illumos/partitions.rs:217
    path = /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC0DC,0
01:00:56.229Z INFO SledAgent (ConfigReconcilerTask): GPT exists without Zpool: formatting zpool at /devices/pci@38,0/pci1de,fff9@1,2/pci1b96,0@0/blkdev@w0014EE81000BC0DC,0:a
    file = sled-hardware/src/disk.rs:385
01:00:56.229Z INFO SledAgent (ConfigReconcilerTask): Formatting zpool with requested ID
    file = sled-hardware/src/disk.rs:391
    id = 5ff1c86e-e07f-4d63-9bba-efa8833e9a06 (zpool)
01:00:56.407Z INFO SledAgent (ConfigReconcilerTask): Ensuring zpool has datasets
    disk_identity = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    file = sled-storage/src/dataset.rs:205
    zpool = External(5ff1c86e-e07f-4d63-9bba-efa8833e9a06 (external_zpool))
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): Loading latest secret
    disk_id = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    file = sled-storage/src/dataset.rs:263
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): Loaded latest secret
    disk_id = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    epoch = 1
    file = sled-storage/src/dataset.rs:265
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): Retrieving key
    disk_id = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    epoch = 1
    file = sled-storage/src/dataset.rs:270
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): Got key
    disk_id = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    epoch = 1
    file = sled-storage/src/dataset.rs:272
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): About to create keyfile
    file = sled-storage/src/keyfile.rs:30
    path = Keypath("/var/run/oxide/1b96-A079DDCC-WUS4C6432DSP3X3-zfs-aes-256-gcm.key")
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): Created keyfile
    file = sled-storage/src/keyfile.rs:35
    path = Keypath("/var/run/oxide/1b96-A079DDCC-WUS4C6432DSP3X3-zfs-aes-256-gcm.key")
01:00:56.425Z INFO SledAgent (ConfigReconcilerTask): Ensuring encrypted filesystem: oxp_5ff1c86e-e07f-4d63-9bba-efa8833e9a06/crypt for epoch 1
    file = sled-storage/src/dataset.rs:284
01:00:56.486Z INFO SledAgent (ConfigReconcilerTask): Zeroed and unlinked keyfile /var/run/oxide/1b96-A079DDCC-WUS4C6432DSP3X3-zfs-aes-256-gcm.key
    file = sled-storage/src/keyfile.rs:54
01:00:56.656Z INFO SledAgent (ConfigReconcilerTask): Finished ensuring zpool has datasets
    disk_identity = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    file = sled-storage/src/dataset.rs:346
    zpool = External(5ff1c86e-e07f-4d63-9bba-efa8833e9a06 (external_zpool))
01:00:56.656Z INFO SledAgent (ConfigReconcilerTask): Looking for unencrypted datasets in oxp_5ff1c86e-e07f-4d63-9bba-efa8833e9a06
    file = sled-storage/src/dataset.rs:394
01:00:56.678Z INFO SledAgent (ConfigReconcilerTask): Successfully started management of disk
    disk_identity = DiskIdentity { vendor: "1b96", model: "WUS4C6432DSP3X3", serial: "A079DDCC" }
    epoch = Some(Epoch(1))
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:695
01:00:56.689Z INFO SledAgent (ConfigReconcilerTask): Automatically archiving/wipe of dataset: oxp_5ff1c86e-e07f-4d63-9bba-efa8833e9a06/crypt/zone
    file = sled-agent/config-reconciler/src/reconciler_task/external_disks.rs:936
01:00:56.739Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:1057
    local_addr = [fd79:27cc:fa5e:104::1]:12345
    remote_addr = [fd79:27cc:fa5e:102::4]:56899
01:00:56.743Z INFO SledAgent (dropshot (SledAgent)): request completed
    file = /home/build/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.17.0/src/server.rs:874
root@271FVPY0:~#
