Conversation

@vamahaja
Member

@vamahaja vamahaja commented Nov 10, 2025

This PR introduces native integration with MAAS (Metal-as-a-Service) as a new provisioner backend for Teuthology.

Changes
- Implements a new maas.py provisioner module supporting the standard Teuthology interface
- Integrates with MAAS REST API using OAuth
- Handles machine allocation, deployment, status polling, and release workflows
- Compatible with existing configuration and supervisor selection logic
- Adds unit/mock tests

Configuration:
- New maas section in teuthology.yaml for configuring the API endpoint, credentials, and machine types (see the sketch below).
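
(For illustration, a minimal sketch of what such a section might look like; the key names here are assumptions for illustration rather than the PR's actual schema, for which see docs/siteconfig.rst:)

```yaml
maas:
  # Base URL of the MAAS region controller's 2.0 API
  api_url: http://maas.example.com:5240/MAAS/api/2.0/
  # OAuth1 API key as issued by MAAS (consumer:token:secret)
  api_key: consumer_key:token_key:token_secret
  # Teuthology machine types this MAAS instance serves
  machine_types:
    - trial
```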

Document - https://pad.ceph.com/p/Maas-Teuthology-Integration
Tracker - https://tracker.ceph.com/issues/72252

@vamahaja vamahaja requested a review from a team as a code owner November 10, 2025 09:56
@vamahaja vamahaja requested review from VallariAg and kamoltat and removed request for a team November 10, 2025 09:56
@deepssin deepssin requested review from kshtsk and zmc November 10, 2025 10:29
@kshtsk
Contributor

kshtsk commented Nov 10, 2025

Could you please add usage to https://github.com/ceph/teuthology/blob/main/docs/siteconfig.rst

@vamahaja
Member Author

> Could you please add usage to https://github.com/ceph/teuthology/blob/main/docs/siteconfig.rst

Done.

@vamahaja vamahaja force-pushed the maas-integration branch 2 times, most recently from 4cca149 to 707cf3b on November 12, 2025 14:31
@zmc zmc left a comment
Member

Thank you @vamahaja!

I haven't yet had a chance to test this, so this is a pretty early review - but I'd like to ask that you include a couple unit tests. I don't know that we need to do as much as we do for FOG, but having some basic tests would help us be confident we don't totally break this by accident in the future.
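
(For reference, the kind of basic mocked test being asked for might look roughly like this; the patch target and the `get_machine` helper are assumptions about the module's internals, not its actual API:)

```python
from unittest.mock import MagicMock, patch


@patch("teuthology.provision.maas.requests")
def test_get_machine_by_hostname(mock_requests):
    # Pretend the MAAS API returned exactly one matching machine.
    mock_requests.get.return_value = MagicMock(
        status_code=200,
        json=lambda: [{"hostname": "trial014", "system_id": "6mt6b8"}],
    )
    from teuthology.provision import maas
    machine = maas.get_machine("trial014")  # hypothetical helper
    assert machine["system_id"] == "6mt6b8"
```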

@vamahaja vamahaja force-pushed the maas-integration branch 4 times, most recently from 6ade81e to 5bb9b9f on November 13, 2025 11:04
@vamahaja vamahaja force-pushed the maas-integration branch 2 times, most recently from a91cb97 to ae96f25 on November 17, 2025 15:34
@vamahaja
Member Author

> I haven't yet had a chance to test this, so this is a pretty early review - but I'd like to ask that you include a couple unit tests. […]

@zmc added unit tests for most of the libraries. PTAL.

@vamahaja vamahaja force-pushed the maas-integration branch 4 times, most recently from 8643a31 to bdfd3df on November 18, 2025 16:47
@vamahaja
Member Author

> We're getting closer! Thanks so much for being receptive to feedback and being patient with re-reviews as we go through the lab transition. A few more points:
>
> All methods in the MAAS class that log should probably be using self.log, making messages like this: INFO:teuthology.provision.maas:Unexpected status ready to abort operation look more like this: INFO:teuthology.provision.maas.trial014:Unexpected status ready to abort operation (note the short hostname being mentioned)
>
> I noticed that some error messages looked funny, like this being logged on a line by itself with no timestamp or other context: ('MaaS has no %s image', 'ubuntu/jammy') I made a note in a couple places where you raise RuntimeError("%s", value), because those seem to not behave as you're expecting, but I didn't want to add comments to every single one even though they should all be corrected to be passed a single f-string.
>
> I also noticed that teuthology-supervisor was writing requests_oauthlib debug log messages to the supervisor log, which could potentially expose short-lived secrets; if you could turn down logging just for that library that would be helpful. I would suggest adding it here: https://github.com/ceph/teuthology/blob/main/teuthology/__init__.py#L54-L62

Thanks @zmc for all your comments. Updated all the log messages and changed the log object to use the MAAS instance. Updated teuthology/__init__.py to suppress the extra logging.
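
(For reference, a minimal sketch of both fixes; `require_image` is a hypothetical helper, and the image name comes from the example in the review above:)

```python
import logging

# Quiet requests_oauthlib so OAuth debug output (and any short-lived
# secrets) stays out of the supervisor log; teuthology/__init__.py is
# where zmc suggests adding this.
logging.getLogger("requests_oauthlib").setLevel(logging.WARNING)


def require_image(available: set, wanted: str) -> None:
    # RuntimeError("%s", wanted) stores a tuple as the exception args
    # instead of interpolating, hence output like
    # ('MaaS has no %s image', 'ubuntu/jammy'). An f-string produces
    # the intended single message:
    if wanted not in available:
        raise RuntimeError(f"MaaS has no {wanted} image")
```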

@vamahaja vamahaja requested a review from zmc December 10, 2025 13:26
@djgalloway
Contributor

Hey @vamahaja. Thanks so much for your work so far on this. I found a couple more issues today.

  1. The plugin should reimage each machine for every job regardless of what the existing OS on it is
/home/teuthworker/archive/dgalloway-2025-12-11_01:04:12-smoke-tentacle-distro-default-trial# cat 73/supervisor.73.log | grep trial024
2025-12-11T01:04:18.783 INFO:teuthology.lock.ops:Start node 'trial024.front.sepia.ceph.com' reimaging
2025-12-11T01:04:18.783 INFO:teuthology.lock.ops:Updating [trial024.front.sepia.ceph.com]: reset os type and version on server
2025-12-11T01:04:18.783 INFO:teuthology.lock.ops:Updating trial024.front.sepia.ceph.com on lock server
2025-12-11T01:04:18.904 DEBUG:requests_oauthlib.oauth1_auth:Updated url: http://soko02.front.sepia.ceph.com:5240/MAAS/api/2.0//machines/?hostname=trial024
2025-12-11T01:04:20.177 DEBUG:requests_oauthlib.oauth1_auth:Updated url: http://soko02.front.sepia.ceph.com:5240/MAAS/api/2.0//machines/?hostname=trial024
2025-12-11T01:04:21.357 INFO:teuthology.provision.maas.trial024:Machine 'trial024' is deployed with OS type 'ubuntu' and version '22.04'
2025-12-11T01:04:21.357 INFO:teuthology.provision.maas.trial024:Locking machine 'trial024' as OS requirement are already met
2025-12-11T01:04:21.407 ERROR:teuthology.provision.maas.trial024:Got status 409 from /machines/6mt6b8/op-lock: 'Machine is already locked'
2025-12-11T01:04:21.434 ERROR:teuthology.lock.ops:Refusing to unlock trial024 since it has an active job: dgalloway-2025-12-11_01:04:12-smoke-tentacle-distro-default-trial/73
  2. Not certain if this is a bug specific to this plugin. Maybe? See lock.ops.unlock_one_safe: Invert run-match logic #2036
2025-12-10T21:48:58.913 ERROR:teuthology.dispatcher.supervisor:Reimaging error. Unlocking machines...
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/dispatcher/supervisor.py", line 226, in reimage
    reimaged = lock_ops.reimage_machines(ctx, targets, job_config['machine_type'])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/lock/ops.py", line 359, in reimage_machines
    with teuthology.parallel.parallel() as p:
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/parallel.py", line 84, in __exit__
    for result in self:
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/parallel.py", line 98, in __next__
    resurrect_traceback(result)
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/parallel.py", line 30, in resurrect_traceback
    raise exc.exc_info[1]
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/parallel.py", line 23, in capture_traceback
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/provision/__init__.py", line 49, in reimage
    result = obj.create()
             ^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/provision/maas.py", line 360, in create
    self._verify_installed_os()
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/provision/maas.py", line 426, in _verify_installed_os
    if self.remote.os != wanted_os:
       ^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/orchestra/remote.py", line 335, in os
    os_release = self.sh('cat /etc/os-release').strip()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/orchestra/remote.py", line 97, in sh
    proc = self.run(**kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/teuthworker/src/github.com_ceph_teuthology_596e0a42f088a62e6292d6586fccd535f86076d4/teuthology/orchestra/remote.py", line 574, in run
    raise ConnectionError(f'Failed to reconnect to {self.shortname}')
ConnectionError: Failed to reconnect to trial036
2025-12-10T21:48:59.308 INFO:teuthology.dispatcher.supervisor:Unlocking machines...
2025-12-10T21:48:59.571 INFO:teuthology.orchestra.remote:Trying to reconnect to host 'ubuntu@trial024.front.sepia.ceph.com'
2025-12-10T21:48:59.571 DEBUG:teuthology.orchestra.connection:{'hostname': 'trial024.front.sepia.ceph.com', 'username': 'ubuntu', 'timeout': 60}
2025-12-10T21:48:59.582 DEBUG:teuthology.orchestra.remote:[Errno 13] Permission denied: '/home/teuthworker/.ssh/id_ed25519'
2025-12-10T21:48:59.582 WARNING:teuthology.contextutil:'reconnect to trial024' reached maximum tries (4) after waiting for 30 seconds
2025-12-10T21:48:59.607 ERROR:teuthology.lock.ops:Refusing to unlock trial024 since it has an active job: dgalloway-2025-12-10_21:43:11-smoke-tentacle-distro-default-trial/68
2025-12-10T21:48:59.621 ERROR:teuthology.lock.ops:Refusing to unlock trial036 since it has an active job: dgalloway-2025-12-10_21:43:11-smoke-tentacle-distro-default-trial/68
2025-12-10T21:48:59.621 DEBUG:teuthology.parallel:result is False
2025-12-10T21:48:59.636 ERROR:teuthology.lock.ops:Refusing to unlock trial089 since it has an active job: dgalloway-2025-12-10_21:43:11-smoke-tentacle-distro-default-trial/68
2025-12-10T21:48:59.637 DEBUG:teuthology.parallel:result is False
2025-12-10T21:48:59.637 DEBUG:teuthology.report:Pushing job info to https://paddles.front.sepia.ceph.com/

@djgalloway
Contributor

djgalloway commented Dec 11, 2025

Another thing.. When this code unlocks machines from a teuthology perspective, it unlocks the machine in MaaS but still leaves it in "Deployed" state. It needs to also POST /MAAS/api/2.0/machines/op-release

EDIT: and not return until the machine leaves Releasing state
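
(A sketch of that release-and-wait flow, assuming a requests Session already authenticated with the MAAS OAuth credentials; the endpoint path follows the comment above and the helper shape is illustrative, not the PR's actual code:)

```python
import time

import requests


def release_and_wait(session: requests.Session, base_url: str,
                     system_id: str, timeout: int = 600,
                     interval: int = 10) -> None:
    # Ask MAAS to release the machine back to the pool.
    session.post(f"{base_url}/machines/{system_id}/op-release").raise_for_status()
    # Don't return until the machine has left the Releasing state.
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = session.get(f"{base_url}/machines/{system_id}/")
        if resp.json().get("status_name") != "Releasing":
            return
        time.sleep(interval)
    raise RuntimeError(f"Machine {system_id} stuck in Releasing for {timeout}s")
```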

if (self.os_type, self.os_version, os_type, os_version).count(None) == 4:
    raise RuntimeError(f"Unable to find OS details for machine {name}")

if ((os_type and os_version) and
Member

Why are we even checking the existing os, much less warning if it's different than what's requested? The point of this code is to ensure that the system has the requested os/version deployed

Member Author

The earlier thought process was to skip deployment when the machine already has the requested OS.

Changed the approach now: the machine will be deployed each time, with the OS defaulting to Ubuntu 22.04.

Contributor

For perspective, we used to just leave the testnodes installed and not reimage them every time, but it became impossible to ensure a sterile environment for each teuthology job. So now we reimage before every job unconditionally. Thanks.

"""Reimage machines with https://maas.io"""

def __init__(
self, name: str, os_type: Optional[str] = None, os_version: Optional[str] = None
Member

Are os_type/os_version really optional? Is that a call to "reprovision how you already are"?

Member Author

Now set the default to Ubuntu 22.04.
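
(i.e., presumably a signature along these lines; a sketch with the defaults taken from the comment above:)

```python
def __init__(
    self,
    name: str,
    os_type: str = "ubuntu",     # default OS per the comment above
    os_version: str = "22.04",
):
    self.name = name
    self.os_type = os_type
    self.os_version = os_version
```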

     fog_types = fog.get_types()
-    if machine_type in pelagos_types and machine_type in fog_types:
+    maas_types = maas.get_types()
+    if (machine_type in pelagos_types and
Member

this is no longer the correct test; it should be "only in one of the three _types lists"

Member Author

@dmick, can you please share more context?

Member

What I mean is: now the test is for "is it present in all three provisioners' types", when what it should be is "is it present in more than one".
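
(A sketch of the "claimed by more than one provisioner" check dmick describes; the *_types names mirror the snippet above, and the error wording is illustrative:)

```python
provisioner_types = {
    "pelagos": pelagos_types,
    "fog": fog_types,
    "maas": maas_types,
}
# A machine type should belong to at most one provisioner's list.
owners = [name for name, types in provisioner_types.items()
          if machine_type in types]
if len(owners) > 1:
    raise RuntimeError(
        f"machine type {machine_type!r} is claimed by multiple "
        f"provisioners: {', '.join(owners)}"
    )
```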

f"Current status: {data.get('status_name')}",
)

def release_machine(self, erase: bool = True) -> None:
Member

is erase = True really the right default? It'll take more time, and given that we're redeploying the OS, does it accomplish anything?

Member Author

Yes, it takes time and isn't required for each operation. Set the default to False.

elif data := resp.json():
    if data.get("locked"):
        raise RuntimeError(
            f"Machine '{self.shortname}' locking failed, "
Member

"unlocking failed"

Member Author

fixed

@djgalloway
Contributor

Feature request: set the machine's description to the job name when locking
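
(A sketch of the requested behavior; whether the MAAS machine-update call accepts a description field this way is an assumption, and the `_put`/`_post` helpers are hypothetical:)

```python
def lock(self, job_name: str) -> None:
    # Record the owning job on the machine, then lock it.
    self._put(f"/machines/{self.system_id}/", data={"description": job_name})
    self._post(f"/machines/{self.system_id}/op-lock")
```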

@vamahaja vamahaja force-pushed the maas-integration branch 2 times, most recently from c30ecc4 to e9019d7 on December 17, 2025 06:12
@vamahaja
Member Author

> Hey @vamahaja. Thanks so much for your work so far on this. I found a couple more issues today. […]

@djgalloway thanks, this made the class deployment logic a lot easier.

@vamahaja
Member Author

> Another thing.. When this code unlocks machines from a teuthology perspective, it unlocks the machine in MaaS but still leaves it in "Deployed" state. It needs to also POST /MAAS/api/2.0/machines/op-release
>
> EDIT: and not return until the machine leaves Releasing state

Updated the logic. The unlock operation will now move the machine from Ready to Released state.

@vamahaja vamahaja force-pushed the maas-integration branch 4 times, most recently from 2d7f162 to 8d91ade on December 17, 2025 09:47
@djgalloway
Contributor

> Updated the logic. The unlock operation will now move the machine from Ready to Released state.

Wait, that seems backwards. There are two layers of ownership a machine can be set to in MaaS. There will be one of, Allocated, Deploying, Deployed, Releasing. The healthy state of a machine that is ready for a new job is Ready.

Then there is Locked/Unlocked which is an additional layer on top of any of those statuses.

teuthology-lock should:
Lock - Allocate, Deploy, Lock
Unlock - Release, WAIT until no longer in Releasing, then Unlock

@vamahaja
Member Author

> teuthology-lock should: Lock - Allocate, Deploy, Lock. Unlock - Release, WAIT until no longer in Releasing, then Unlock.

@djgalloway only deployed machines can be locked or unlocked.

Then teuthology-lock should:
Lock - Deploy (the machine gets deployed from the Allocated state), then Lock
Unlock - Unlock, Release, WAIT until no longer in Releasing, then Allocate

@djgalloway
Contributor

> Unlock - Unlock, Release, WAIT until no longer in Releasing, then Allocate

Almost. Unlock, Release, Releasing, Ready. If the machine gets left in Allocated state, other folks can't lock or unlock using their own API key.
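
(Putting the agreed sequence into pseudocode; the machine helpers are hypothetical stand-ins for the provisioner's methods:)

```python
import time


def teuthology_unlock(machine) -> None:
    machine.unlock()      # drop the MAAS lock first
    machine.release()     # then release the deployment
    while machine.status() == "Releasing":
        time.sleep(10)    # wait out the Releasing state
    # Healthy end state is Ready, so other users' API keys can manage it.
    assert machine.status() == "Ready"
```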

@vamahaja
Member Author

> Almost. Unlock, Release, Releasing, Ready. If the machine gets left in Allocated state, other folks can't lock or unlock using their own API key.

In the current logic, the machine ends up in the Ready state after being unlocked.

@djgalloway
Contributor

> In the current logic, the machine ends up in the Ready state after being unlocked.

I see now you've got https://github.com/ceph/teuthology/pull/2105/files#diff-92717a0d2d3124f0b165253a8a99be36ee635e13a28fb33fa11f6a3e78ac081dR339. 👍

@djgalloway
Contributor

teuthworker@soko04:~/mnt/teuthology/dgalloway-2025-12-19_21:52:33-smoke-tentacle-distro-default-trial/866$ grep trial111 supervisor.866.log 
2025-12-19T21:59:14.187 INFO:teuthology.lock.ops:Start node 'trial111.front.sepia.ceph.com' reimaging
2025-12-19T21:59:14.187 INFO:teuthology.lock.ops:Updating [trial111.front.sepia.ceph.com]: reset os type and version on server
2025-12-19T21:59:14.187 INFO:teuthology.lock.ops:Updating trial111.front.sepia.ceph.com on lock server
2025-12-19T21:59:14.261 DEBUG:requests_oauthlib.oauth1_auth:Updated url: http://soko02.front.sepia.ceph.com:5240/MAAS/api/2.0//machines/?hostname=trial111
2025-12-19T21:59:14.916 DEBUG:requests_oauthlib.oauth1_auth:Updated url: http://soko02.front.sepia.ceph.com:5240/MAAS/api/2.0//machines/?hostname=trial111
2025-12-19T21:59:15.366 INFO:teuthology.provision.maas.trial111:Deploying machine with os type 'centos' and version '9.stream'
2025-12-19T21:59:15.400 ERROR:teuthology.provision.maas.trial111:Error during deployment of machine 'trial111', aborting deployment
2025-12-19T21:59:15.401 DEBUG:requests_oauthlib.oauth1_auth:Updated url: http://soko02.front.sepia.ceph.com:5240/MAAS/api/2.0//machines/?hostname=trial111
2025-12-19T21:59:16.205 INFO:teuthology.provision.maas.trial111:Cannot abort machine in 'ready' state; skipping abort operation.
2025-12-19T21:59:16.205 INFO:teuthology.provision.maas.trial111:Waiting for machine 'trial111' with system_id '7b8xa4' to reach status 'Ready'
2025-12-19T21:59:16.207 DEBUG:requests_oauthlib.oauth1_auth:Updated url: http://soko02.front.sepia.ceph.com:5240/MAAS/api/2.0//machines/?hostname=trial111
2025-12-19T21:59:16.848 INFO:teuthology.provision.maas:MaaS machine system 'trial111' with system_id '7b8xa4' reached status 'Ready'
RuntimeError: Deployment of machine 'trial111' failed
2025-12-19T21:59:17.378 ERROR:teuthology.lock.ops:Refusing to unlock trial111 since it has an active job: dgalloway-2025-12-19_21:52:33-smoke-tentacle-distro-default-trial/866

Unclear what happened here but nothing was actually done to it in MaaS according to MaaS' logs.

@vamahaja
Member Author

> Unclear what happened here but nothing was actually done to it in MaaS according to MaaS' logs.

Looks like an issue with the teuthology lock operation. One of the jobs is already occupying the machine on which the deployment operation was performed:

2025-12-19T21:59:17.378 ERROR:teuthology.lock.ops:Refusing to unlock trial111 since it has an active job: dgalloway-2025-12-19_21:52:33-smoke-tentacle-distro-default-trial/866

deepssin added a commit to deepssin/teuthology that referenced this pull request Dec 24, 2025
Add user-data template for MAAS-provisioned CentOS 9 Stream machines.
This template configures cloud-init to:
- Install required packages (python3, wget, git, chrony, grubby, bzip2)
- Clean up the conflicting user with uid 1000, since ansible sets it

The bzip2 package is required for workunit tests that extract
.tar.bz2 archives.

Depends on: ceph#2105

Signed-off-by: deepssin <deepssin@redhat.com>
Signed-off-by: Vaibhav Mahajan <vaibhavsm04@gmail.com>
@djgalloway
Contributor

I don't think this correctly handles machines that are in state Deploying when a job is killed. To reproduce:

  1. Schedule a suite
  2. When the machines are in the Deploying state and the job is in Waiting
  3. Kill the job. Observe all jobs get marked dead immediately.

All deployments continue and the lock logic actually flips the jobs from Dead to Running in some cases.
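
(One possible shape for handling this, hedged since the thread doesn't settle on a fix; the helpers are hypothetical:)

```python
def on_job_kill(machine) -> None:
    # If the kill arrives while MAAS is still deploying, abort the
    # in-flight deployment so a late success can't flip the job back
    # to Running, then follow the normal unlock path.
    if machine.status() == "Deploying":
        machine.abort()                      # MAAS abort operation
        machine.wait_until_not("Deploying")
    machine.unlock_and_release()             # normal unlock: Release -> Ready
```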
