[Lib] Add MAAS (Metal-as-a-Service) provisioner #2105
Could you please add usage to https://github.com/ceph/teuthology/blob/main/docs/siteconfig.rst
done
zmc left a comment:
Thank you @vamahaja!
I haven't yet had a chance to test this, so this is a pretty early review - but I'd like to ask that you include a couple unit tests. I don't know that we need to do as much as we do for FOG, but having some basic tests would help us be confident we don't totally break this by accident in the future.
@zmc added unit tests for most of the libraries. PTAL.
Thanks @zmc for all your comments. Updated all the log messages and changed the log object to use the MAAS instance.
Hey @vamahaja. Thanks so much for your work so far on this. I found a couple more issues today.
Another thing: when this code unlocks machines from a teuthology perspective, it unlocks the machine in MaaS but still leaves it in the "Deployed" state. It needs to also release the machine (EDIT: and not return until the machine leaves the Releasing state).
teuthology/provision/maas.py
Outdated
if (self.os_type, self.os_version, os_type, os_version).count(None) == 4:
    raise RuntimeError(f"Unable to find OS details for machine {name}")

if ((os_type and os_version) and
Why are we even checking the existing os, much less warning if it's different than what's requested? The point of this code is to ensure that the system has the requested os/version deployed
The earlier thinking was to skip deployment when the machine already has the requested OS.
Changed the approach: the machine now gets deployed every time, and the OS defaults to Ubuntu 22.04.
For perspective, we used to just leave the testnodes installed and not reimage them every time, but it became impossible to ensure a sterile environment for each teuthology job. So now we reimage before every job unconditionally. Thanks.
teuthology/provision/maas.py
Outdated
"""Reimage machines with https://maas.io"""

def __init__(
    self, name: str, os_type: Optional[str] = None, os_version: Optional[str] = None
Are os_type/os_version really optional? Is that a call to "reprovision how you already are"?
The default is now set to Ubuntu 22.04.
fog_types = fog.get_types()
if machine_type in pelagos_types and machine_type in fog_types:
maas_types = maas.get_types()
if (machine_type in pelagos_types and
this is no longer the correct test; it should be "only in one of the three _types lists"
@dmick can you please share more context?
what I mean is, now the test is for "is it present in all three provisioners' types", when what it should be is "is it present in more than one"
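A minimal sketch of the "present in more than one" check described in this comment. The function name `pick_provisioner` and the mapping argument are hypothetical, chosen for illustration; the actual fix in the PR may be structured differently.

```python
from typing import Iterable, Mapping, Optional


def pick_provisioner(
    machine_type: str,
    types_by_provisioner: Mapping[str, Iterable[str]],
) -> Optional[str]:
    """Return the single provisioner that handles machine_type, if any.

    types_by_provisioner is a hypothetical mapping such as
    {"pelagos": pelagos_types, "fog": fog_types, "maas": maas_types}.
    """
    # Collect every provisioner whose types list contains the machine type.
    claimed = [
        name
        for name, types in types_by_provisioner.items()
        if machine_type in types
    ]
    # The error condition is "more than one", not "all of them".
    if len(claimed) > 1:
        raise RuntimeError(
            f"machine_type {machine_type!r} is claimed by more than one "
            f"provisioner: {', '.join(claimed)}"
        )
    return claimed[0] if claimed else None
```

Counting the matches, rather than chaining `and` across every list, keeps the check correct if another provisioner is added later.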
teuthology/provision/maas.py
Outdated
    f"Current status: {data.get('status_name')}",
)

def release_machine(self, erase: bool = True) -> None:
is erase = True really the right default? It'll take more time, and given that we're redeploying the OS, does it accomplish anything?
Yes, it takes time and is not required for every operation. The default is now set to False.
teuthology/provision/maas.py
Outdated
elif data := resp.json():
    if data.get("locked"):
        raise RuntimeError(
            f"Machine '{self.shortname}' locking failed, "
"unlocking failed"
fixed
Feature request: set the machine's description to the job name when locking.
@djgalloway thanks, this made class deployment a lot easier.
Updated logic. Unlock operation will move machine from
Wait, that seems backwards. There are two layers of ownership a machine can be set to in MaaS. First there is the status, which will be one of Allocated, Deploying, Deployed, or Releasing. The healthy state of a machine that is ready for a new job is Ready. Then there is Locked/Unlocked, which is an additional layer on top of any of those statuses.
@djgalloway only deployed machines can be locked or unlocked.
Almost. Unlock, Release, Releasing, Ready. If the machine gets left in the Allocated state, other folks can't lock or unlock it using their own API key.
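The sequence above (Unlock, Release, Releasing, Ready) could be sketched roughly as follows. This is an illustration only: the `Machine` protocol and its `unlock`, `release`, and `status` methods are hypothetical stand-ins, not the PR's actual class or the MAAS client API, and the timeout values are arbitrary.

```python
import time
from typing import Callable, Protocol


class Machine(Protocol):
    """Hypothetical interface; the real provisioner class may differ."""

    def unlock(self) -> None: ...
    def release(self) -> None: ...
    def status(self) -> str: ...


def unlock_and_release(
    machine: Machine,
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Unlock, release, and block until the machine reports Ready."""
    machine.unlock()   # remove the lock layer first
    machine.release()  # then hand the machine back to the pool
    deadline = time.monotonic() + timeout
    while True:
        status = machine.status()
        if status == "Ready":
            return
        # Any other status (typically "Releasing") means keep waiting.
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"machine still in state {status!r} after {timeout}s"
            )
        sleep(interval)
```

Returning only once the status is Ready means a caller never sees a machine that is unlocked but still Deployed or Allocated, which is the situation described in the comments above.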
In current logic, machine gets into
I see now you've got https://github.com/ceph/teuthology/pull/2105/files#diff-92717a0d2d3124f0b165253a8a99be36ee635e13a28fb33fa11f6a3e78ac081dR339. 👍
Unclear what happened here, but nothing was actually done to it in MaaS according to MaaS' logs.
Looks like an issue with the teuthology lock operation. One of the jobs is already occupying the machine on which the deployment operation is performed -
Add user-data template for MAAS-provisioned CentOS 9 Stream machines.
This template configures cloud-init to:
- Install required packages (python3, wget, git, chrony, grubby, bzip2)
- Clean up conflicting user with uid 1000 since ansible sets it
The bzip2 package is required for workunit tests that extract .tar.bz2 archives.
Depends on: ceph#2105
Signed-off-by: deepssin <deepssin@redhat.com>
Signed-off-by: Vaibhav Mahajan <vaibhavsm04@gmail.com>
I don't think this correctly handles machines that are in state Deploying when a job is killed. To reproduce:
All deployments continue and the lock logic actually flips the jobs from Dead to Running in some cases.
This PR introduces native integration with MAAS (Metal-as-a-Service) as a new provisioner backend for Teuthology.
Changes
- Implements a new maas.py provisioner module supporting the standard Teuthology interface
- Integrates with MAAS REST API using OAuth
- Handles machine allocation, deployment, status polling, and release workflows
- Compatible with existing configuration and supervisor selection logic
- unit/mock tests
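
For context on the OAuth item above, here is a rough sketch of how a MAAS API key can be turned into an OAuth 1.0 `Authorization` header. It assumes the usual MAAS key layout of three colon-separated parts (consumer key, token key, token secret) and the PLAINTEXT signature method with an empty consumer secret. The function name `maas_oauth_header` is hypothetical, and the PR's own implementation is not shown here and may use an OAuth library instead.

```python
import time
import uuid
from urllib.parse import quote


def maas_oauth_header(api_key: str) -> str:
    """Build an OAuth 1.0 PLAINTEXT Authorization header value.

    Assumes api_key has the form "consumer_key:token_key:token_secret".
    """
    parts = api_key.split(":")
    if len(parts) != 3:
        raise ValueError("expected a key of the form consumer:token:secret")
    consumer_key, token_key, token_secret = parts
    params = {
        "oauth_version": "1.0",
        "oauth_signature_method": "PLAINTEXT",
        "oauth_consumer_key": consumer_key,
        "oauth_token": token_key,
        # PLAINTEXT signature is "<consumer_secret>&<token_secret>";
        # the consumer secret is assumed to be empty here.
        "oauth_signature": f"&{token_secret}",
        "oauth_nonce": uuid.uuid4().hex,
        "oauth_timestamp": str(int(time.time())),
    }
    # Percent-encode each value before placing it in the header.
    return "OAuth " + ", ".join(
        f'{name}="{quote(value, safe="")}"' for name, value in params.items()
    )
```

The returned string would be sent as the value of the `Authorization` request header, and the key itself would come from the site configuration rather than being hard-coded.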
Configuration:
- New maas section in teuthology.yaml, allowing API endpoint, credentials, and machine types to be configured.
Document - https://pad.ceph.com/p/Maas-Teuthology-Integration
Tracker - https://tracker.ceph.com/issues/72252