Conversation

@sjpb (Collaborator) commented Dec 10, 2025

Fixes an issue where the sssd and sshd roles fail for compute-init, as their export.yml tasks write files to the cluster share readable only by root. As the share is root-squashed, these files cannot be retrieved by ansible-init on compute node boot. Other files were written by the slurm user, but this is not considered appropriate for sensitive files.

The fix is to create a new "ansible-init" user and use it for all writes and reads to/from the cluster share.
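
For illustration only, a minimal sketch of the "use it for all writes and reads" part, in Ansible YAML. The task names, share paths and use of become_user below are assumptions rather than the appliance's actual code; the idea is that exported files are owned by ansible-init with mode 0600, and retrieval on compute node boot runs as that user rather than as (squashed) root:

# Hypothetical export-side task: write a sensitive file to the cluster share,
# owned by the dedicated user and readable only by it.
- name: Export sssd config to the cluster share
  ansible.builtin.copy:
    src: /etc/sssd/sssd.conf
    dest: /exports/cluster/hostconfig/sssd.conf   # assumed share path
    owner: ansible-init
    group: ansible-init
    mode: "0600"
    remote_src: true

# Hypothetical retrieval-side task for compute node boot: read the file as the
# ansible-init user, since root is squashed on the NFS share and cannot read it.
- name: Retrieve sssd config from the cluster share
  ansible.builtin.slurp:
    src: /mnt/cluster/hostconfig/sssd.conf        # assumed mount point on the node
  become: true
  become_user: ansible-init
  register: _sssd_conf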

@sjpb (Collaborator, Author) commented Dec 11, 2025

Image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20135647750
(cancelled due to merge from main)

@sjpb (Collaborator, Author) commented Dec 11, 2025

@sjpb (Collaborator, Author) commented Dec 12, 2025

Linting passed, cancelled other CI.
Image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20163910479

@sjpb (Collaborator, Author) commented Dec 12, 2025

Weirdly, it got stuck on the system users task, and the ansible-init user appears to have a gid of 100!

@sjpb (Collaborator, Author) commented Dec 12, 2025

Confirmed the problem was during the fatimage build; on boot, the above image had:

[rocky@RL9-login-0 ~]$ id ansible-init
uid=301(ansible-init) gid=100(users) groups=100(users)

Did some local testing to arrive at the above change.
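
The gid=100(users) above suggests the account was created without a dedicated group, so it fell back to the default group. Purely as a hedged sketch of the kind of change that pins the gid (301 matches the uid seen in this thread; this is not necessarily the actual commit), the group can be created explicitly and referenced from the user task:

- name: Create ansible-init group with a fixed gid
  ansible.builtin.group:
    name: ansible-init
    gid: 301
    system: true

- name: Create ansible-init user in that group
  ansible.builtin.user:
    name: ansible-init
    uid: 301
    group: ansible-init   # without an explicit group the account can land in gid 100 ("users")
    system: true
    shell: /sbin/nologin
    create_home: false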

Image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20171889414

@sjpb (Collaborator, Author) commented Dec 12, 2025

For some reason the image from the build linked above didn't even get the ansible-init user. Tried a local build to figure out why; that looked OK.

Trying the build again in case there was some GitHub weirdness: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20173380585.

OK, checked during the build:

[rocky@openhpc-rl9-251212-1638-d12e658a ~]$ id ansible-init
uid=301(ansible-init) gid=301(ansible-init) groups=301(ansible-init)

Presumably the runner didn't manage to pull the correct commit for some reason?!

@sjpb (Collaborator, Author) commented Dec 12, 2025

Looks like it's got stuck again at the "Add system users" task. However, logging into a CI VM, that user at least has the right gid now:

[rocky@slurmci-RL8-450-control ~]$ id ansible-init
uid=301(ansible-init) gid=301(ansible-init) groups=301(ansible-init)

So maybe somehow there were two problems??

Last ansible entry in syslog is:

Dec 12 18:19:01 slurmci-RL8-450-control platform-python[7888]: ansible-ansible.builtin.user Invoked with name=ansible-init comment=ansible-init user uid=301 create_home=False shell=/sbin/nologin system=True state=present non_unique=False force=False remove=False move_home=False append=False ssh_key_bits=0 ssh_key_type=rsa ssh_key_comment=ansible-generated on slurmci-RL8-450-control.slurmci-RL8-450.internal update_password=always group=None groups=None home=None password=NOT_LOGGING_PARAMETER login_class=None password_expire_max=None password_expire_min=None password_expire_warn=None hidden=None seuser=None skeleton=None generate_ssh_key=None ssh_key_file=None ssh_key_passphrase=NOT_LOGGING_PARAMETER expires=None password_lock=None local=None profile=None authorization=None role=None umask=None

This is interesting; I'm not sure I expected to see a homedir:

ansible-init:x:301:301:ansible-init user:/home/ansible-init:/sbin/nologin

Maybe it is due to create_home: false, with /home being on local disk during the image build but an NFS share in the cluster, which causes problems?

Edit:
OK, I cannot reproduce this locally; using the same image as CI, running site.yml does not hang at that point 😢
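
Purely as a sketch of the hypothesis above, and not necessarily the change made in this PR: giving the account an explicit home outside /home would keep both the passwd entry and the user module away from the NFS-mounted /home in the cluster (the path below is an assumption):

- name: Add system users
  ansible.builtin.user:
    name: ansible-init
    uid: 301
    group: ansible-init
    system: true
    shell: /sbin/nologin
    create_home: false
    home: /var/lib/ansible-init   # assumed path; /home is local disk in the build but NFS in the cluster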

@sjpb (Collaborator, Author) commented Dec 12, 2025

Above "fix" really needs a new image building too, else it'll behave differently on the next image build but I'm going to let CI run to see if that also hangs ...

Image build: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/20178697087

@sjpb (Collaborator, Author) commented Dec 12, 2025

OK, the above CI did get past the point where it had hung, so going to push a new image ..
