Fix #334, add some doc. on how to replace the Manager in case of failure #335
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,6 +4,15 @@ | |
| Production | ||
| ********** | ||
|
|
||
| Unlike the development environment, which uses pre-configured Vagrant virtual | ||
| machines, production machines require some preliminary tasks before the | ||
| provisioning procedure can complete successfully. You have to configure the | ||
| network interfaces and the disk partitions of the machines to be provisioned, | ||
| and install the required operating system on them (CentOS 6.8 for the machines | ||
| running ACS, CentOS 7.2 for storage). Without these preliminary tasks, the | ||
| provisioning procedure will most likely fail. | ||
|
|
||
| Machines deployment | ||
| =================== | ||
| To deploy the system in production, you have to specify a *cluster* of machines, | ||
|
|
@@ -74,3 +83,51 @@ tag you want to install on the machines: | |
| argument from both the ``discos-deploy`` and ``discos-get`` scripts. If you | ||
| pass the ``--station`` argument anyway and it does not match the correct | ||
| station, you will receive an error and the procedure will stop. | ||
|
|
||
| Replace the Manager in case of failure | ||
| -------------------------------------- | ||
| If the Manager machine suffers a failure, it has to be replaced. The first | ||
| step is to put the new Manager's IP address in the Ansible inventory's hosts | ||
| file and perform the provisioning procedure on a newly installed machine. If | ||
| the DISCOS control system runs in a distributed environment (as it does at | ||
| the SRT and Medicina stations), some manual tweaks on the other DISCOS | ||
| machines are also necessary for the whole system to behave correctly. | ||
|
|
||
| The tweaks to be performed in order for the DISCOS control system to work as | ||
| expected are the following: | ||
|
|
||
| - Replace the old ACS Manager IP address with the new one in the | ||
| ``/discos-sw/config/misc/bash_profile`` file on the ``discos-console`` | ||
| machine. It is stored in an environment variable called ``MNG_IP``. | ||
| - Replace the old Manager IP address with the new one in some files in the | ||
| DISCOS CDB. More specifically, one file has to be corrected for the | ||
| control system to communicate properly with the ``TotalPower`` backend. | ||
| You can find this file in the repository of the currently deployed | ||
| release of DISCOS, at the path | ||
| ``SRT/Configuration/CDB/alma/BACKENDS/TotalPower/TotalPower.xml``. | ||
| The variable to be corrected is called ``DataIPAddress``. This has to be | ||
| done on the new Manager machine itself before launching the control | ||
| system. | ||
| - Make sure that all the station systems and machines accept incoming | ||
| connections from the newly allocated Manager's IP address. Specifically, the | ||
| ``TotalPower`` backend and the ``CalMux`` machines have to be tweaked to | ||
| allow them to be controlled by the new Manager. | ||
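The IP replacements listed above (the ``MNG_IP`` variable and the ``DataIPAddress`` value) could be scripted rather than edited by hand. The following is only a minimal sketch, not part of the DISCOS tooling: the helper names `replace_ip` and `replace_ip_in_file`, the example addresses, and the `export MNG_IP=...` line format are all assumptions made for illustration. It replaces only whole occurrences of the old address, so that an address such as `192.168.1.10` is not altered when the old Manager was `192.168.1.1`.

```python
import re
from pathlib import Path
from typing import Tuple


def replace_ip(text: str, old_ip: str, new_ip: str) -> Tuple[str, int]:
    """Replace whole occurrences of old_ip in text; return (new_text, count)."""
    # The lookarounds reject matches embedded in a longer dotted number,
    # e.g. "192.168.1.1" inside "192.168.1.10" or "10.192.168.1.1".
    pattern = re.compile(r"(?<!\d)(?<!\d\.)" + re.escape(old_ip) + r"(?!\.?\d)")
    return pattern.subn(new_ip, text)


def replace_ip_in_file(path: str, old_ip: str, new_ip: str) -> int:
    """Rewrite the file in place, keeping a .bak copy; return the match count."""
    target = Path(path)
    original = target.read_text()
    updated, count = replace_ip(original, old_ip, new_ip)
    if count:
        # Keep the untouched original next to the edited file.
        Path(str(target) + ".bak").write_text(original)
        target.write_text(updated)
    return count


# Hypothetical addresses and line format, for illustration only.
sample = "export MNG_IP=192.168.1.1\nOTHER=192.168.1.10\n"
updated, count = replace_ip(sample, "192.168.1.1", "192.168.1.2")
print(count)    # 1
print(updated)  # only the MNG_IP line changes
```

A count of zero from `replace_ip_in_file` would suggest that the file does not contain the old address in the expected form, and is worth checking by hand before launching the control system.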
|
|
||
| For the whole environment to work properly, some other tweaks, not related | ||
| to the control system itself, are also necessary on the other DISCOS | ||
| machines: | ||
|
|
||
| - Replace the old Manager IP address with the new one in the ``/etc/hosts`` | ||
| file on the ``discos-console`` and ``discos-storage`` machines (if the DISCOS | ||
| control software runs in a distributed environment). This allows other | ||
| services, such as the Lustre service on the ``discos-storage`` machine, | ||
| to point to the correct IP address again. | ||
|
Member
Is there a procedure to point to? |
||
| - Perform the ssh key exchange procedure between the ``discos`` user of the | ||
|
Member
Does Mauro do all these things? :-D We need an example for him :-)
Member
Author
This is not a procedure that a generic observer can do. Performing the ssh key exchange requires knowing the password of both the
Member
I was joking, the point is that we have to write the documentation thinking that the reader is not one of the discos team... |
||
| newly installed Manager and the ones present on the ``discos-console`` and | ||
| ``discos-storage`` machines. The same procedure has to be performed between | ||
| the ``root`` users as well. This allows services and scripts, such as the | ||
| Lustre service on the ``discos-storage`` machine and the | ||
| ``discos-addProject`` and ``discos-removeProject`` scripts on the | ||
| ``discos-console`` machine, to perform remote tasks that would otherwise be | ||
| impossible. | ||
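The key exchange described in this item can be laid out as a checklist of commands. The sketch below is not the documented DISCOS procedure: the function name `key_exchange_plan` and the hostname `discos-manager` are hypothetical, and it assumes that each user already has a key pair and that password logins are still permitted on the target machines. It only builds the list of standard `ssh-copy-id` invocations, one per direction and per user, without executing anything.

```python
from typing import List, Tuple


def key_exchange_plan(
    manager: str, peers: List[str], users: List[str]
) -> List[Tuple[str, str]]:
    """Return (where to run, command) pairs for a two-way ssh key exchange."""
    plan = []
    for user in users:
        for peer in peers:
            # New Manager -> peer: lets the Manager log in on the peer.
            plan.append((f"{user}@{manager}", f"ssh-copy-id {user}@{peer}"))
            # Peer -> new Manager: lets the peer log in on the Manager.
            plan.append((f"{user}@{peer}", f"ssh-copy-id {user}@{manager}"))
    return plan


# "discos-manager" is an assumed hostname for the new Manager machine.
steps = key_exchange_plan(
    "discos-manager",
    ["discos-console", "discos-storage"],
    ["discos", "root"],
)
for run_on, command in steps:
    print(f"[{run_on}] {command}")
```

Each printed command has to be run manually by someone who knows the relevant passwords; the sketch deliberately stops short of running them.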
Where is the procedure?
This procedure involves logging in to those machines as root; if it has to be documented, this is not the place to do it. One suggestion is to perform this step in advance by allowing a range of addresses to control those machines, so that, in case of failure, this step can be skipped.
It is not clear to me how it is possible to replicate the manager without any information about this point. I think the procedure should be documented somewhere, and if this is not the place, we have to put a reference link to it here.