Releases: ORNL-TechInt/oddmon
Fix handling of job_stats data
The major change in this release is proper handling of job_stats data. Specifically, when Lustre removes a job from the job_stats file, oddmon now stops tracking it as well. If the job subsequently reappears in the job_stats file, oddmon's internal counters for that job start over at zero, so the calculated delta values are correct.
(Note that this situation is quite common in HPC jobs: They'll often perform a burst of I/O, then run a compute phase that lasts long enough for Lustre to drop them from the job_stats file, and then perform another burst of I/O.)
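A minimal sketch of this bookkeeping, not oddmon's actual code: the class name, the snapshot layout, and the `_delta` key suffix are assumptions made for illustration. Only the four counter names come from these release notes.

```python
# Hypothetical sketch: per-job delta tracking that forgets jobs dropped by Lustre.
FIELDS = ("read_samples", "write_samples", "read_sum", "write_sum")


class JobDeltaTracker:
    def __init__(self):
        # job_id -> raw counters seen in the previous job_stats snapshot
        self._last = {}

    def update(self, snapshot):
        """Take {job_id: {field: raw_count}} and return per-job deltas."""
        deltas = {}
        for job_id, counters in snapshot.items():
            # A job that is new (or has reappeared) has no previous entry,
            # so its deltas are measured from zero.
            previous = self._last.get(job_id, {})
            deltas[job_id] = {
                field + "_delta": counters[field] - previous.get(field, 0)
                for field in FIELDS
            }
        # Keep only jobs present in this snapshot; anything Lustre removed
        # from job_stats is dropped here as well.
        self._last = {job_id: dict(c) for job_id, c in snapshot.items()}
        return deltas
```

Without the reset, a job that reappears with smaller raw counters than it had before being dropped would produce negative deltas.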
Better error handling for the Pika calls
Enabled delivery confirmation and set the 'mandatory' flag for the RabbitMQ messages. With delivery confirmations enabled, the message publish function won't return until the message has been successfully sent and acknowledged. Prior to this change, the publish function would return immediately - despite it supposedly being a blocking connection - and messaging errors wouldn't be detected until the close() function was executed.
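The call sequence might look roughly like the sketch below. The function name is hypothetical, and `channel` is assumed to be a Pika `BlockingChannel` (or anything with the same two methods). Which Pika version oddmon targets is not stated here, so the sketch tolerates both behaviors: older releases return `False` from `basic_publish` on failure, while Pika 1.x raises an exception instead.

```python
# Hypothetical helper; `channel` is assumed to behave like pika's BlockingChannel.
def publish_confirmed(channel, exchange, routing_key, body):
    # Ask the broker to acknowledge each published message on this channel.
    channel.confirm_delivery()

    # mandatory=True asks the broker to return the message if it cannot be
    # routed to any queue, rather than silently discarding it.
    result = channel.basic_publish(
        exchange=exchange,
        routing_key=routing_key,
        body=body,
        mandatory=True,
    )

    # Older Pika versions signal failure with a False return value; newer
    # ones raise from basic_publish, so this check is simply never hit.
    if result is False:
        raise RuntimeError("message was not confirmed by the broker")
```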
The sub-process also sets an event flag after the publish function succeeds. This lets the main process tell whether the sub-process hung during the publish step or during the connection close, and retry only in the former case.
Added a small, random delay before connecting to the RMQ server so that the different monitor processes don't all hammer the server simultaneously.
Also added a command-line option to enable the Pika debug messages (which are normally disabled even in verbose mode).
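The last two items could be sketched as follows. The option name `--pika-debug`, the two-second upper bound on the delay, and the function names are assumptions for illustration, not oddmon's actual interface. The sketch relies only on the fact that Pika logs through the standard `logging` module under the `pika` logger hierarchy.

```python
import argparse
import logging
import random
import time


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="monitor sketch")
    parser.add_argument("-v", "--verbose", action="store_true")
    # Hypothetical flag name; kept separate from --verbose on purpose.
    parser.add_argument("--pika-debug", action="store_true",
                        help="also show Pika's own debug messages")
    return parser.parse_args(argv)


def configure_logging(args):
    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
    # Keep Pika quiet even in verbose mode unless explicitly requested.
    pika_level = logging.DEBUG if args.pika_debug else logging.WARNING
    logging.getLogger("pika").setLevel(pika_level)


def sleep_before_connect(max_delay=2.0, sleep=time.sleep):
    """Wait a random time so monitors started together connect at different moments."""
    delay = random.uniform(0.0, max_delay)
    sleep(delay)
    return delay
```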
Execute Pika calls in a sub-process
The changes in the previous release exposed another problem with RabbitMQ and/or Pika: Pika's connection close function would sometimes hang. To fix this problem, this release uses the multiprocessing package to execute all Pika functions in a separate sub-process. If the close function (or any other Pika function) hangs, the main process terminates the sub-process. The main process also resends the message to RMQ if necessary.
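A rough sketch of this watchdog scheme, combined with the event flag described for the later release above. All names here are hypothetical, and `send` and `close` are stand-ins for the real Pika calls so the pattern can be shown without a broker. The sketch assumes a Linux host and asks for the `fork` start method explicitly.

```python
import multiprocessing as mp

# Assumption: Linux host, so the "fork" start method is available.
_ctx = mp.get_context("fork")


def _worker(send, close, message, published):
    send(message)      # stands in for connect + publish
    published.set()    # from this point on, only the close can hang
    close()            # stands in for the connection close


def publish_with_watchdog(send, close, message, timeout=5.0, retries=3):
    """Run send/close in a child process; return the attempt that succeeded."""
    for attempt in range(1, retries + 1):
        published = _ctx.Event()
        child = _ctx.Process(target=_worker,
                             args=(send, close, message, published))
        child.start()
        child.join(timeout)
        if child.is_alive():
            # Something hung: kill the child rather than wait forever.
            child.terminate()
            child.join()
        if published.is_set():
            # The message went out, even if the close hung afterwards,
            # so resending would only create a duplicate.
            return attempt
        # Otherwise the hang or failure came before the publish finished: retry.
    raise RuntimeError("publish did not complete after %d attempts" % retries)
```

Checking the event rather than the child's exit status is what lets the parent avoid resending a message that was already delivered before the hang.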
Fix dropped connections in publisher
The publisher processes were having problems with their connections to the RMQ server getting dropped. The exact cause is unknown, but it appears that the pika.BlockingConnection class doesn't do well with long-lived connections. (Notably, the subscriber process uses pika.SelectConnection and has not had problems with the connection dropping.)
The main change for this release is to move to a scheme where the publisher opens the connection to the RMQ server, sends one message, and then closes the connection down. Since we're not sending messages too often, the extra overhead from opening and closing connections is considered acceptable.
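In outline, the one-connection-per-message scheme might look like the sketch below. The function name is hypothetical, and `connect` is assumed to be a zero-argument callable that returns a fresh connection, for example one that builds a `pika.BlockingConnection` from the site's connection parameters.

```python
# Hypothetical sketch of the open / send one message / close pattern.
def send_one(connect, exchange, routing_key, body):
    connection = connect()  # open a brand-new connection for this message
    try:
        channel = connection.channel()
        channel.basic_publish(exchange=exchange,
                              routing_key=routing_key,
                              body=body)
    finally:
        # Always tear the connection down, even if the publish failed,
        # so no long-lived connection is left behind to be dropped.
        connection.close()
```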
Update Job Stats Data
The Python code now calculates (and reports) the deltas for read_samples, write_samples, read_sum and write_sum. These deltas are included in the data exported to Splunk. (It's easier to do the calculations in the Python code than to try to do them as part of a Splunk query.)
Note: The raw counts are still included in the data, but the next version will probably remove them (because they just aren't very useful for Splunk queries).