Job Scheduler and Monitoring with Nagios
Introduction:
JobScheduler executes jobs and can, for example in case of an error, notify the responsible persons by e-mail. In an environment where Nagios is used to monitor processes, it is recommended to use the notification mechanism provided by Nagios itself.
This document describes how JobScheduler messages can be written directly into the Nagios console. For example:
- In case of error
- In case of success
- Monitoring of single jobs and job chains
- Monitoring if a particular job has run (successfully) at a specific time
System Environment
Nagios and JobScheduler can run on different servers. Communication takes place via the Nagios add-on NSCA. For the monitoring of Windows systems, a shell script is called via SSH and writes directly into the Nagios command pipe.
Preconditions
- The NSCA daemon has to be installed on the Nagios server
- The NSCA client has to be installed on the JobScheduler server
- A shell script for writing to the command pipe has to be installed on the Nagios server (only if Windows systems are monitored)
Job Scheduler Configurations
The following job chains and their jobs have to be installed:
- nsca_communication - for communication with NSCA
  - send_nsca_nagios - calls send_nsca
  - set_order_state - marks the end status for the NSCA order
- CheckJobRun - checks if a particular job has run
  - error/job_check_job_run - checks if a particular job has run successfully
  - error/add_nagios_alert - generates an order for nsca_communication
- sample_job_chain_with_errorHandling_at_the_end_by_nsca - an example job chain
  - sample_order:_jobr - sets an exit code
  - error/prepare_error.job - sets the error message for the order to NSCA
  - error/add_nagios_alert - creates an order for nsca_communication
  - error/set_error - sets the order's error state
Installation:
For the installation, the file JobScheduler_nagios.tar.gz has to be unpacked into the directory scheduler/config/live. The following files are created:
Job Chain for Communication with Nagios (Linux) NSCA
The files:
- nsca_communicationNSCA.job_chain.xml
- set_order_state.job.xml
- sendEvent2NagiosNSCA.job.xml
are copied into the directory scheduler/config/live
Job Chain for Communication with Nagios (Windows) SSH
The files:
- nagios_communicationSSH.job_chain.xml
- set_order_state.job.xml
- sendEvent2NagiosSSH.job.xml
are copied into the directory scheduler/config/live
Testing if a Job has run successfully
The files:
- CheckJobRun.job_chain.xml
- job_check_job_run.job.xml
- add_nagios_alert.job.xml
are copied into the directory scheduler/config/live/error_handling
Error Treatment for a Job
The file handle_exit_code.js is copied into the directory scheduler/config/live/error_handling
Error description for ..... in a Job Chain
The files:
- prepare_error.job.xml
- set_error.job.xml
are copied into the directory scheduler/config/live/error_handling
Functionality:
Job Chain for Communication with Nagios (Linux) NSCA
This job chain triggers the call of the NSCA client.
The job sendEvent2NagiosNSCA calls the NSCA client with parameters. The job set_order_state sets the state of the order according to the parameter par_severity. The order history in the operations GUI then immediately shows which type of job has been executed.
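The parametrized call can be pictured with a short sketch. It assumes the standard send_nsca input format (one tab-separated line of host, service description, return code and plugin output on standard input) and the usual -H and -c options. The function names, the Nagios server name and the configuration path are hypothetical and are not taken from the job configuration in the archive.

```python
import subprocess


def build_nsca_line(host, service, return_code, message):
    """Format one passive service check result for send_nsca (tab-separated)."""
    # Tabs and newlines inside the message would break the record format.
    clean = message.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


def build_send_nsca_command(nagios_server, config_file="/etc/send_nsca.cfg"):
    """Assemble the send_nsca argument list (server and path are assumptions)."""
    return ["send_nsca", "-H", nagios_server, "-c", config_file]


def send_to_nagios(nagios_server, host, service, return_code, message):
    """Pipe the formatted line into send_nsca; requires the NSCA client."""
    line = build_nsca_line(host, service, return_code, message)
    command = build_send_nsca_command(nagios_server)
    return subprocess.run(command, input=line, text=True, check=True)
```

In this sketch, the value of par_severity would be passed as return_code and the order's message as the plugin output.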
Job Chain for Communication with Nagios (Windows) SSH
This job chain triggers the call of the Shell script, which writes directly into the Nagios command pipe.
The job sendEvent2NagiosSSH is parametrized for the SSH connection. The job set_order_state sets the state of the order according to the parameter par_severity. The order history in the operations GUI then immediately shows which type of job has been executed.
Testing for the successful Run of a Job
The job chain CheckJobRun is used. An order is created for each job run that has to be checked.
As an example, an order has been implemented that checks whether the job Job Run SampeJobNscaNotification has run successfully.
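The kind of decision such a check makes can be sketched as follows. This is only an illustration under stated assumptions, not the logic of job_check_job_run itself: the job history is assumed to be available as a list of (end_time, exit_code) pairs, and the function name and the message texts are hypothetical. The return codes follow the Nagios convention (0 = OK, 2 = CRITICAL).

```python
from datetime import datetime


def check_job_run(history, window_start, window_end):
    """Return (nagios_code, text) for the runs that ended inside the window.

    history is a list of (end_time, exit_code) tuples (assumed data shape).
    """
    runs = [(end, rc) for end, rc in history if window_start <= end <= window_end]
    if not runs:
        return 2, "job has not run in the expected period"
    # Only the most recent run inside the window decides the result.
    last_end, last_rc = max(runs, key=lambda run: run[0])
    if last_rc != 0:
        return 2, f"last run ended with exit code {last_rc}"
    return 0, f"job ran successfully at {last_end.isoformat()}"
```

The resulting code and text could then be sent to the service that belongs to the checked job.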
Error Treatment for a Job
In order to check a single job, post-processing can be implemented. In this case a message is sent to Nagios.
Error Treatment with Nodes in a Job Chain
- The job chain sample_job_chain_with_errorHandling_at_the_end_by_nsca shows error handling in which messages are sent to Nagios.
- The job sample_order_job2 simulates an error.
prepare_error
Sets the order parameter "par_message" to the value <job_chain>/<order_id> RC = <exit_code>.
add_nagios_alert
Adds an order to the job chain for communication with Nagios. par_severity is set according to the exit code. This part can be customized for each customer.
If the parameter par_service is not set, the service is set to Scheduler Errors, Scheduler Warnings or Scheduler Success, according to the value of par_severity.
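The two steps above can be sketched as follows. The sketch assumes that par_severity carries the Nagios return codes 0, 1 and 2; the source does not state the actual values. The mapping from exit code to severity is a hypothetical rule that, as noted above, each customer would adapt. All function names are illustrative.

```python
# Assumed mapping of par_severity (Nagios return codes) to the default services.
DEFAULT_SERVICES = {
    0: "Scheduler Success",
    1: "Scheduler Warnings",
    2: "Scheduler Errors",
}


def build_message(job_chain, order_id, exit_code):
    """Build par_message in the form <job_chain>/<order_id> RC = <exit_code>."""
    return f"{job_chain}/{order_id} RC = {exit_code}"


def severity_from_exit_code(exit_code):
    """Hypothetical rule: exit code 0 is success, anything else is an error."""
    return 0 if exit_code == 0 else 2


def resolve_service(params):
    """Use par_service if it is set, otherwise derive it from par_severity."""
    service = params.get("par_service")
    if service:
        return service
    severity = int(params["par_severity"])
    # Unknown severities fall back to the error service (an assumption).
    return DEFAULT_SERVICES.get(severity, "Scheduler Errors")
```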
set_error
- Sets the order state to error. This step is necessary so that orders with errors are marked as such even when the error handling itself was successful.
Nagios Configuration
Setting up a Host for each Job Scheduler
A host is set up for each server with a JobScheduler. If several Job Schedulers are installed on one host, one host configuration is used for all JobSchedulers.
define host{
    use                     generic-host    ; Name of host template to use
    host_name               yourHostName
    alias                   aliasHost
    address                 192.11.0.100
    check_command           check-host-alive
    max_check_attempts      10
    notification_interval   120
    notification_period     24x7
    notification_options    d,r
    contact_groups          admins
}
Within the Nagios configuration, services have to be set up.
Setting up a Service
A generic, re-usable service is set up. It is important to activate the passive checks and to deactivate the active checks. Since a check command is mandatory in a service definition, a dummy command is used.
define service{
    use                     generic-service
    name                    passive_service
    active_checks_enabled   0
    passive_checks_enabled  1               ; We want only passive checking
    flap_detection_enabled  0
    register                0               ; This is a template, not a real service
    is_volatile             0
    check_period            24x7
    max_check_attempts      1
    normal_check_interval   5
    retry_check_interval    1
    check_freshness         0
    contact_groups          admins
    check_command           check_dummy!0
    notification_interval   120
    notification_period     24x7
    notification_options    w,u,c,r
    stalking_options        w,c,u
}
define command{
    command_name            check_dummy
    command_line            $USER1$/check_dummy $ARG1$
}
In principle, one additional service that collects all messages from JobScheduler is sufficient. However, it is recommended to distribute the messages across several services, because a service only shows its latest result and, for example, an error message could otherwise be overwritten by a later success message. A simple split is by message type (error, warning, success). It is also possible to set up a service for a particular job or job chain.
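The overwriting effect can be illustrated with a small model. It assumes only that Nagios displays the most recent result per service. The function name, job chain names and messages are made up for the example.

```python
def latest_state(results):
    """Keep only the most recent (return_code, message) per service."""
    state = {}
    for service, return_code, message in results:
        state[service] = (return_code, message)
    return state


# One collecting service: the later success hides the earlier error.
single = latest_state([
    ("Scheduler Messages", 2, "chainA/1 RC = 1"),
    ("Scheduler Messages", 0, "chainB/7 RC = 0"),
])

# Split by message type: the error stays visible.
split = latest_state([
    ("Scheduler Errors", 2, "chainA/1 RC = 1"),
    ("Scheduler Success", 0, "chainB/7 RC = 0"),
])
```

With a single service only the success of chainB remains visible, while the split keeps the error of chainA on its own service.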
Messages for a particular Job or Job Chain
define service{
    use                     passive_service
    host_name               yourHostName
    service_description     Job Run SampeJobNscaNotification
}
Warnings, Errors and Success Messages
define service{
    use                     passive_service
    host_name               yourHostName
    service_description     Scheduler Warnings
}
define service{
    use                     passive_service
    host_name               yourHostName
    service_description     Scheduler Errors
}
define service{
    use                     passive_service
    host_name               yourHostName
    service_description     Scheduler Messages
}
Script to write into the command pipe (only if Windows servers are monitored)
If the messages from a JobScheduler running on Windows are to be written into the Nagios console, a shell script has to be installed on the Nagios server. This script writes into the Nagios command pipe and is executed via SSH.
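What such a script writes can be sketched as follows. The actual script is a shell script (see Sources); this Python sketch only illustrates the line format of the Nagios external command PROCESS_SERVICE_CHECK_RESULT. The function names are hypothetical, and the pipe path is a common default that is assumed here and depends on the Nagios installation.

```python
import time


def format_external_command(host, service, return_code, message, now=None):
    """Format a passive service check result for the Nagios command pipe."""
    timestamp = int(time.time() if now is None else now)
    # The command must fit on one line, so newlines are replaced.
    clean = message.replace("\n", " ")
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{clean}\n")


def write_to_command_pipe(line, pipe_path="/usr/local/nagios/var/rw/nagios.cmd"):
    """Append the command to the pipe (the path is an assumption)."""
    with open(pipe_path, "w") as pipe:
        pipe.write(line)
```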
Sources:
- NSCA Installation: http://nagios.sourceforge.net/download/contrib/documentation/misc/NSCA_Setup.pdf
- NSCA Download: http://www.nagios.org/download/addons
- Shell Script sendEvent2Nagios: http://nagios.sourceforge.net/download/contrib/misc/sendevent2nagios/
- File with the examples (JobScheduler_nagios.tar.gz): http://www.sos-berlin.com/download/JobScheduler_nagios.tar.gz