JobScheduler Monitoring Interface

Overview

This solution monitors JobScheduler and its objects, such as Jobs, Job Chains and Orders. The main features of its architecture are:

JobScheduler: The architecture separates two tasks:
- Detecting errors: A Job Chain analyses the JobScheduler log history and checks whether the monitored JobScheduler objects ended with errors or warnings.
- Sending alerts: Another Job Chain sends the alerts to the corresponding System Monitor. Not every alert is an incident; some are events, i.e. occurrences such as the fact that a specific Job Chain was executed and the result it ended with.
JobScheduler: The architecture allows the log history of more than one JobScheduler to be analysed.
System Monitor: JobScheduler can connect to more than one System Monitor at the same time.

Definitions

System Monitor: A System Monitor is a tool that informs the Service Desk (1st Level Support) about incidents in IT systems. It is not meant for analysing incidents, only for reporting them so that the information can be forwarded and escalated.

Passive Checks: Checks whose results are sent to the System Monitor from an external host (external from the point of view of the System Monitor). In contrast, checks that the System Monitor itself carries out periodically are called active checks.

Alerting: An alert is an alarm, i.e. a message about an event. An alert does not carry all the relevant information about an event; it only reports that the event exists. An alert can be either positive or negative.

Notification: The notification of a specific alert. Not every alert is notified, only those configured for notification. Notifications, being a subset of the alerts, can also be either positive or negative.

Acknowledgement: The confirmation of an alert. It means that the alert has been seen and that someone is working on recovering from the incident. An acknowledgement is always carried out manually: a person has noticed that a service is Critical and acknowledges that service. It is never an automated step.

Benefits

The benefits of the new solution are:

No changes to your existing JobScheduler configuration (Jobs, Job Chains, etc.) are needed to get this solution working. You have to add the Job Chains for the monitoring, but you do not have to modify your current ones.
The whole architecture resides on the JobScheduler side, so the solution is independent of the monitor that the alerts are sent to. It works with every monitor that can receive passive checks.
The processing of Jobs and Job Chains in JobScheduler is not affected or modified by the monitoring, neither in terms of performance nor in terms of stability.

Functionality

Job Chain and Order Monitoring: Job Chains in JobScheduler can be monitored with the new solution. Strictly speaking, the monitored elements are the Orders that trigger these Job Chains.

History Notifications: Not only critical alerts are monitored, but also positive ones. The history of a specific service is monitored as well, so that you can see whether a specific workflow was executed and what result it ended with.

Timers for Job Chains: Timers measure the run time of a Job Chain. If a Job Chain takes too long to end, a critical alert is sent to a System Monitor.

Acknowledgement: Once a service in the System Monitor is critical, it can be acknowledged. That action executes a script that adds an Order to JobScheduler, specifically to the Job Chain "ResetNotifications" (see the Job Chains section under Installation and Configuration below).

Installation and Configuration

Because the architecture resides on the JobScheduler side, most of the installation and configuration has to be done there.

Database Tables

Three database tables have to be created in the JobScheduler database:

SCHEDULER_MON_NOTIFICATIONS
SCHEDULER_MON_RESULTS
SCHEDULER_MON_SYSTEM_NOTIFICATIONS

Java Program

A JAR file has to be added to the \lib folder of your JobScheduler installation:

com.sos.scheduler.notification-xxx.jar (xxx stands for the version number)

This JAR is currently not included in the JobScheduler installation.

XML Schemas

The schemas have to be placed in \config\notification.

CheckHistoryConfiguration_v1.0.xsd

Description:

Specifies the JobScheduler that should be monitored
Specifies the JobScheduler objects that should be monitored
Timers check the job and job chain execution for timeouts

Example: CheckHistoryConfiguration.xml

The following example shows an XML file used for monitoring a specific Job Chain:

    <?xml version="1.0" encoding="utf-8"?>
    <CheckHistoryConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="CheckHistoryConfiguration_v1.0.xsd">
        <MonitoredObject>
            <JobChains>
                <JobChain scheduler_id="MY_JOB_SCHEDULER_ID" name="MY_JOB_CHAIN_NAME"/>
            </JobChains>
            <Timers>
                <Timer>
                    <JobChains>
                        <JobChain name="MY_JOB_CHAIN_NAME"/>
                    </JobChains>
                    <Maximum><Script language="javascript"><![CDATA[
                        function calculate() {
                            var fileSize             = new java.lang.Double(%FILE_SIZE%);
                            var timerExpiryFactor    = 0.0025;
                            var timerExpiryTolerance = timerExpiryFactor * 0.1;
                            var timerExpiry          = new java.lang.Double(timerExpiryFactor + timerExpiryTolerance);
                            timerExpiry              = timerExpiry * fileSize;
                            return timerExpiry;
                        }
                        calculate();
                    ]]></Script></Maximum>
                </Timer>
            </Timers>
        </MonitoredObject>
    </CheckHistoryConfiguration>

Mapped to the description above:

Specifies the JobScheduler that should be monitored: the JobScheduler with the ID "MY_JOB_SCHEDULER_ID"
Specifies the JobScheduler objects that should be monitored: the Job Chain "MY_JOB_CHAIN_NAME"
Timers check the job and job chain execution for timeouts: a Timer for the Job Chain "MY_JOB_CHAIN_NAME", together with a function that calculates the expiration time of the timer
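
To make the timer expression concrete, the same arithmetic can be run as plain JavaScript outside JobScheduler. This is only a sketch: the function name maximumFor is hypothetical, a numeric argument stands in for the %FILE_SIZE% placeholder, and since the example does not state the units of the file size or of the result, none are assumed here.

```javascript
// Sketch of the arithmetic in the <Maximum> script above, without the
// java.lang.Double wrappers. maximumFor is a hypothetical name; fileSize
// stands in for the value substituted for %FILE_SIZE%.
function maximumFor(fileSize) {
    var timerExpiryFactor = 0.0025;
    var timerExpiryTolerance = timerExpiryFactor * 0.1; // 10% of the factor
    // Effective factor: 0.0025 + 0.00025 = 0.00275
    return (timerExpiryFactor + timerExpiryTolerance) * fileSize;
}

// A file size of 1000000 gives a maximum of about 2750 (floating point).
var exampleMaximum = maximumFor(1000000);
```

The maximum grows linearly with the file size, so larger files are given proportionally more time before the Timer expires.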

SystemMonitorNotification_v1.0.xsd

Description:

Configures the notifications to a specific System Monitor
Specifies the type of notification: "service_name_on_error" or "service_name_on_success"
Specifies the EXACT name of the service (as it is named in the System Monitor)
Specifies the EXACT hostname of the host the notifications are sent from (as it is named in the System Monitor)
Specifies the port on which the application that receives passive checks is running
Specifies the hostname of the System Monitor, i.e. the host the notifications are sent to
Defines the type of encryption used to send the information to the System Monitor
Specifies how many notifications have to be sent to the System Monitor for a specific JobScheduler object
Specifies the same for a Timer, if one is configured for this JobScheduler object

Example: SystemMonitorNotification_op5.xml

The following example shows an XML file used for notifying a specific System Monitor (op5 Monitor):

    <?xml version="1.0" encoding="utf-8"?>
    <SystemMonitorNotification xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SystemMonitorNotification_v1.0.xsd">
        <Notification>
            <NotificationMonitor service_name_on_error="OMS Mass Processing Problem Tracking">
                <NotificationInterface service_host="OMS Interfacing Server" monitor_port="5667" monitor_host="dipsy.sos" monitor_encryption="XOR">order history id=%ORDER_HISTORY_ID%, job chain=%JOB_CHAIN_NAME%, order id=%ORDER_ID%, step =%ORDER_STEP_STATE%, error=%ERROR_TEXT%, check = %CHECK_TEXT%</NotificationInterface>
            </NotificationMonitor>
            <NotificationObject>
                <JobChains>
                    <JobChain scheduler_id="scheduler" notifications="20" name="MY_JOB_CHAIN_NAME"/>
                </JobChains>
                <Timers>
                    <Timer>
                        <JobChains>
                            <JobChain notifications="20" name="MY_JOB_CHAIN_NAME"/>
                        </JobChains>
                    </Timer>
                </Timers>
            </NotificationObject>
        </Notification>

        <Notification>
            <NotificationMonitor service_name_on_success="OMS Mass Processing History">
                <NotificationInterface service_host="OMS Interfacing Server" monitor_port="5667" monitor_host="dipsy.sos" monitor_encryption="XOR">order history id=%ORDER_HISTORY_ID%, job chain=%JOB_CHAIN_NAME%, order id=%ORDER_ID%</NotificationInterface>
            </NotificationMonitor>
            <NotificationObject>
                <JobChains>
                    <JobChain scheduler_id="scheduler" notifications="20" name="MY_JOB_CHAIN_NAME"/>
                </JobChains>
            </NotificationObject>
        </Notification>
    </SystemMonitorNotification>

For this example, mapped to the schema description above:

Configures the notifications to a specific System Monitor: op5 Monitor
Specifies the type of notification: "service_name_on_error" or "service_name_on_success"
Specifies the EXACT name of the service (as it is named in the System Monitor): "OMS Mass Processing Problem Tracking"
Specifies the EXACT hostname of the host the notifications are sent from (as it is named in the System Monitor): "OMS Interfacing Server"
Specifies the port on which the application that receives passive checks is running: "5667"
Specifies the hostname of the System Monitor, i.e. the host the notifications are sent to: "dipsy.sos"
Defines the type of encryption used to send the information to the System Monitor: "XOR"
Specifies how many notifications have to be sent to the System Monitor for a specific JobScheduler object: "20" notifications for "MY_JOB_CHAIN_NAME"
Specifies the same for a Timer, if one is configured for this JobScheduler object: "20" notifications for the Timer for "MY_JOB_CHAIN_NAME"
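
The message text of a NotificationInterface contains placeholders such as %ORDER_ID%. The sketch below illustrates how such a template could be filled with values. It is not the code of the notification JAR: the function name expandTemplate and the sample values are made up for illustration, and leaving unknown placeholders untouched is an assumption.

```javascript
// Hypothetical helper: replaces %NAME% placeholders with values from a map.
// Placeholders without a value are left unchanged (an assumption).
function expandTemplate(template, values) {
    return template.replace(/%([A-Z_]+)%/g, function (match, name) {
        return Object.prototype.hasOwnProperty.call(values, name)
            ? String(values[name])
            : match;
    });
}

// Template taken from the service_name_on_success example above.
var template = "order history id=%ORDER_HISTORY_ID%, job chain=%JOB_CHAIN_NAME%, order id=%ORDER_ID%";

// Sample values, invented for illustration.
var message = expandTemplate(template, {
    ORDER_HISTORY_ID: 4711,
    JOB_CHAIN_NAME: "MY_JOB_CHAIN_NAME",
    ORDER_ID: "order_1"
});
// message: "order history id=4711, job chain=MY_JOB_CHAIN_NAME, order id=order_1"
```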

Job Chains

The Job Chains for this solution have to be placed under \live\notification. Four Job Chains were implemented, with the following functions:

CheckHistory: reads the JobScheduler database tables that hold the log history, analyses them and writes the results into other tables, the Notification tables.
CleanupNotifications: deletes entries in the Notification tables. Currently this takes place once a day.
ResetNotifications: sets the status of notifications in the Notification tables (e.g. Acknowledge).
SystemNotifier: notifies the System Monitor about the current notifications. This Job Chain also updates the Notification tables after it has notified the System Monitor.

System Monitor

The monitoring tool receives passive checks only; there are no active checks for monitoring JobScheduler. The only configuration required here is the ability to receive passive checks from a remote host.
The services in the System Monitor have to match the JobScheduler configuration. Passive checks (services) have to be configured and named following the convention used in JobScheduler.
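
To show why the names have to match, the sketch below builds one passive check result in the tab-separated layout read by the NSCA client send_nsca (host, service, state code, message). It rests on an assumption: that the System Monitor accepts Nagios-style passive results, which this document does not state explicitly. The helper name passiveCheckLine and the message text are hypothetical; the host and service names are taken from the op5 example.

```javascript
// Nagios-style service states (assumption: the monitor uses this convention).
var STATE = { OK: 0, WARNING: 1, CRITICAL: 2, UNKNOWN: 3 };

// Hypothetical helper: builds one tab-separated passive check result line
// (host, service, state code, message), terminated by a newline.
function passiveCheckLine(serviceHost, serviceName, state, message) {
    return [serviceHost, serviceName, state, message].join("\t") + "\n";
}

// Host and service must be spelled exactly as configured in the monitor.
var line = passiveCheckLine(
    "OMS Interfacing Server",
    "OMS Mass Processing Problem Tracking",
    STATE.CRITICAL,
    "job chain=MY_JOB_CHAIN_NAME, error=sample error text"
);
```

The host and service fields are the only link between the result and a configured service, which is why the names in the notification configuration have to be the exact names used in the System Monitor.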

Use Cases

Recoverable Errors

Initial Situation: A Job Chain is triggered by directory monitoring. That is, when a certain file arrives in a monitored folder, the Job Chain starts.

Problem: The Job Chain ended with an error.

Handling: The System Monitor is notified with an error message for the service related to the Job Chain. The error message stays in the System Monitor until the same file is placed in the monitored directory again and the Job Chain ends without errors. If a different file makes the Job Chain end without errors, the error does not count as recovered, because the file that was processed is not the one that failed.
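
The recovery rule above can be sketched as a small state tracker: an error stays open until the same file is processed successfully. This is an illustration only; the names ErrorTracker, record and isCritical are hypothetical, and identifying a file by its name is an assumption.

```javascript
// Hypothetical tracker for one monitored Job Chain.
function ErrorTracker() {
    this.failedFiles = {}; // file name -> true while its error is open
}

// Records the outcome of one Job Chain run for the given file.
ErrorTracker.prototype.record = function (fileName, endedWithError) {
    if (endedWithError) {
        this.failedFiles[fileName] = true;
    } else {
        // A success only clears the error of the same file.
        delete this.failedFiles[fileName];
    }
};

// The service stays critical while any file still has an open error.
ErrorTracker.prototype.isCritical = function () {
    return Object.keys(this.failedFiles).length > 0;
};

var tracker = new ErrorTracker();
tracker.record("a.csv", true);    // a.csv fails: critical
tracker.record("b.csv", false);   // another file succeeds: still critical
var stillCritical = tracker.isCritical();
tracker.record("a.csv", false);   // a.csv succeeds: recovered
var recovered = !tracker.isCritical();
```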

Workflow Execution takes too long

Initial Situation: A Job Chain is triggered but does not end; it hangs in a step and therefore takes longer than expected.

Problem: The execution time was too long.

Handling: A Timer is set for this Job Chain, and the System Monitor is notified when it expires. The expiration times for the Job Chains are configured to leave enough time for normal processing, so this mechanism usually catches cases where the Job Chain hangs in a specific step.
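
The comparison behind such a Timer can be sketched as follows. The function name isTimerExpired is hypothetical, and all three arguments are assumed to use the same time unit, since the unit of the configured maximum is not stated in this document.

```javascript
// Hypothetical check: has a run exceeded its configured maximum?
// startTime, now and maximum must use the same time unit (an assumption).
function isTimerExpired(startTime, now, maximum) {
    return (now - startTime) > maximum;
}

var hanging = isTimerExpired(0, 3000, 2750); // true: the run exceeded the maximum
var normal = isTimerExpired(0, 1200, 2750);  // false: still within the maximum
```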

SFTP connection refused

Initial Situation: A Job Chain uses SFTP for transferring files. A setback is configured for this step of the Job Chain, so that the step is retried after some time if the connection to the SFTP server fails.

Problem: The SFTP server is no longer available.

Handling: The System Monitor is notified with an error message for the service related to the Job Chain. (to be completed)
