Overview

This solution monitors JobScheduler and its objects, such as Jobs, Job Chains and Orders. This page gives an overview of how JobScheduler monitoring works. The feature will be available starting with general availability release 1.8.

The architecture of this solution has the following features:

  • JobScheduler: The architecture separates two tasks:
    • Detecting errors: A Job Chain analyses the JobScheduler logging and checks whether the monitored JobScheduler objects reported errors or warnings.
    • Sending alerts: Another Job Chain sends the alerts to the corresponding System Monitor. Not all alerts are incidents; some are events, that is, occurrences such as the fact that a specific Job Chain was executed and the result it ended with.
  • JobScheduler: The architecture allows the Log History of more than one JobScheduler to be analyzed.
  • System Monitor: JobScheduler can connect to more than one System Monitor at the same time.

Definitions

  • System Monitor: A System Monitor is a tool that informs the Service Desk (1st Level Support) about incidents in IT systems. It is not intended for analyzing incidents, only for reporting them so that the information can be forwarded and escalated.
  • Passive Checks: Checks whose results are sent to the System Monitor by an external host (external from the point of view of the System Monitor). Checks that the System Monitor itself carries out periodically are called active checks.
  • Alerting: An alert is an alarm, i.e. a message about an event. An alert does not carry all the relevant information about an event; it only reports that the event exists. An alert can be either positive or negative.
  • Notification: The notification of a specific alert. Not every alert is notified, only those configured for notification. Notifications are therefore a subset of the alerts and can likewise be positive or negative.
  • Acknowledgement: The confirmation of an alert. It means that the alert has been seen and/or the problem is known and recovery of the incident is in progress. An acknowledgement is always carried out manually: someone (usually the Service Desk or 1st Level Support) has noticed that a service is Critical and acknowledges it. It is never an automated step.

Benefits

The benefits of the new solution are:

  1. No changes to your existing JobScheduler configuration (Jobs, Job Chains, etc.) are needed for this solution to work. You have to add the Job Chains for the monitoring, but you do not have to modify your current ones.
  2. The whole architecture resides on the JobScheduler side, so the solution is independent of the monitor that the alerts are sent to. It works with every monitor that can receive passive checks.
  3. The processing of Jobs and Job Chains in JobScheduler is not affected or modified by the monitoring, in terms of neither performance nor stability.
  4. The level of detail in a Service message in the System Monitor is much higher with this solution. JobScheduler logs precisely what the error is about, and this information is sent as a passive check to the specific Service, which shows the message that JobScheduler logged.
  5. The criticality of an error is immediately recognizable in the System Monitor. JobScheduler has all the information about errors; this information is sorted and sent to different Services in the System Monitor for each specific case. This lets the Service Desk immediately prioritize the recovery of errors. For example, recovering from a Performance error (low) is less critical than recovering from an error where Documents could not be generated (high).

     [Figure: representation of this feature]

Functionality

  • Job Chain and Order Monitoring: Job Chains in JobScheduler can be monitored with the new solution. Strictly speaking, the monitored elements are the Orders that trigger these Job Chains.
  • History Notifications: Positive alerts are monitored as well as critical ones. The history of a specific service is also monitored, to see exactly whether a specific workflow was executed and what result it ended with.
  • Timers for Job Chains: Timers measure the performance of a Job Chain. If a Job Chain takes too long to end, a critical alert is sent to a System Monitor.
  • Acknowledgement: Once a service in the System Monitor is critical, it can be acknowledged. This action adds an Order to JobScheduler, so that JobScheduler stops sending notifications to the System Monitor for this service.

Installation and Configuration

Because the architecture resides on the JobScheduler side, most of the installation and configuration is done there.

Database Tables

Three database tables have to be created in the JobScheduler database:

  • SCHEDULER_MON_NOTIFICATIONS
  • SCHEDULER_MON_RESULTS
  • SCHEDULER_MON_SYSTEM_NOTIFICATIONS

Java Program

The following JAR files have to be placed in the <scheduler_home>/lib folder of JobScheduler:

com.sos.scheduler.notification.jar

jsendnsca-2.0.1.jar

These JARs are currently not included in the JobScheduler installation.

XML Schemas

XML Schemas and XML files (see examples below) have to be placed together in \config\notification. The schemas define which values are allowed in your XML files for the JobScheduler monitoring. You therefore only modify your XML files to configure the JobScheduler objects you want to monitor and the System Monitor you want to use; the XML schemas themselves do not have to be modified.

Schema: CheckHistoryConfiguration_v1.0.xsd

Description:

  1. Specifies the JobScheduler objects that should be monitored for error and success conditions:
    1. Jobs, Job Chains and Orders eligible for monitoring.
    2. The monitoring of JobScheduler objects is independent of the System Monitor that will be used and applies to all notifications to System Monitors.
    3. For settings that are specific to notifications, see the configuration for System Monitors below (SystemMonitorNotification_v1.0.xsd).
  2. Specifies checks to measure the performance of JobScheduler objects:
    1. Timers check whether timeouts for job and job chain execution have been exceeded.

Example: CheckHistoryConfiguration.xml

The following is an example of an XML file used for monitoring specific Job Chains:

  <?xml version="1.0" encoding="utf-8"?>
  <CheckHistoryConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="CheckHistoryConfiguration_v1.0.xsd">
      <MonitoredObject>

          <JobChains>
              <JobChain name="samples/sample_jobChain_1"/>
              <JobChain name="samples/sample_jobChain_2"/>
              <JobChain name="samples/sample_jobChain_3"/>
          </JobChains>

          <Timers>
              <Timer>
                  <!--
                  configure job chains and expected maximum execution time for performance measurement

                  impact: if the execution time of the current order in job chain samples/sample_jobChain_1 is greater than 10 seconds,
                  the current order will be flagged as a performance problem.
                  -->
                  <JobChains>
                      <JobChain name="samples/sample_jobChain_1" step_from="100" step_to="100"/>
                  </JobChains>
                  <Maximum><Script language="javascript"><![CDATA[10]]></Script></Maximum>
              </Timer>
          </Timers>
      </MonitoredObject>
  </CheckHistoryConfiguration>

Explanation

  • MonitoredObject/JobChains can contain several JobChain definitions for monitoring error or success conditions
    • JobChain has the following attributes
      • scheduler_id (optional) - identifier of the JobScheduler instance. Default: the JobChain is checked in all JobScheduler instances that log to the same database
      • name (optional) - Job Chain name including any folder names. Default: all JobChains for the given scheduler_id are checked
      • step_from (optional) - name of the Job node at which checking starts
      • step_to (optional) - name of the Job node at which checking ends
  • MonitoredObject/Timers can contain several Timer definitions for performance measurement
    • MonitoredObject/Timers/Timer has the following elements
      • JobChains (optional) - can contain several JobChain definitions for performance measurement
      • JobChain has the following attributes
        • scheduler_id (optional) - identifier of the JobScheduler instance. Default: the JobChain is checked in all JobScheduler instances that log to the same database
        • name (optional) - Job Chain name including any folder names. Default: all JobChains for the given scheduler_id are checked
        • step_from (optional) - name of the Job node at which checking starts
        • step_to (optional) - name of the Job node at which checking ends
      • Minimum (optional) - expected minimum execution time for all job chains configured in MonitoredObject/Timers/Timer/JobChains
        • Script (optional) - defines the expected minimum value and has the following attribute
          • language (optional) - script engine; currently the javascript engine is supported
      • Maximum (optional) - expected maximum execution time for all job chains configured in MonitoredObject/Timers/Timer/JobChains
        • Script (optional) - defines the expected maximum value and has the following attribute
          • language (optional) - script engine; currently the javascript engine is supported
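
The optional attributes described above can be combined, for example as in the following sketch. The scheduler IDs, the node names "100" and "300" and the minimum value 5 are hypothetical placeholders chosen for illustration, not values taken from a real installation.

```xml
<MonitoredObject>
    <JobChains>
        <!-- check only the nodes from "100" to "300" of one job chain in one JobScheduler instance -->
        <JobChain scheduler_id="my_scheduler" name="samples/sample_jobChain_1" step_from="100" step_to="300"/>
        <!-- no name given: all job chains of this JobScheduler instance are checked -->
        <JobChain scheduler_id="my_other_scheduler"/>
    </JobChains>
    <Timers>
        <Timer>
            <JobChains>
                <JobChain scheduler_id="my_scheduler" name="samples/sample_jobChain_1"/>
            </JobChains>
            <!-- expected minimum execution time; the value 5 is an assumed example -->
            <Minimum><Script language="javascript"><![CDATA[5]]></Script></Minimum>
        </Timer>
    </Timers>
</MonitoredObject>
```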

Sample Timer configuration using an order parameter to calculate the expected execution time

  ....
  
  <Timer>
      <!--
      configure job chains and expected maximum execution time for performance measurement

      impact: if the execution time of job chain samples/sample_jobChain_1 is greater than the calculated time (in seconds),
      the order will be flagged as a performance problem.

      The calculation uses the order parameter FILE_SIZE.

      Parameter FILE_SIZE must be configured on the appropriate step in a job chain (using StoreResultsJobJSAdapterClass as monitor) for storing in the database.
      -->
      <JobChains>
          <JobChain name="samples/sample_jobChain_1"/>
      </JobChains>
      <Maximum>
          <Script language="javascript"><![CDATA[
              function calculate() {
                  var fileSize             = new java.lang.Double(%FILE_SIZE%);
                  var timerExpiryFactor    = 0.0025;
                  var timerExpiryTolerance = timerExpiryFactor * 0.1;
                  var timerExpiry          = new java.lang.Double(timerExpiryFactor + timerExpiryTolerance);
                  timerExpiry              = timerExpiry * fileSize * 60;
                  return timerExpiry;
              }
              calculate();
          ]]></Script>
      </Maximum>
  </Timer>
  ...     
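
As a worked example of the calculation in the script above, assume a hypothetical FILE_SIZE value of 1000 (the unit of FILE_SIZE is not stated here, so the figure is purely illustrative). With the factor f = 0.0025 and a tolerance of 10% of f, the timer expiry t in seconds would be:

```latex
t = (f + 0.1\,f) \cdot \mathrm{FILE\_SIZE} \cdot 60
  = (0.0025 + 0.00025) \cdot 1000 \cdot 60
  = 165
```

Under these assumptions, an order whose execution time exceeds 165 seconds would be reported as a performance problem.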

Schema: SystemMonitorNotification_v1.0.xsd

Description:

  1. Configures notifications to a specific System Monitor
  2. Specifies the type of notification: "service_name_on_error" or "service_name_on_success"
  3. Specifies the service status of the notification: "service_status_on_error" or "service_status_on_success"
  4. Specifies the EXACT name of the service (as it is named in the System Monitor)
  5. Specifies the EXACT hostname of the host the notifications are sent from (as it is named in the System Monitor)
  6. Specifies the port on which the application that receives passive checks is listening
  7. Specifies the hostname of the System Monitor, that is, the host the notifications are sent to
  8. Defines the type of encryption used to send the information to the System Monitor
  9. Specifies how many notifications are sent to the System Monitor for a specific JobScheduler object
  10. Specifies the same number for a Timer, if one is configured for this JobScheduler object

Example: SystemMonitorNotification_op5.xml

The following is an example of an XML file used for notifying a specific System Monitor (op5 Monitor) using NotificationCommand:

<?xml version="1.0" encoding="utf-8"?>
<SystemMonitorNotification xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="SystemMonitorNotification_v1.0.xsd">
    <Notification>
        <!--
        configure system monitor service name and command for sending notifications to the OP5 system monitor using the nsca client

        notification command substitution in this case:

        All Environment variables   e.g. %TEMP% or %JAVA_HOME%

        %SERVICE_NAME%              Error Service (service_name_on_error)

        %SERVICE_STATUS%            1 if error occurred     (service_status_on_error)
                                    0 if error recovered    (service_status_on_success)

        %SERVICE_MESSAGE_PREFIX%    ERROR       if error occurred
                                    RECOVERED   if error recovered
                                    TIMER       if performance check

        %ORDER_HISTORY_ID% ...      table field name of result row for building message (see table definition SCHEDULER_MON_NOTIFICATIONS)
        -->
        <NotificationMonitor service_name_on_error="Error Service" service_status_on_error="1" service_status_on_success="0">
            <NotificationCommand>
<![CDATA[cmd /c echo my_nsca_service_host:%SERVICE_NAME%:%SERVICE_STATUS%:%SERVICE_MESSAGE_PREFIX%history id=%ORDER_HISTORY_ID%, step =%ORDER_STEP_STATE%, error=%ERROR_TEXT%, check = %CHECK_TEXT% | C:\nsca\send_nsca.exe -H nsca_server_host -c C:\nsca\send_nsca.cfg -d : ]]>
            </NotificationCommand>
        </NotificationMonitor>

        <NotificationObject>
            <!--
            configure job chains and the number of notifications sent for the same problem

            requirement: monitoring of these job chains must be configured in CheckHistoryConfiguration.xml
            -->
            <JobChains>
                <JobChain notifications="10" name="samples/sample_jobChain_1"/>
                <JobChain notifications="10" name="samples/sample_jobChain_2"/>
            </JobChains>

            <Timers>
                <Timer>
                    <!--
                    configure job chains and the number of notifications sent for the same check

                    requirement: timer check for this job chain must be configured in CheckHistoryConfiguration.xml
                    -->
                    <JobChains>
                        <JobChain notifications="1" name="samples/sample_jobChain_1"/>
                    </JobChains>
                </Timer>
            </Timers>
        </NotificationObject>
    </Notification>
</SystemMonitorNotification>

Explanation

  • SystemMonitorNotification can contain several Notification definitions for notification of error or success conditions
    • Notification contains one NotificationMonitor and a NotificationObject
      • NotificationMonitor contains the configuration for delivering notifications to the System Monitor and has the following attributes
        • service_name_on_error (optional) - name of the Service that error and recovery messages are sent to
        • service_name_on_success (optional) - name of the Service that success messages are sent to when an order completes successfully
        • service_status_on_error (optional) - Service status (e.g. CRITICAL or WARNING) sent with error messages. Default: CRITICAL
        • service_status_on_success (optional) - Service status (e.g. SUCCESS) sent with success messages. Default: OK
      • NotificationMonitor can have one of the following elements
        • NotificationCommand - command line for calling an external script for system notification
        • NotificationInterface - calls an API for system notification (currently for NSCA notifications). This element has the following attributes
          • service_host (required) - hostname of the host the notifications are sent from (as it is named in the System Monitor)
          • monitor_port (required) - port on which the System Monitor receives notifications
          • monitor_host (required) - hostname of the System Monitor
          • monitor_encryption (required) - encryption used for the communication with the System Monitor. NONE, XOR and TRIPLE_DES are available.
      • NotificationObject contains the configuration of the objects for which notifications are sent to the System Monitor
        • JobChains (optional) - can contain several JobChain definitions
          • JobChain has the following attributes
            • notifications (optional) - number of notifications for the same problem. Default: 1
            • scheduler_id (optional) - identifier of the JobScheduler instance. Default: the JobChain is checked in all JobScheduler instances that log to the same database
            • name (optional) - Job Chain name including any folder names. Default: all JobChains for the given scheduler_id are checked
            • step_from (optional) - name of the Job node at which checking starts
            • step_to (optional) - name of the Job node at which checking ends
        • Timers (optional) - can contain several Timer definitions
          • Timer has the following elements
            • JobChains (optional) - can contain several JobChain definitions for performance notification
              • JobChain has the following attributes
                • notifications (optional) - number of notifications for the same check. Default: 1
                • scheduler_id (optional) - identifier of the JobScheduler instance. Default: the JobChain is checked in all JobScheduler instances that log to the same database
                • name (optional) - Job Chain name including any folder names. Default: all JobChains for the given scheduler_id are checked
                • step_from (optional) - name of the Job node at which checking starts
                • step_to (optional) - name of the Job node at which checking ends

Sample Notification configuration using NotificationInterface

  ....
  <!--
  notification message substitution in this case:

  All Environment variables   e.g. %TEMP% or %JAVA_HOME%

  %ORDER_HISTORY_ID% ...      table field name of result row for building message (see table definition SCHEDULER_MON_NOTIFICATIONS)
  -->
  <NotificationMonitor service_name_on_error="Error Service">
      <NotificationInterface service_host="my_nsca_service_host" monitor_port="5667" monitor_host="nsca_server_host" monitor_encryption="XOR">
      order history id=%ORDER_HISTORY_ID%, job chain=%JOB_CHAIN_NAME%, order id=%ORDER_ID%, step =%ORDER_STEP_STATE%, error=%ERROR_TEXT%, check = %CHECK_TEXT%
      </NotificationInterface>
  </NotificationMonitor>
  ...
                    

Job Chains

Job Chains for this solution have to be placed under \live\notification. Four Job Chains were implemented for this solution, with the following functions:

  • CheckHistory: reads the JobScheduler database tables that hold the logging, analyses them and writes the results to other tables, the Notification tables.
  • CleanupNotifications: deletes entries from the Notification tables. Currently this takes place once a day.
  • ResetNotifications: sets the status of Notifications in the Notification tables (e.g. Acknowledge).
  • SystemNotifier: notifies the System Monitor about the current notifications and then updates the Notification tables.

System Monitor

  1. The System Monitor only receives passive checks; there are no active checks for monitoring JobScheduler. The only configuration required here is the ability to receive passive checks from a remote host.
  2. The services in the System Monitor have to match the JobScheduler configuration. Passive checks (services) have to be configured and named following the convention used in the XML files described above for JobScheduler (CheckHistoryConfiguration.xml and SystemMonitorNotification_op5.xml).

Use Cases

Recoverable Errors

Initial Situation: A Job Chain is triggered by directory monitoring. That is, when a certain file arrives in a monitored folder, the Job Chain starts.

Problem: The Job Chain ended with an error.

Handling: The System Monitor is notified with an error message for the service related to the Job Chain. If a later execution of the Job Chain, triggered by a new file, ends without errors, this does not mean that the error is recovered, since a different file was processed. The error message in the System Monitor therefore remains until the same file is placed in the monitored directory again and the Job Chain ends without errors.

Configuration:

  • XML CheckHistoryConfiguration.xml: Indicate the ID of the JobScheduler and the name of the Job Chain you want to monitor.
  • XML SystemMonitorNotification.xml: Specify the name of the Service (in the System Monitor) as service_name_on_error, since you want to be informed when the Job Chain ends with an error.
  • System Monitor: Services in the System Monitor have to be configured and named exactly as in SystemMonitorNotification.xml.
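
A minimal sketch of the two configuration fragments for this use case, using only elements and attributes described in the schema sections. The scheduler ID "my_scheduler", the job chain "samples/file_import" and the service name "File Import Errors" are hypothetical placeholders.

```xml
<!-- CheckHistoryConfiguration.xml (fragment): monitor one job chain of one JobScheduler -->
<MonitoredObject>
    <JobChains>
        <!-- scheduler_id and name are hypothetical placeholders -->
        <JobChain scheduler_id="my_scheduler" name="samples/file_import"/>
    </JobChains>
</MonitoredObject>

<!-- SystemMonitorNotification.xml (fragment): report errors to a service of the System Monitor -->
<Notification>
    <!-- the service name must match the service configured in the System Monitor -->
    <NotificationMonitor service_name_on_error="File Import Errors">
        <!-- NotificationCommand or NotificationInterface goes here, as in the samples -->
    </NotificationMonitor>
    <NotificationObject>
        <JobChains>
            <JobChain scheduler_id="my_scheduler" name="samples/file_import"/>
        </JobChains>
    </NotificationObject>
</Notification>
```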

Workflow Execution takes too long

Initial Situation: A Job Chain is triggered but does not end; it hangs in a step and therefore takes longer than expected.

Problem: The execution time was too long.

Handling: A timer is set for this Job Chain, and the System Monitor is notified when it expires. The expiration times for Job Chains are configured with ample time for processing, so this is typically used for cases where the Job Chain hangs in a specific step.

Configuration:

  • XML CheckHistoryConfiguration.xml: As in the example above, indicate the ID of the JobScheduler and the name of the Job Chain you want to monitor. In addition, specify the timer for this Job Chain and the function that calculates the expiration time of the timer.
  • XML SystemMonitorNotification.xml: As in the example above, specify the name of the Service (in the System Monitor) as service_name_on_error, since you want to be informed when the Job Chain ends with an error. In addition, and essential for this case, specify how many times the System Monitor should be notified about the expiration of the timer.
  • System Monitor: As in the example above, Services in the System Monitor have to be configured and named exactly as in SystemMonitorNotification.xml.
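
The timer part of this use case might look like the following sketch. The job chain name, the fixed maximum of 600 seconds and the notification count of 3 are assumed example values; a script function can be used instead of the constant.

```xml
<!-- CheckHistoryConfiguration.xml (fragment, inside MonitoredObject): timer with a fixed maximum execution time -->
<Timers>
    <Timer>
        <JobChains>
            <JobChain name="samples/file_import"/>
        </JobChains>
        <!-- 600 seconds is an assumed value -->
        <Maximum><Script language="javascript"><![CDATA[600]]></Script></Maximum>
    </Timer>
</Timers>

<!-- SystemMonitorNotification.xml (fragment, inside NotificationObject): 3 notifications for the same timer check -->
<Timers>
    <Timer>
        <JobChains>
            <JobChain notifications="3" name="samples/file_import"/>
        </JobChains>
    </Timer>
</Timers>
```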

SFTP connection refused

Initial Situation: A Job Chain uses SFTP for transferring files. A setback is configured for this step of the Job Chain, so that if the connection to the SFTP server fails, the step is retried after some time.

Problem: The SFTP server is no longer available.

Handling: The System Monitor is notified with an error message for the service related to the Job Chain. However, you do not want a large number of notifications for a Job Chain when an external factor, the connection to the SFTP server, is causing the error.

Configuration:

  • XML CheckHistoryConfiguration.xml: As in the example above, indicate the ID of the JobScheduler and the name of the Job Chain you want to monitor.
  • XML SystemMonitorNotification.xml: As in the example above, specify the name of the Service (in the System Monitor) as service_name_on_error, since you want to be informed when the Job Chain ends with an error. In addition, and very important in this case, specify how many times this Job Chain should notify your System Monitor about the error connecting to the SFTP server. You can use step_from and step_to to limit the number of notifications for this specific step.
  • System Monitor: As in the example above, Services in the System Monitor have to be configured and named exactly as in SystemMonitorNotification.xml.
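
A sketch of how the notifications for the SFTP step could be limited. The job chain name "samples/sftp_transfer" and the node name "transfer" are hypothetical; only the attributes notifications, name, step_from and step_to come from the schema description.

```xml
<!-- SystemMonitorNotification.xml (fragment, inside NotificationObject) -->
<JobChains>
    <!-- one notification only for errors in the assumed SFTP node "transfer" -->
    <JobChain notifications="1" name="samples/sftp_transfer" step_from="transfer" step_to="transfer"/>
</JobChains>
```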

Thresholds

Initial Situation: For example, a specific number of workflow executions have to complete successfully by a specific time. That is, a value has to be monitored in order to determine whether this quota was reached.

Handling: A new service for History is configured, so that the workflow executions (Job Chains in JobScheduler vocabulary) report to the System Monitor that they were executed and finished.

Configuration:

  • XML CheckHistoryConfiguration.xml: As in the example above, indicate the ID of the JobScheduler and the name of the Job Chain you want to monitor.
  • XML SystemMonitorNotification.xml: Specify the name of the Service (in the System Monitor), but now as service_name_on_success, since you want to be informed when the Job Chain ends successfully, and not only when it ends with an error.
  • System Monitor: As in the example above, Services in the System Monitor have to be configured and named exactly as in SystemMonitorNotification.xml.
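
A sketch of a success notification for this use case. The service name "Import History" and the job chain name are hypothetical; the host names, port and encryption are copied from the NotificationInterface sample. The status attribute is left out, so the default (OK) applies.

```xml
<!-- SystemMonitorNotification.xml (fragment) -->
<Notification>
    <!-- "Import History" is a hypothetical service name in the System Monitor -->
    <NotificationMonitor service_name_on_success="Import History">
        <NotificationInterface service_host="my_nsca_service_host" monitor_port="5667" monitor_host="nsca_server_host" monitor_encryption="XOR">
        job chain=%JOB_CHAIN_NAME%, order id=%ORDER_ID%, order history id=%ORDER_HISTORY_ID%
        </NotificationInterface>
    </NotificationMonitor>
    <NotificationObject>
        <JobChains>
            <JobChain name="samples/file_import"/>
        </JobChains>
    </NotificationObject>
</Notification>
```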

Acknowledgement

Initial Situation: An alert for a Service has been sent to the System Monitor, and a mail has been sent to the Service Desk (Support Team) to notify them about it.

Handling: The problem is known to the Service Desk, and they acknowledge it. Through the acknowledgement, JobScheduler is notified as well and sends no further notifications for this Service to the System Monitor until the Service has recovered.

Configuration:

  • System Monitor: The acknowledgement in the System Monitor notifies JobScheduler by executing a script. This is handled like any other notification, such as sending a mail, except that the action executed is a script that contacts JobScheduler and adds an order to the Job Chain ResetNotifications described above.
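
As a sketch, the script could send JobScheduler an XML command along the following lines. The job chain path assumes that ResetNotifications is located in the notification folder mentioned earlier; the parameter names and values are hypothetical, since the parameters expected by ResetNotifications are not described here.

```xml
<!-- adds an order to the ResetNotifications job chain -->
<add_order job_chain="/notification/ResetNotifications">
    <params>
        <!-- hypothetical parameters identifying the acknowledged service -->
        <param name="service_name" value="Error Service"/>
        <param name="operation" value="acknowledge"/>
    </params>
</add_order>
```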