Introduction

The JITL MonitoringJob template can be used to perform health checks of JS7 JOC Cockpit, Controller and Agents. Health check results can be forwarded, for example by mail.

  • Users can use health status results for integration with their monitoring system.
  • SOS offers a 24/7 Monitoring Service that receives health status results from customers who hold a commercial license and subscribe to this support option, see JS7 - License.

The JITL MonitoringJob template can be used as a building block in a monitoring solution to:

  • repeatedly run the MonitoringJob template using a JS7 - Cycle Instruction,
  • forward health check results to a monitoring solution.
    • When used with a user's monitoring solution, this can include forwarding health check report files.
    • This can include sending e-mails containing a notice or an alert to SOS. Such mails do not include any data related to the user's JS7 environment; they only indicate a notice or an alert.
      • Alert mails are deliberately kept simple.

The job template makes use of the JS7 - REST Web Service API to retrieve information from the JOC Cockpit.

FEATURE AVAILABILITY STARTING FROM RELEASE 2.4.1

Usage

When defining the job either:

  • invoke the Wizard that is available from the job properties tab in the Configuration view and select the JITL MonitoringJob and relevant arguments from the Wizard

or

  • specify the JITL job class and the Java class name com.sos.jitl.jobs.monitoring.MonitoringJob, then add the required arguments.

Example

Download (upload .json): pdmMonitoring.workflow.json

Using the Example

It is recommended to use the example as a starting point and to adjust the parameterization:


Explanation:

  • A JS7 - Cycle Instruction is used in order to repeatedly perform health status checks.
    • Users should adjust cycles to their monitoring needs.
  • A JS7 - Retry Instruction is used to retry execution, for example of the included MailJob if an e-mail cannot be sent.
  • The MonitoringJob is used to perform the health status check.
  • The MailJob is used to send notices and alerts by mail. This is an option - users might apply other means to forward notices and alerts.


The Cycle Instruction is configured like this:


Explanation:

  • A ticking cycle is used in order to perform health status checks precisely at the given hour and minute.
  • The cycle runs at hourly intervals on every day of the week.
  • The cycle period starts at midnight and lasts 24 hours.
  • This example results in 24/7 coverage with the health status check being performed every hour.
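
The cycle described above can be illustrated with a short Python sketch that lists the start times such a cycle yields for one day. This is not JS7 code: the function name ticking_cycle_starts and its parameters are hypothetical, and the sketch assumes a period that starts at midnight, lasts 24 hours and ticks every 60 minutes.

```python
from datetime import datetime, timedelta


def ticking_cycle_starts(day, period_start_seconds=0,
                         period_duration_seconds=86400, interval_seconds=3600):
    """Return the start times of a ticking cycle for one day.

    Hypothetical helper for illustration: ticks are bound to the clock,
    so a slow run does not shift the ticks that follow.
    """
    begin = datetime(day.year, day.month, day.day) + timedelta(seconds=period_start_seconds)
    end = begin + timedelta(seconds=period_duration_seconds)
    starts = []
    tick = begin
    while tick < end:
        starts.append(tick)
        tick += timedelta(seconds=interval_seconds)
    return starts


starts = ticking_cycle_starts(datetime(2022, 8, 17))
print(len(starts))            # number of health status checks per day
print(starts[0], starts[-1])  # first and last start time of the day
```

Changing interval_seconds or the period values in the sketch shows how a different cycle configuration would change the number of checks per day.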


The Retry Instruction is configured like this:


Explanation:

  • If any of the jobs included in the Retry Instruction fails then execution is resumed starting from the first job.
  • Execution is repeated up to 3 times unless successful. The same interval of 1 minute is applied for each retry.
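
The retry behavior can be sketched in Python as follows. The sketch is an illustration under assumptions, not the Retry Instruction's implementation: the name run_with_retry is hypothetical, the value 3 is treated as the total number of attempts, and both jobs are assumed to sit inside the retry block. The sleep parameter is only there so that the example can skip the one-minute wait.

```python
import time


def run_with_retry(jobs, max_tries=3, delay_seconds=60, sleep=time.sleep):
    """Run all jobs in order; on any failure wait, then start again from the first job.

    `jobs` is a list of callables that raise an exception on failure.
    Returns the number of attempts used or re-raises the last error.
    """
    for attempt in range(1, max_tries + 1):
        try:
            for job in jobs:
                job()
            return attempt
        except Exception:
            if attempt == max_tries:
                raise
            sleep(delay_seconds)  # the same delay is applied before every retry


# Usage: a mail job that fails twice and then succeeds.
calls = {"monitoring": 0, "mail": 0}


def monitoring_job():
    calls["monitoring"] += 1


def mail_job():
    calls["mail"] += 1
    if calls["mail"] < 3:
        raise RuntimeError("mail server not reachable")


attempts = run_with_retry([monitoring_job, mail_job], sleep=lambda seconds: None)
print(attempts, calls)
```

Because execution resumes from the first job, the monitoring job in this sketch runs again on every retry, not only the failed mail job.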


The MonitoringJob makes use of arguments that are explained in the section Using the Job Wizard for the MonitoringJob.

The MailJob is explained in the JS7 - JITL MailJob article.

Using the Job Wizard for the MonitoringJob

You can use the job wizard like this:


Explanation:

  • Add an empty job from the instruction panel.
  • Specify a name and a label for the job.
  • Select an Agent.

As a next step, invoke the job wizard that is available from the upper right corner of the job property editor. The wizard brings up the following popup window:


Explanation:

  • From the list of available job templates select the MonitoringJob.

Then hit the "Next" button to make the job wizard display available arguments:


Explanation:

  • controller_id: Optionally specifies the identification of the Controller to be checked. By default the current Controller is used.
  • monitor_report_dir: Specifies the directory in which the job will store health status report files (.json). The directory has to exist prior to running the job and has to be in reach of the Agent that runs the job. 
    • An absolute or a relative path can be specified.
    • An expression can be used. The example makes use of env('JS7_AGENT_DATA') ++ '/monitor' which translates to use of the JS7_AGENT_DATA environment variable created by the Agent's start script, see JS7 - Job Environment Variables. This environment variable can for example evaluate to /var/sos-berlin.com/js7/agent. The ++ operator indicates concatenation and is followed by the name of a sub-directory. In this example the report directory will be /var/sos-berlin.com/js7/agent/monitor.
  • monitor_report_max_files: The number of report files created will be limited to this value. Older report files will be removed when this value is exceeded.
  • from: Specifies the e-mail address that is used to send mail for notices and alerts. The argument is used by the job to create the subject and body return variables for use with a later MailJob.
  • max_failed_orders: The maximum number of failed orders that are considered acceptable for a health status check. If this number is exceeded then the result return variable will carry a non-zero value indicating a failed health check.
  • Select the check box provided with each argument if you want this argument to be added to the arguments of the MonitoringJob template.
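
How the expression for monitor_report_dir resolves can be sketched in Python. This only mimics the concatenation described above and is not the JS7 expression engine; the function name resolve_report_dir is hypothetical, and raising an error for a missing variable is a choice of this sketch.

```python
import os


def resolve_report_dir(env=None, subdir="monitor"):
    """Mimic env('JS7_AGENT_DATA') ++ '/monitor' using plain string concatenation."""
    env = os.environ if env is None else env
    if "JS7_AGENT_DATA" not in env:
        raise KeyError("JS7_AGENT_DATA is not set")
    return env["JS7_AGENT_DATA"] + "/" + subdir


# Usage with the example value mentioned for the environment variable.
report_dir = resolve_report_dir({"JS7_AGENT_DATA": "/var/sos-berlin.com/js7/agent"})
print(report_dir)  # /var/sos-berlin.com/js7/agent/monitor
```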

On clicking the Submit button, the wizard adds the required arguments to the job.

Using the Job Wizard for the MailJob

Find instructions in the JS7 - JITL MailJob article.

Use of JS7 - Job Resources to specify mail parameterization is encouraged.

Health Status Check

The health status check performed by the MonitoringJob makes use of the JS7 - REST Web Service API

  • to retrieve status information about JOC Cockpit, Controller and Agents,
  • to write this information to a report file,
  • to evaluate whether the information indicates a healthy JS7 environment.

Report File

Find a sample report file for download that indicates an alert: monitor.2022-08-17.09-16-44.9Z.alert.json

Sample Report File
{
  "controllerStatus" : {
    "active" : {
      "id" : 3,
      "surveyDate" : "2022-08-17T08:57:43.000+00:00",
      "controllerId" : "testsuite",
      "title" : "SECONDARY CONTROLLER",
      "host" : "controller-2-0-secondary",
      "url" : "https://controller-2-0-secondary:4443",
      "clusterUrl" : "https://controller-2-0-secondary:4443",
      "role" : "BACKUP",
      "isCoupled" : false,
      "startedAt" : "2022-08-16T18:09:27.000+00:00",
      "version" : "2.5.0-SNAPSHOT+fd0eb39",
      "javaVersion" : "17.0.4+8-alpine-r0",
      "os" : {
        "name" : "Linux",
        "architecture" : "amd64",
        "distribution" : "3.10.0-957.1.3.el7.x86_64"
      },
      "securityLevel" : "MEDIUM"
    },
    "volatileStatus" : {
      "id" : 2,
      "surveyDate" : "2022-08-17T09:16:45.064+00:00",
      "controllerId" : "testsuite",
      "title" : "PRIMARY CONTROLLER",
      "host" : "controller-2-0-primary",
      "url" : "https://controller-2-0-primary:4443",
      "clusterUrl" : "https://controller-2-0-primary:4443",
      "role" : "PRIMARY",
      "isCoupled" : true,
      "startedAt" : "2022-08-16T18:09:26.004+00:00",
      "version" : "2.5.0-SNAPSHOT+fd0eb39",
      "javaVersion" : "17.0.4+8-alpine-r0",
      "os" : {
        "name" : "Linux",
        "architecture" : "amd64",
        "distribution" : "3.10.0-957.1.3.el7.x86_64"
      },
      "securityLevel" : "MEDIUM",
      "componentState" : {
        "severity" : 0,
        "_text" : "operational"
      },
      "connectionState" : {
        "severity" : 0,
        "_text" : "established"
      },
      "clusterNodeState" : {
        "severity" : 0,
        "_text" : "active"
      }
    },
    "permanentStatus" : {
      "id" : 2,
      "surveyDate" : "2022-08-16T18:12:47.169+00:00",
      "controllerId" : "testsuite",
      "title" : "PRIMARY CONTROLLER",
      "host" : "controller-2-0-primary",
      "url" : "https://controller-2-0-primary:4443",
      "clusterUrl" : "https://controller-2-0-primary:4443",
      "role" : "PRIMARY",
      "startedAt" : "2022-08-16T18:09:26.004+00:00",
      "version" : "2.5.0-SNAPSHOT+fd0eb39",
      "javaVersion" : "17.0.4+8-alpine-r0",
      "os" : {
        "name" : "Linux",
        "architecture" : "amd64",
        "distribution" : "3.10.0-957.1.3.el7.x86_64"
      }
    }
  },
  "jocStatus" : {
    "active" : {
      "id" : 2,
      "memberId" : "joc-2-0-primary:97c88ccc3975703ebd0b7277d394ec8768f88b31775e8df038572d2547c240a0",
      "title" : "PRIMARY JOC COCKPIT",
      "current" : true,
      "host" : "joc-2-0-primary",
      "url" : "https://joc-2-0-primary:4443",
      "startedAt" : "2022-08-16T18:10:27.000+00:00",
      "version" : "2.5.0-SNAPSHOT",
      "connectionState" : {
        "severity" : 0,
        "_text" : "established"
      },
      "componentState" : {
        "severity" : 0,
        "_text" : "operational"
      },
      "clusterNodeState" : {
        "severity" : 0,
        "_text" : "active"
      },
      "controllerConnectionStates" : [ {
        "role" : "PRIMARY",
        "state" : {
          "severity" : 0,
          "_text" : "established"
        }
      }, {
        "role" : "BACKUP",
        "state" : {
          "severity" : 0,
          "_text" : "established"
        }
      } ],
      "os" : {
        "name" : "Linux",
        "architecture" : "amd64",
        "distribution" : "3.10.0-957.1.3.el7.x86_64"
      },
      "securityLevel" : "MEDIUM",
      "lastHeartbeat" : "2022-08-17T09:16:37.000+00:00"
    },
    "passive" : [ {
      "id" : 1,
      "memberId" : "joc-2-0-secondary:97c88ccc3975703ebd0b7277d394ec8768f88b31775e8df038572d2547c240a0",
      "title" : "SECONDARY JOC COCKPIT",
      "current" : false,
      "host" : "joc-2-0-secondary",
      "url" : "https://joc-2-0-secondary.sos:7543",
      "startedAt" : "2022-08-16T18:10:27.000+00:00",
      "version" : "2.5.0-SNAPSHOT",
      "connectionState" : {
        "severity" : 0,
        "_text" : "established"
      },
      "componentState" : {
        "severity" : 0,
        "_text" : "operational"
      },
      "clusterNodeState" : {
        "severity" : 1,
        "_text" : "inactive"
      },
      "controllerConnectionStates" : [ {
        "role" : "PRIMARY",
        "state" : {
          "severity" : 0,
          "_text" : "established"
        }
      }, {
        "role" : "BACKUP",
        "state" : {
          "severity" : 0,
          "_text" : "established"
        }
      } ],
      "os" : {
        "name" : "Linux",
        "architecture" : "amd64",
        "distribution" : "3.10.0-957.1.3.el7.x86_64"
      },
      "securityLevel" : "MEDIUM",
      "lastHeartbeat" : "2022-08-17T09:16:37.000+00:00"
    } ]
  },
  "agentStatus" : [ {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_001",
    "agentName" : "primaryAgent",
    "url" : "https://agent-2-0-primary:4443",
    "version" : "2.5.0-SNAPSHOT",
    "state" : {
      "severity" : 0,
      "_text" : "COUPLED"
    },
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 1,
    "isClusterWatcher" : true,
    "disabled" : false
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_002",
    "agentName" : "secondaryAgent",
    "url" : "https://agent-2-0-secondary:4443",
    "version" : "2.5.0-SNAPSHOT",
    "state" : {
      "severity" : 0,
      "_text" : "COUPLED"
    },
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : false
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_004",
    "agentName" : "wintestAgent",
    "url" : "http://192.11.0.146:4245",
    "version" : "2.4.0",
    "state" : {
      "severity" : 0,
      "_text" : "COUPLED"
    },
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : false
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_005",
    "agentName" : "apmaccsAgent",
    "url" : "http://192.11.3.3:4449",
    "state" : {
      "severity" : 2,
      "_text" : "UNKNOWN"
    },
    "healthState" : {
      "severity" : 2,
      "_text" : "NO_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : true
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_006",
    "agentName" : "apmacwinAgent",
    "url" : "http://192.11.2.2:4245",
    "state" : {
      "severity" : 2,
      "_text" : "UNKNOWN"
    },
    "healthState" : {
      "severity" : 2,
      "_text" : "NO_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : true
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_101",
    "agentName" : "agent17",
    "url" : "http://centostest_primary.sos:7775",
    "version" : "2.4.0-beta.20220714",
    "state" : {
      "severity" : 0,
      "_text" : "COUPLED"
    },
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : false
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_009",
    "agentName" : "oracleAgent",
    "url" : "http://minos.sos:4445",
    "version" : "2.4.0-beta.20220714",
    "state" : {
      "severity" : 0,
      "_text" : "COUPLED"
    },
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : false
  }, {
    "subagents" : [ {
      "isDirector" : "PRIMARY_DIRECTOR",
      "agentId" : "agent_cluster_001",
      "subagentId" : "director_primary_001",
      "url" : "https://diragent-2-0-primary:4443",
      "version" : "2.5.0-SNAPSHOT",
      "state" : {
        "severity" : 0,
        "_text" : "COUPLED"
      },
      "orders" : [ ],
      "runningTasks" : 0,
      "isClusterWatcher" : false,
      "disabled" : false
    }, {
      "isDirector" : "NO_DIRECTOR",
      "agentId" : "agent_cluster_001",
      "subagentId" : "subagent_primary_001",
      "url" : "https://subagent-2-0-primary:4443",
      "version" : "2.5.0-SNAPSHOT",
      "state" : {
        "severity" : 0,
        "_text" : "COUPLED"
      },
      "orders" : [ ],
      "runningTasks" : 0,
      "isClusterWatcher" : false,
      "disabled" : false
    }, {
      "isDirector" : "NO_DIRECTOR",
      "agentId" : "agent_cluster_001",
      "subagentId" : "subagent_secondary_001",
      "url" : "https://subagent-2-0-secondary:4443",
      "version" : "2.5.0-SNAPSHOT",
      "state" : {
        "severity" : 0,
        "_text" : "COUPLED"
      },
      "orders" : [ ],
      "runningTasks" : 0,
      "isClusterWatcher" : false,
      "disabled" : false
    }, {
      "isDirector" : "NO_DIRECTOR",
      "agentId" : "agent_cluster_001",
      "subagentId" : "subagent_third_001",
      "url" : "https://subagent-2-0-third:4443",
      "version" : "2.5.0-SNAPSHOT",
      "state" : {
        "severity" : 0,
        "_text" : "COUPLED"
      },
      "orders" : [ ],
      "runningTasks" : 0,
      "isClusterWatcher" : false,
      "disabled" : false
    } ],
    "controllerId" : "testsuite",
    "agentId" : "agent_cluster_001",
    "agentName" : "AgentCluster001",
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : false
  }, {
    "subagents" : [ ],
    "controllerId" : "testsuite",
    "agentId" : "agent_014",
    "agentName" : "winutf8Agent",
    "url" : "http://192.11.0.146:4445",
    "version" : "2.4.0",
    "state" : {
      "severity" : 0,
      "_text" : "COUPLED"
    },
    "healthState" : {
      "severity" : 0,
      "_text" : "ALL_SUBAGENTS_ARE_COUPLED_AND_ENABLED"
    },
    "orders" : [ ],
    "runningTasks" : 0,
    "isClusterWatcher" : false,
    "disabled" : false
  } ],
  "orderSnapshot" : {
    "pending" : 0,
    "scheduled" : 1262,
    "inProgress" : 0,
    "running" : 1,
    "prompting" : 0,
    "suspended" : 0,
    "waiting" : 770,
    "blocked" : 0,
    "failed" : 0,
    "terminated" : 1
  },
  "orderSummary" : {
    "failed" : 0
  }
}

Health Status Checks

The MonitoringJob performs the following health status checks:

  • Controller
    • In volatileStatus the element connectionState includes severity with a value 0.
    • In volatileStatus the element componentState includes severity with a value 0.
    • If role is present and does not carry the value STANDALONE in volatileStatus then the element clusterNodeState has to have severity with a value 0.
    • If role is present and does not contain the value STANDALONE in volatileStatus then the element isCoupled has to have the value true.
  • Agents
    • In agentStatus the healthState is present and has severity with a value 0.
    • In agentStatus the state is present and has severity with a value 0.
    • For each enabled Subagent in subagents the state has severity with a value 0.
  • JOC Cockpit
    • The connectionState has severity with a value 0.
    • The componentState has severity with a value 0.
    • If clusterNodeState is present it has severity with a value 0.
    • If controllerConnectionStates is present the state of each entry has severity with a value 0.

The number of failed checks is reported by the result return variable, see next section.
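
The checks listed above can be applied to a report file with a short Python sketch. It is one reading of the list, not the MonitoringJob's implementation. The names severity_ok and count_problems are hypothetical, and three points are assumptions of the sketch: element names follow the sample report file, the state of an Agent is checked only if it has no Subagents, and only the active JOC Cockpit instance is checked.

```python
def severity_ok(node, key):
    """Return True if node[key] exists and carries severity 0."""
    element = node.get(key)
    return isinstance(element, dict) and element.get("severity") == 0


def count_problems(report):
    """Count failed checks in a report dictionary loaded from a report file."""
    problems = 0

    # Controller: checks are applied to volatileStatus
    volatile = report.get("controllerStatus", {}).get("volatileStatus", {})
    problems += int(not severity_ok(volatile, "connectionState"))
    problems += int(not severity_ok(volatile, "componentState"))
    role = volatile.get("role")
    if role is not None and role != "STANDALONE":
        problems += int(not severity_ok(volatile, "clusterNodeState"))
        problems += int(volatile.get("isCoupled") is not True)

    # Agents
    for agent in report.get("agentStatus", []):
        problems += int(not severity_ok(agent, "healthState"))
        subagents = agent.get("subagents", [])
        if not subagents:  # assumption: cluster Agents carry state per Subagent only
            problems += int(not severity_ok(agent, "state"))
        for subagent in subagents:
            if not subagent.get("disabled", False):
                problems += int(not severity_ok(subagent, "state"))

    # JOC Cockpit (assumption: only the active instance is checked)
    joc = report.get("jocStatus", {}).get("active", {})
    problems += int(not severity_ok(joc, "connectionState"))
    problems += int(not severity_ok(joc, "componentState"))
    if "clusterNodeState" in joc:
        problems += int(not severity_ok(joc, "clusterNodeState"))
    for connection in joc.get("controllerConnectionStates", []):
        problems += int(not severity_ok(connection, "state"))

    return problems


# Usage with a reduced report: one healthy Agent and one Agent in an unknown state.
report = {
    "controllerStatus": {"volatileStatus": {
        "role": "PRIMARY", "isCoupled": True,
        "connectionState": {"severity": 0},
        "componentState": {"severity": 0},
        "clusterNodeState": {"severity": 0}}},
    "jocStatus": {"active": {
        "connectionState": {"severity": 0},
        "componentState": {"severity": 0}}},
    "agentStatus": [
        {"subagents": [], "state": {"severity": 0}, "healthState": {"severity": 0}},
        {"subagents": [], "state": {"severity": 2}, "healthState": {"severity": 2}},
    ],
}
print(count_problems(report))  # 2: state and healthState of the second Agent
```

To apply the sketch to a real report file, load it with json.load and pass the resulting dictionary to count_problems. The count may differ from the job's result variable where the assumptions above do not match the job's behavior.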

Documentation

The Job Documentation including the full list of arguments can be found under: https://www.sos-berlin.com/doc/JS7-JITL/MonitoringJob.xml

Authentication

The Job makes use of the JS7 - REST Web Service API that is available from JOC Cockpit. 

  • The job is executed with an Agent and requires a network connection to JOC Cockpit.
  • The job has to authenticate with JOC Cockpit, for the related configuration see JS7 - JITL Common Authentication.

Arguments

The MonitoringJob class accepts the following arguments:

| Name | Required | Default Value | Purpose | Example |
|---|---|---|---|---|
| controller_id | no | | Optionally specifies the identification of the Controller to be checked. By default the current Controller is used. | controller_prod |
| monitor_report_dir | yes | | Specifies the directory to which the job will store health status report files (.json). This directory has to exist prior to running the job and has to be in reach of the Agent that runs the job. An absolute or relative path can be specified. An expression can be used, for example env('JS7_AGENT_DATA') ++ '/monitor'. | env('JS7_AGENT_DATA') ++ '/monitor', /var/sos-berlin.com/js7/agent/monitor, C:\ProgramData\sos-berlin.com\js7\agent\monitor |
| monitor_report_max_files | yes | | The number of report files created will be limited to this value. Older report files will be removed when this value is exceeded. | 25 |
| from | yes | | Specifies the e-mail address that is used to send mail for notices and alerts. The argument is used by the job to create the subject and body return variables. | js7@example.com |
| max_failed_orders | no | | The maximum number of failed orders that are considered acceptable for a health status check. If this number is exceeded then the result return variable will carry a non-zero value indicating a failed health status check. By default the number of failed orders is not considered for successful/unsuccessful health status checks. | 3 |
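
The effect of monitor_report_max_files can be sketched in Python. This is an illustration under assumptions, not the job's implementation: the name prune_reports is hypothetical, and the sketch assumes that report files are named monitor.<timestamp>.json so that sorting by file name puts the oldest files first.

```python
import tempfile
from pathlib import Path


def prune_reports(report_dir, max_files):
    """Delete the oldest monitor.*.json files so that at most max_files remain."""
    reports = sorted(Path(report_dir).glob("monitor.*.json"))
    excess = reports[:-max_files] if max_files > 0 else reports
    for path in excess:
        path.unlink()
    return [path.name for path in excess]


# Usage: three report files in a temporary directory, limited to two.
with tempfile.TemporaryDirectory() as tmp:
    for day in ("15", "16", "17"):
        (Path(tmp) / f"monitor.2022-08-{day}.09-00-00.0.json").write_text("{}")
    removed = prune_reports(tmp, max_files=2)
    remaining = sorted(path.name for path in Path(tmp).iterdir())

print(removed)    # the oldest report file
print(remaining)  # the two most recent report files
```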

Return Variables

The MonitoringJob class returns the following variables for use by subsequent jobs:

| Name | Data Type | Purpose | Example |
|---|---|---|---|
| monitor_report_date | String | The date and time for which the health status check has been performed. The date format is yyyy-MM-dd.HH-mm-ss.K, for example 2022-07-31.23-12-59.Z indicating UTC time. | 2022-07-31.23-12-59.Z |
| monitor_report_file | String | The path to the report file created for the health status check. | /var/sos-berlin.com/js7/agent/monitor/monitor.2022-08-15.17-35-36.5.json |
| subject | String | The subject of an e-mail for use with a later MailJob. | JS7 Monitor: Notice from: js7@sos-berlin.com at: 2022-08-15.17-35-36.5 |
| body | String | The body of an e-mail for use with a later MailJob. By default the value is the same as for the subject. | JS7 Monitor: Notice from: js7@sos-berlin.com at: 2022-08-15.17-35-36.5 |
| result | Number | The number of problems identified during the health status check. A value 0 indicates absence of problems, other values indicate existence of problems. | 0 |
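
For users who forward results by their own means instead of the MailJob, the following Python sketch shows how the return variables might be consumed. It is an illustration only: the name build_mail and the recipient address are hypothetical, the subject text is the example value from this section, and forwarding only when result is non-zero is an assumption of the sketch. The message is built but not sent.

```python
from email.message import EmailMessage


def build_mail(variables, sender, recipient):
    """Build a mail from the MonitoringJob return variables; return None if healthy."""
    if int(variables["result"]) == 0:
        return None  # assumption of this sketch: only problems are forwarded
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = variables["subject"]
    message.set_content(variables["body"])
    return message


# Usage with example values; the actual wording of an alert may differ.
text = "JS7 Monitor: Notice from: js7@sos-berlin.com at: 2022-08-15.17-35-36.5"
mail = build_mail({"result": 2, "subject": text, "body": text},
                  sender="js7@example.com", recipient="monitoring@example.com")
print(mail["Subject"])

no_mail = build_mail({"result": 0, "subject": text, "body": text},
                     sender="js7@example.com", recipient="monitoring@example.com")
print(no_mail)  # None
```

If notices should be forwarded for healthy checks as well, the early return can be dropped so that a message is built for every run.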

Further Resources


