Purpose

JS7 implements resilience at the following levels:

  • Architecture: All products can be clustered for high availability, implementing an active-passive cluster architecture with automated fail-over. For details see the JS7 - System Architecture article.
  • Communication: Products communicate asynchronously. In practice this means that any product can be shut down or can suffer an outage without affecting the availability of any other product. After a restart, products reconcile and synchronize state information to catch up with the latest processing results. For details see the JS7 - Implementation Architecture article.
  • Programming: The programming model is based on the handling of asynchronous events that are raised for state transitions. For details see the JS7 - REST Web Service API article.

Allocation of Duties

Each product is assigned a specific duty:

  • The JOC Cockpit manages the inventory of workflows, jobs and related objects. In addition, the JOC Cockpit is used to monitor and control workflow execution by other products.
    • An outage of the JOC Cockpit does not impact workflow execution by the Controller or by Agents.
    • An outage of the JOC Cockpit means that users cannot see which workflows are currently being executed. However, this does not mean that workflows are not running.
    • Results of workflow execution are reported by a Controller once the JOC Cockpit becomes available again.
  • The Controller orchestrates Agents and forwards, for example, JS7 - Workflows and the JS7 - Daily Plan to Agents.
    • If the Controller is not available then this does not affect availability of the JOC Cockpit. 
    • For Agents, the loss of the connection from the Controller means that they cannot immediately report execution results. However, Agents continue to execute workflows as far as autonomous workflow instructions allow, and they store information about JS7 - Order State Transitions and the log output created by jobs in their journal. This information is forwarded to a Controller later. You can find details about the JS7 - Workflow Instructions that are eligible for Agent autonomy in the JS7 - Workflow Execution with Controllers and Agents article.
    • A prominent exception to this rule are workflows that implement cross-platform scheduling, i.e. workflows that execute jobs on different Agents. In this situation an Agent can proceed with a workflow only to the nodes that are assigned to that Agent.
  • An Agent will execute JS7 - Workflow Instructions as long as the instructions - including the execution of jobs - are assigned to that Agent.
    • Agents expect Controllers to establish a connection and will respond to connection requests but cannot actively establish a connection to a Controller.
    • Agents receive Workflow configurations and the Daily Plan from a Controller and know when to run orders. Agents therefore work semi-autonomously within the limits of being assigned the relevant workflow instructions.
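
The limit of Agent autonomy described above can be illustrated with a minimal Python sketch. It is not JS7 code: the function `autonomous_steps`, the job names and the Agent names are hypothetical, and the sketch assumes a purely linear workflow. It does not model which instruction types are eligible for autonomy.

```python
def autonomous_steps(workflow, start, agent):
    """Return the instructions an Agent could run without its Controller.

    `workflow` is a list of (instruction, assigned_agent) pairs (hypothetical
    representation). The Agent proceeds from position `start` until it reaches
    an instruction that is assigned to a different Agent.
    """
    steps = []
    for instruction, assigned_agent in workflow[start:]:
        if assigned_agent != agent:
            break  # hand-over to another Agent requires the Controller
        steps.append(instruction)
    return steps


# Cross-platform workflow: job3 runs on a different Agent.
workflow = [
    ("job1", "agentA"),
    ("job2", "agentA"),
    ("job3", "agentB"),
    ("job4", "agentA"),
]

print(autonomous_steps(workflow, 0, "agentA"))  # ['job1', 'job2']
```

In this sketch agentA stops before job3, even though job4 is assigned to it again, because the order first has to pass through a node owned by agentB.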

Cluster Architecture

Redundancy and restart capability are provided by clustering the JS7 products for automated fail-over as follows:

  • The JOC Cockpit can be operated as an active-passive cluster with one active instance and any number of passive instances.
    • Fail-over is handled automatically within 60s between cluster members by the JS7 - Cluster Service.
    • The JOC Cockpit cluster relies on a persistence layer provided by the JS7 - Database.
  • The Controller implements an active-passive cluster with one active instance and one passive instance.
    • The Controller implements clustering and journaling by itself and does not require additional products such as a DBMS, see JS7 - Controller Cluster.
    • Cluster members couple and synchronize automatically. Fail-over time is typically around 3s.
  • The Agent offers both an active-passive cluster and an active-active cluster, see the JS7 - Agent Cluster article.

Communication

Asynchronous communication means that messages are sent to a partner product without relying on the availability of that product: it is neither guaranteed that a message is received by its recipient nor can it be assumed that the recipient will be able to respond in good time.

  • If the communication between products breaks, for example due to a connection loss or a network issue, then the calling product will repeatedly try to reconnect to the partner product. This mechanism works for the duration of the outage - for minutes, hours or days.
  • If messages cannot be forwarded then they are stored for later retry:
    • Messages about status information requests are held in memory only and are lost if the calling product is restarted.
    • Messages about status change requests are stored persistently.
  • Therefore it makes no sense to restart a calling product if the partner product is not available. The mantra to "restart the Windows server" does not apply to JS7 except when users have good reason to assume that a connection loss is due to issues with system resources.
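
The difference between the two kinds of messages can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the JS7 implementation: the `Outbox` class, the `send` callback, the JSON file and the message texts are all hypothetical. The sketch only shows why a restart of the calling product loses queued status information requests while status change requests survive.

```python
import json
import os
import tempfile


def _deliver(messages, send):
    """Send messages in order; return those still pending after the first failure."""
    for i, message in enumerate(messages):
        if not send(message):
            return messages[i:]
    return []


class Outbox:
    """Hypothetical outbox of a calling product (illustration only)."""

    def __init__(self, path):
        self.path = path                      # file that survives a restart
        self.info_requests = []               # status information requests: memory only
        self.change_requests = self._load()   # status change requests: persisted

    def _load(self):
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return []

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.change_requests, f)

    def queue_info(self, message):
        self.info_requests.append(message)

    def queue_change(self, message):
        self.change_requests.append(message)
        self._save()

    def flush(self, send):
        """Retry delivery; keep whatever the partner product did not accept."""
        self.info_requests = _deliver(self.info_requests, send)
        self.change_requests = _deliver(self.change_requests, send)
        self._save()


path = os.path.join(tempfile.mkdtemp(), "outbox.json")
outbox = Outbox(path)
outbox.queue_info("get order status")
outbox.queue_change("cancel order")
outbox.flush(lambda m: False)    # partner product unavailable: nothing is delivered

restarted = Outbox(path)         # simulates a restart of the calling product
delivered = []
restarted.flush(lambda m: delivered.append(m) or True)  # partner is back

print(delivered)                 # ['cancel order']
```

After the simulated restart only the status change request is delivered; the status information request that was held in memory is gone, which is why restarting the calling product during an outage gains nothing.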

Programming Model

The programming model includes handling of asynchronous events that are passed between products:  

  • The Controller and Agents raise events for order state transitions.
  • The Controller subscribes to events that originate from Agents. The JOC Cockpit subscribes to events that are forwarded from a Controller.
  • The asynchronous nature of events is handled by the receiving product. Events remain in place with the originating product until the receiving product confirms receipt. Only then are events released from the originating product.
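
The confirm-then-release behavior can be sketched as follows. This is a minimal Python illustration, not JS7 code: the `EventJournal` class, its method names, the numeric event IDs and the event names are hypothetical and chosen only to show that reading events does not remove them and that only a confirmation releases them.

```python
class EventJournal:
    """Hypothetical journal of an originating product (illustration only)."""

    def __init__(self):
        self._events = []    # (event_id, payload) pairs, oldest first
        self._next_id = 1

    def raise_event(self, payload):
        event_id = self._next_id
        self._events.append((event_id, payload))
        self._next_id += 1
        return event_id

    def events_after(self, last_seen_id):
        """Return events newer than last_seen_id; nothing is removed here."""
        return [e for e in self._events if e[0] > last_seen_id]

    def acknowledge(self, up_to_id):
        """Release events only after the receiving product confirms receipt."""
        self._events = [e for e in self._events if e[0] > up_to_id]


journal = EventJournal()
journal.raise_event("OrderStarted")      # hypothetical event names
journal.raise_event("OrderProcessed")

first_read = journal.events_after(0)
second_read = journal.events_after(0)    # receiver failed before confirming: same events again
journal.acknowledge(first_read[-1][0])   # receiver confirms receipt up to the last event
remaining = journal.events_after(0)

print(second_read)   # [(1, 'OrderStarted'), (2, 'OrderProcessed')]
print(remaining)     # []
```

Because events are kept until they are acknowledged, a receiving product that was unavailable can fetch the same events again after it returns, at the cost of the originating product holding them in the meantime.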

Further Resources

