JS7 implements resilience at the following levels:
- Architecture: All components can be clustered for high availability, implementing an active-passive cluster architecture with automated fail-over. For details see the JS7 - System Architecture article.
- Communication: Components communicate asynchronously. In practice, this means that any component can be shut down or suffer an outage without affecting the availability of any other component. Components reconcile after restart and synchronize state information to catch up with the latest processing results. For details see the JS7 - Implementation Architecture article.
- Programming: The programming model is based on the handling of asynchronous events that are raised for state transitions. For details see the JS7 - REST Web Service API article.
Allocation of Duties
Each component is assigned a specific duty:
- The JOC Cockpit manages the inventory of workflows, jobs and related objects. In addition, the JOC Cockpit is used to monitor and control workflow execution by other components.
- An outage of the JOC Cockpit does not impact workflow execution by the Controller or by Agents.
- An outage of the JOC Cockpit means that users cannot see which workflows are currently being executed. However, this does not mean that workflows are not running.
- Results of workflow execution are reported by the Controller later, once the JOC Cockpit becomes available again.
- The Controller orchestrates Agents and forwards, for example, JS7 - Workflows and the JS7 - Daily Plan to Agents.
- If the Controller is not available then this does not affect availability of the JOC Cockpit.
- For Agents, the loss of a connection from the Controller means that they cannot immediately report execution results. However, Agents continue to execute workflows as far as autonomous workflow instructions allow and store information about JS7 - Order State Transitions and the log output created by jobs in their journal. This information is forwarded to a Controller later. You can find details about the JS7 - Workflow Instructions that are eligible for Agent autonomy in the JS7 - Workflow Execution with Controllers and Agents article.
- A prominent exception to this rule is workflows that implement cross-platform scheduling, i.e. executing jobs within the same workflow on different Agents. In this situation an Agent can proceed with a workflow only to the nodes that are assigned to that Agent.
- An Agent will execute JS7 - Workflow Instructions as long as the instructions - including the execution of jobs - are assigned to that Agent.
- Agents expect Controllers to establish a connection and will respond to connection requests but cannot actively establish a connection to a Controller.
- Agents receive Workflow configurations and the Daily Plan from a Controller and know when to run orders. Agents therefore work semi-autonomously within the limits of being assigned the relevant workflow instructions.
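The limit of Agent autonomy in a cross-platform workflow can be illustrated with a minimal sketch. It assumes a strongly simplified model in which a workflow is a flat list of instructions, each assigned to exactly one Agent. The function `run_autonomously`, the tuple layout and the Agent names are hypothetical and are not part of JS7.

```python
def run_autonomously(workflow, position, agent, journal):
    """Advance through the workflow while instructions belong to `agent`.

    Hypothetical model: `workflow` is a list of (instruction, assigned_agent)
    tuples. Each executed instruction is recorded in the Agent's local
    `journal` so that it can be reported to a Controller later.
    Returns the position at which the Agent has to stop.
    """
    while position < len(workflow) and workflow[position][1] == agent:
        instruction, _ = workflow[position]
        journal.append(f"{instruction} executed by {agent}")
        position += 1
    return position


# Two jobs on agent-a are followed by one job on agent-b.
workflow = [("job-1", "agent-a"), ("job-2", "agent-a"), ("job-3", "agent-b")]
journal = []
stopped_at = run_autonomously(workflow, 0, "agent-a", journal)
```

In this sketch `agent-a` stops at position 2, because the next instruction is assigned to `agent-b`. Handing the order over to the other Agent is the point at which a Controller would be needed.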
Redundancy is provided by clustering the components for automated fail-over as follows:
- The JOC Cockpit can be operated as an active-passive cluster with one active instance and any number of passive instances.
- The Controller implements an active-passive cluster with one active instance and one passive instance.
- The Controller implements clustering and journaling by itself and does not require additional components such as a DBMS, see JS7 - Controller Cluster.
- Cluster members couple and synchronize automatically. Fail-over time is typically around 3 seconds.
- The Agent offers both an active-passive cluster and an active-active cluster, see the JS7 - Agent Cluster article.
Asynchronous communication means that messages are sent to a partner component without relying on that component being available: there is no guarantee that a message is received by its recipient, nor can it be assumed that the recipient will respond in good time.
- If the communication between components breaks, for example due to a connection loss or a network issue, then the calling component will repeatedly try to reconnect to the partner component. This mechanism works for the duration of the outage - for minutes, hours or days.
- If messages cannot be forwarded then they are stored in memory for a later retry:
- Messages about status information requests are held in memory only and are lost if the calling component is restarted.
- Messages about status change requests are stored persistently.
- It therefore makes no sense to restart a calling component if the partner component is not available. The mantra to "restart the Windows server" does not apply to JS7 unless users have good reason to assume that a connection loss is due to issues with system resources.
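The retry behavior described above can be sketched as follows. This is an illustration under assumptions, not JS7 code: the class `OutboundQueue`, its journal file format and the `changes_state` flag are hypothetical. Status change requests are written to a file and reloaded after a restart, while status information requests live in memory only.

```python
import json
import os
import tempfile
from collections import deque


class OutboundQueue:
    """Hypothetical outbound queue of a calling component (not JS7 code)."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.volatile = deque()    # status information requests: memory only
        self.persistent = deque()  # status change requests: memory and file
        # Reload status change requests that were pending before a restart.
        if os.path.exists(journal_path):
            with open(journal_path, encoding="utf-8") as f:
                self.persistent.extend(
                    json.loads(line) for line in f if line.strip()
                )

    def _write_journal(self):
        # Rewrite the file so that it mirrors the pending persistent messages.
        with open(self.journal_path, "w", encoding="utf-8") as f:
            for message in self.persistent:
                f.write(json.dumps(message) + "\n")

    def enqueue(self, message, changes_state):
        if changes_state:
            self.persistent.append(message)
            self._write_journal()
        else:
            self.volatile.append(message)

    def flush(self, send):
        """Try to deliver queued messages; keep whatever cannot be sent."""
        delivered = 0
        for queue, journaled in ((self.persistent, True), (self.volatile, False)):
            while queue:
                try:
                    send(queue[0])
                except ConnectionError:
                    return delivered  # partner unavailable: retry later
                queue.popleft()
                delivered += 1
                if journaled:
                    self._write_journal()
        return delivered
```

A caller would invoke `flush` repeatedly, for example on a timer, until the partner component responds. Creating a new `OutboundQueue` on the same journal path simulates a restart: the status change requests are still pending, while the status information requests are gone. This is why restarting the calling component during an outage gains nothing.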
The programming model includes handling of asynchronous events that are passed between components:
- The Controller and Agents raise events for order state transitions.
- The Controller subscribes to events that originate from Agents. The JOC Cockpit subscribes to events that are forwarded from a Controller.
- The asynchronous nature of events is handled by the receiving component. Events remain with the originating component until the receiving component confirms receipt. Only then are events released from the originating component.
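The confirmation step can be sketched as a small event store that retains events until the receiver acknowledges them. The class `EventJournal`, its method names and the numeric event ids are assumptions made for illustration and do not reflect the JS7 API.

```python
class EventJournal:
    """Hypothetical event store of an originating component (not JS7 code)."""

    def __init__(self):
        self._events = []  # (event_id, payload) tuples, oldest first
        self._next_id = 1

    def raise_event(self, payload):
        """Record a new event and return its id."""
        event_id = self._next_id
        self._next_id += 1
        self._events.append((event_id, payload))
        return event_id

    def events_after(self, last_seen_id):
        """Return retained events newer than `last_seen_id` without releasing them."""
        return [event for event in self._events if event[0] > last_seen_id]

    def acknowledge(self, up_to_id):
        """Release all events up to and including `up_to_id`."""
        self._events = [event for event in self._events if event[0] > up_to_id]

    def retained(self):
        return len(self._events)


agent = EventJournal()
agent.raise_event("OrderStarted")
agent.raise_event("OrderProcessed")

batch = agent.events_after(0)    # the receiver fetches events; nothing is released yet
agent.acknowledge(batch[-1][0])  # receipt confirmed: the events are released
```

If the receiver fails between fetching and acknowledging, the events are still held by the originating component and can be fetched again, so an outage on the receiving side does not lose events.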