
Purpose

JS7 implements resilience at the following levels:

  • Architecture: All components can be clustered for high availability, implementing an active-passive cluster architecture with automated fail-over. For details see JS7 - System Architecture.
  • Communication: Components communicate asynchronously. In practice this means that any component can be shut down or can suffer an outage without affecting the availability of any other component. Components reconcile after restart and synchronize state information to catch up with the latest processing results. For details see JS7 - Implementation Architecture.
  • Programming: The programming model is based on the handling of asynchronous events that are raised for state transitions. For details see JS7 - REST Web Service API.

Allocation of Duties

Each component is assigned a specific duty:

  • The JOC Cockpit manages the inventory of workflows, jobs and related objects. In addition, the JOC Cockpit is used to monitor and to control workflow execution by other components.
    • An outage of JOC Cockpit does not impact workflow execution by the Controller or by Agents.
    • During an outage of JOC Cockpit users cannot see which workflows are currently being executed. However, this does not mean that workflows are not running.
    • Results of workflow execution are reported by the Controller later, once the JOC Cockpit becomes available again.
  • The Controller orchestrates Agents and forwards, for example, JS7 - Workflows and the JS7 - Daily Plan to Agents.
    • If the Controller is not available, this does not affect the availability of JOC Cockpit.
    • For Agents, the loss of the connection from the Controller means that they cannot immediately report back execution results. However, Agents continue to execute workflows as far as autonomous workflow instructions allow, and they store information about JS7 - Order State Transitions and the log output created by jobs in their journal. This information is forwarded to the Controller later. For details about the JS7 - Workflow Instructions that are eligible for Agent autonomy see the JS7 - Workflow Execution with Controllers and Agents article.
    • A prominent exception to this rule is a workflow that implements cross-platform scheduling, i.e. one that executes jobs within the same workflow on different Agents. In this situation an Agent can proceed with a workflow only to the nodes that are assigned to that Agent.
  • An Agent will execute JS7 - Workflow Instructions as long as the instructions - including the execution of jobs - are assigned to that Agent.
    • Agents expect Controllers to establish a connection and will respond to connection requests but cannot actively establish a connection to a Controller.
    • Agents receive workflow configurations and the Daily Plan from a Controller and therefore know when to run orders. Agents work semi-autonomously within the limits of the workflow instructions that are assigned to them.
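
The Agent autonomy described above can be illustrated with a minimal sketch. It is not JS7 code: the Node and Agent classes, the agent names and the journal format are hypothetical and only model the rule that an Agent proceeds through the workflow nodes assigned to it, journals state transitions while the Controller is unreachable and forwards them later.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One workflow position and the Agent it is assigned to (hypothetical model)."""
    name: str
    agent: str


@dataclass
class Agent:
    """Illustrative Agent that journals transitions while the Controller is unreachable."""
    agent_id: str
    journal: list = field(default_factory=list)

    def run(self, workflow, start=0):
        """Execute consecutive nodes assigned to this Agent; return the next position."""
        position = start
        while position < len(workflow) and workflow[position].agent == self.agent_id:
            # Record the order state transition locally instead of reporting it.
            self.journal.append(f"{workflow[position].name}: finished")
            position += 1
        return position

    def forward_journal(self):
        """Hand over journaled transitions once the Controller has reconnected."""
        pending, self.journal = self.journal, []
        return pending


# Cross-platform workflow: the third job is assigned to a different Agent.
workflow = [Node("job1", "agentA"), Node("job2", "agentA"), Node("job3", "agentB")]
agent_a = Agent("agentA")
stopped_at = agent_a.run(workflow)
```

In this sketch agentA stops in front of job3 because that node is assigned to agentB. Moving the order to the other Agent would require the Controller.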

Cluster Architecture

Redundancy is provided by clustering the components for automated fail-over as follows:

  • The JOC Cockpit can be operated as an active-passive cluster with one active instance and any number of passive instances.
    • Fail-over is handled automatically within 60s between cluster members by the JS7 - Cluster Service.
    • The JOC Cockpit cluster relies on a persistence layer provided by the JS7 - Database.
  • The Controller implements an active-passive cluster with one active instance and one passive instance.
    • The Controller implements clustering and journaling by itself and does not require additional components such as a DBMS.
    • Cluster members couple and synchronize automatically. Fail-over time is typically around 3s.
  • The Agent offers both an active-passive cluster and an active-active cluster.

Communication

Asynchronous communication means that messages are sent to a partner component without relying on the availability of that component: it is neither guaranteed that a message is received by its recipient, nor can it be assumed that the recipient will respond in good time.

  • If the communication between components breaks, e.g. due to a connection loss or network issue, then the calling component will repeatedly try to reconnect to the partner component. Reconnection attempts continue for the duration of the outage, whether it lasts minutes, hours or days.
  • If messages cannot be forwarded, they are kept for a later retry:
    • Messages about status information requests are held in memory only and are lost if the calling component is restarted.
    • Messages about status change requests are stored persistently.
  • Therefore it makes no sense to restart a calling component if the partner component is not available. The mantra to "restart the Windows server" does not apply to JS7 unless there is good reason to assume that a connection loss is caused by issues with system resources.
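
The retry behavior above can be sketched as follows. This is an illustration under assumptions, not the JS7 implementation: the Outbox class, the JSON-lines journal file and the message contents are hypothetical. The sketch only shows why a restart of the calling component discards status information requests while status change requests survive.

```python
import json
import os
import tempfile
from collections import deque


class Outbox:
    """Hypothetical outbox of a calling component."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.volatile = deque()    # status information requests: memory only
        self.persistent = deque()  # status change requests: also written to disk
        if os.path.exists(journal_path):
            # Reload persisted status change requests after a restart.
            with open(journal_path, encoding="utf-8") as fh:
                self.persistent.extend(json.loads(line) for line in fh if line.strip())

    def submit(self, message, changes_state):
        if changes_state:
            self.persistent.append(message)
            self._save()
        else:
            self.volatile.append(message)

    def _save(self):
        with open(self.journal_path, "w", encoding="utf-8") as fh:
            for message in self.persistent:
                fh.write(json.dumps(message) + "\n")

    def flush(self, send):
        """Try to deliver queued messages; keep them if the partner is unreachable."""
        delivered = 0
        for queue in (self.persistent, self.volatile):
            while queue:
                try:
                    send(queue[0])
                except ConnectionError:
                    return delivered  # partner still down: retry on the next call
                queue.popleft()
                delivered += 1
                if queue is self.persistent:
                    self._save()
        return delivered


def unreachable(message):
    raise ConnectionError("partner component is not available")


path = os.path.join(tempfile.mkdtemp(), "outbox.jsonl")
outbox = Outbox(path)
outbox.submit({"request": "order state"}, changes_state=False)
outbox.submit({"request": "cancel order"}, changes_state=True)
delivered_during_outage = outbox.flush(unreachable)  # both messages stay queued

restarted = Outbox(path)  # simulated restart of the calling component
sent = []
restarted.flush(sent.append)  # only the persisted status change request is left
```

In the sketch the restart gains nothing: the partner component is still the one that has to come back, and the restarted instance has lost its pending status information request.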

Programming Model

The programming model is based on the handling of asynchronous events that are passed between components:

  • The Controller and Agents raise events for order state transitions.
  • The Controller subscribes to events that originate from Agents. JOC Cockpit subscribes to events that are forwarded from a Controller.
  • The asynchronous nature of events is handled by the receiving component. Events remain with the originating component until the receiving component confirms receipt. Only then are they released from the originating component.
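
The confirm-then-release handling of events can be sketched as below. The EventSource class, its method names and the numeric event ids are assumptions made for illustration and do not reflect the JS7 API. The point is that fetching events does not remove them; only an explicit acknowledgement does.

```python
class EventSource:
    """Hypothetical originating component that keeps events until they are acknowledged."""

    def __init__(self):
        self._events = []  # list of (event_id, payload), ordered by event_id
        self._next_id = 1

    def raise_event(self, payload):
        event_id = self._next_id
        self._events.append((event_id, payload))
        self._next_id += 1
        return event_id

    def events_after(self, last_seen_id):
        """Return retained events newer than the given id, e.g. after a reconnect."""
        return [event for event in self._events if event[0] > last_seen_id]

    def acknowledge(self, up_to_id):
        """Release events once the receiving component has confirmed receipt."""
        self._events = [event for event in self._events if event[0] > up_to_id]


source = EventSource()
source.raise_event("order A: started")
source.raise_event("order A: finished")

first_fetch = source.events_after(0)    # receiver reads both events
retry_fetch = source.events_after(0)    # a retry sees the same events: nothing was released
source.acknowledge(first_fetch[-1][0])  # receiver confirms receipt
remaining = source.events_after(0)      # events are released only now
```

Because events stay with the originating component until they are acknowledged, a receiving component that was down or restarted can fetch the same events again without loss.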

Further References

