Introduction

Use of a Controller Cluster provides high availability and is a feature subject to the JS7 - License.

  • Fail-over is an automated operation that occurs when the Primary Controller is aborted or killed. Fail-over is applied in case of abnormal termination only.
  • Switch-over is an operation that is caused by user intervention in JOC Cockpit or by use of the JS7 - REST Web Service API. The procedure includes normal termination of an Active Controller Instance.

For fail-over and switch-over a Cluster Watch Agent is required.

For command line references see the JS7 - Controller - Command Line Operation article.

Cluster Roles

Controller Cluster

The documentation frequently refers to a Primary Controller Instance and a Secondary Controller Instance. The names suggest that one Controller Instance is primarily used and the other serves backup purposes.

  • In cluster terms it is more precise to refer to the Active Controller Instance and the Standby Controller Instance, regardless of whether the Primary or the Secondary Controller Instance is currently active.
  • A Controller implements an active-passive cluster. However, the term passive is misleading: the Standby Controller Instance is not passive at all but records any state transitions occurring in the Active Controller Instance. Both Controller instances hold a journal of state transitions that is actively synchronized. Fail-over and switch-over will occur only if both Controller instances' journals are in sync.
  • The Cluster presents itself as a single unit to the outside world, i.e. to JOC Cockpit and to Agents.
    • Any operations performed in JOC Cockpit are automatically applied to the Active Controller Instance.
    • At any point in time only one Controller instance is active and the other instance is in standby mode.
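The distinction between the configured roles (Primary, Secondary) and the runtime roles (Active, Standby) can be illustrated with a minimal sketch. This is a toy model and not JS7 code: the class `ControllerCluster`, its fields and the method `switch_over` are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ControllerCluster:
    """Toy model (hypothetical, not a JS7 API) of runtime roles in a cluster."""

    # Which configured instance currently holds the active role.
    active: str = "primary"
    # Whether both instances' journals are synchronized.
    journals_in_sync: bool = True

    @property
    def standby(self) -> str:
        # Exactly one instance is active, the other one is in standby mode.
        return "secondary" if self.active == "primary" else "primary"

    def switch_over(self) -> bool:
        """Swap the runtime roles, but only if the journals are in sync."""
        if not self.journals_in_sync:
            return False
        self.active = self.standby
        return True
```

After a successful `switch_over()` the Secondary Controller Instance is the active one and the Primary Controller Instance is in standby mode, which is why the runtime roles are named independently of the configured roles.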

Cluster Watch Agent

Primary and Secondary Controller instances require a single Agent to be available that acts as an arbitrator in case of fail-over and switch-over.

Start-up of Controller Instances

  • On start-up both Primary and Secondary Controller instances connect to the Cluster Watch Agent.
    • The Cluster Watch Agent gives its vote

Failure of the Active Controller Instance


  • The Cluster Watch Agent knows immediately when the Active Controller Instance is down.
  • The Standby Controller Instance similarly holds a connection to the Active Controller Instance and knows immediately when this connection is interrupted.
  • At this point the Standby Controller Instance and the Cluster Watch Agent check whether they find “common ground”: both determine whether they consider the Active Controller Instance to be dead. After a very short period of 1-2s they proceed and cast their two votes on whether the Standby Controller Instance should now become the active one.
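The voting described above can be sketched as a small decision function. This is a simplified model under stated assumptions, not the Controller's implementation: the types `StandbyView` and `ClusterWatchView` and the function `should_fail_over` are hypothetical, and the journal condition is taken from the statement that fail-over occurs only if both journals are in sync.

```python
from dataclasses import dataclass


@dataclass
class StandbyView:
    """What the Standby Controller Instance observes (hypothetical model)."""

    lost_connection_to_active: bool
    journal_in_sync: bool


@dataclass
class ClusterWatchView:
    """What the Cluster Watch Agent observes (hypothetical model)."""

    lost_connection_to_active: bool


def should_fail_over(standby: StandbyView, watch: ClusterWatchView) -> bool:
    """Return True only if both voters agree and the journals are in sync."""
    # Each party contributes one vote if it considers the active instance down.
    votes = int(standby.lost_connection_to_active) + int(
        watch.lost_connection_to_active
    )
    # Two votes are needed: a single observer cannot trigger fail-over alone.
    return votes == 2 and standby.journal_in_sync
```

In this model a network problem that affects only the connection between the two Controller instances yields a single vote, so the standby instance does not take over while the Cluster Watch Agent still sees the active instance.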

Cluster Operations

Cluster operations include an automated fail-over and a manual switch-over of the Active Controller Instance.

Fail-over

Fail-over occurs when the Active Controller Instance is terminated abnormally. 

Fail-over can be invoked by the following actions:

  • The Active Controller Instance is killed, for example
    • for Unix with a SIGKILL signal corresponding to the command: kill -9
    • for Windows with the command: taskkill /F
  • The operating system crashes.
  • In the JS7 - Dashboard the user performs one of the operations: 
    • Active Controller Instance action menu: Abort -> With fail-over
    • Active Controller Instance action menu: Abort and restart -> With fail-over
  • From the command line the user performs one of the operations:
    • controller.sh | .cmd abort
    • controller.sh | .cmd kill

No fail-over occurs when

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service are in place to stop the Controller normally.

Fail-over happens within a short period of time, typically within 2-3s.
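The rules above reduce to one question: was the Active Controller Instance terminated abnormally? The following sketch encodes the listed cases as a lookup. The enum `Termination` and the function `triggers_fail_over` are hypothetical names used for illustration; the member values merely repeat the operations named in this section.

```python
from enum import Enum


class Termination(Enum):
    """Ways the Active Controller Instance can go down (illustrative only)."""

    KILL_SIGNAL = "kill -9 / taskkill /F"
    OS_CRASH = "operating system crash"
    ABORT = "abort from Dashboard or command line"
    KILL = "kill from command line"
    STOP = "stop from command line"
    OS_SHUTDOWN_WITH_SERVICE = "OS shutdown with systemd / init.d / Windows Service"


# Abnormal terminations: these are the cases that result in fail-over.
ABNORMAL = {
    Termination.KILL_SIGNAL,
    Termination.OS_CRASH,
    Termination.ABORT,
    Termination.KILL,
}


def triggers_fail_over(event: Termination) -> bool:
    """Return True if the given termination is abnormal and causes fail-over."""
    return event in ABNORMAL
```

For example, `triggers_fail_over(Termination.STOP)` evaluates to `False` because a normal stop is not an abnormal termination.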

Switch-over

Switch-over occurs exclusively when invoked by user intervention.

Switch-over can be invoked by the following actions:

  • In the JS7 - Dashboard the user performs one of the operations:
    • Active Controller Instance action menu: Terminate -> With switch-over
    • Active Controller Instance action menu: Terminate and restart -> With switch-over

No switch-over occurs when

  • the Active Controller Instance is stopped normally from the command line:
    • controller.sh | .cmd stop
  • the operating system is shut down and systemd / init.d or a Windows Service are in place to stop the Controller normally.

Switch-over happens within a short period of time, typically within 2-3s.

A Warning to Users trying to implement their own Clustering Mechanism

Users might be tempted to implement their own clustering with Standalone Controller Instances, for example

  • using tools for virtual machine management such as VMware®,
  • using Microsoft® Windows Server Cluster or similar cluster solutions.

The best advice is not to apply such clustering mechanisms. Reasons include but are not limited to the following issues:

  • The cluster has to guarantee that only one of the two instances is running at any point in time.
    • Should this rule not be observed, both Controller instances will instruct Agents to execute the same workflows and jobs, which results in double job execution.
    • Controller journals will be corrupted as they hold the same orders in different states.
    • The only solution is to drop both Controller instances' journals, which are available from the state sub-directory, to accept that any orders are lost and to redeploy scheduling objects.
  • There is no simple way to determine whether a Controller instance is in good enough shape to manage orders.
    • Performing PID file checks is of limited use: a failed check can prove that a Controller is unavailable, but a successful PID file check does not prove that a Controller instance is working.
    • Log file analysis is pointless. Controllers make heavy use of asynchronous operations when it comes to Agents. An error message in a log file does not mean that the situation will not be recovered within the next few seconds.
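To illustrate why a PID file check is of limited use, here is a minimal sketch of such a check for Unix. The function name `pid_file_indicates_running` and the file name used in the example are assumptions for illustration and do not reflect how a Controller stores its PID. The check relies on sending signal 0, which tests for the existence of a process without affecting it; do not use this approach on Windows, where `os.kill` behaves differently.

```python
import os
from pathlib import Path


def pid_file_indicates_running(pid_file: Path) -> bool:
    """Return True if the PID file names an existing process (Unix only).

    A False result shows that no such process exists. A True result only
    shows that some process with this PID exists, not that it is healthy.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        # Missing, unreadable or malformed PID file.
        return False
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

A call such as `pid_file_indicates_running(Path("controller.pid"))` can return `True` for a process that hangs, or for an unrelated process that has reused the PID, which is why this check is not a sound basis for a cluster decision.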


