...

Fail-over occurs automatically, as opposed to switch-over, which is triggered by user intervention.

Fail-over does not occur on normal shutdown of a Controller instance, as this does not indicate failure but, for example, a maintenance interval. Note that normal shutdown without fail-over similarly applies to server shutdown.

The term unavailability is used below. It covers a number of situations:

  • The Controller instance has crashed or was killed, or the underlying machine has crashed.
  • The Controller instance continues to run but is isolated in the network, which means that the pairing Controller instance and the active JOC Cockpit instance cannot connect to the instance. Network isolation can be a tricky source of fail-over if it occurs for a short period only.

...

This scenario includes situations in which a single Controller instance or the underlying machine becomes unavailable.

  • If the active Controller instance becomes unavailable then the standby Controller instance and the active JOC Cockpit Cluster Watch will determine to perform fail-over within a few seconds.
  • If the standby Controller instance becomes unavailable then the active Controller instance will continue to run without changes.
  • If a previously unavailable (as opposed to shut down) Controller instance is started after fail-over then it will take the standby role in the Controller Cluster. It will synchronize its journal from the active Controller instance and will re-establish the cluster.
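
The rules above can be read as a small decision function. The following is a minimal sketch only: the names `Observation`, `Action` and `decide` are hypothetical and not part of any JS7 API, and it assumes that fail-over is decided only when neither the standby Controller instance nor the Cluster Watch can connect to the active Controller instance.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    FAIL_OVER = "fail-over"


@dataclass(frozen=True)
class Observation:
    # All field names are hypothetical and used for illustration only.
    standby_can_connect: bool        # standby Controller reaches the active Controller
    cluster_watch_can_connect: bool  # active JOC Cockpit reaches the active Controller
    normal_shutdown: bool = False    # active Controller was shut down on purpose


def decide(obs: Observation) -> Action:
    """Return the action to take for the observed state of the active Controller."""
    # Normal shutdown indicates maintenance, not failure: no fail-over.
    if obs.normal_shutdown:
        return Action.NONE
    # Assumption: standby Controller and Cluster Watch both have to lose the connection.
    if not obs.standby_can_connect and not obs.cluster_watch_can_connect:
        return Action.FAIL_OVER
    return Action.NONE
```

For example, `decide(Observation(False, False))` yields `Action.FAIL_OVER`, whereas the same observation with `normal_shutdown=True` yields `Action.NONE`.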

...

This scenario includes situations in which a single Controller instance and a single JOC Cockpit instance, running on the same machine or on different machines, become unavailable.

  • If the active Controller instance and active JOC Cockpit instance become unavailable at the same point in time then
    • first the standby JOC Cockpit instance will become active. This includes that the Cluster Watch role moves to this instance. Fail-over of JOC Cockpit takes less than 30s. It can take slightly longer if there is a larger number of orders that have not been completed and for which the newly active JOC Cockpit instance has to re-read state transition events and log events from the remaining Controller instance. Observations indicate that 500 orders with roughly 2000 jobs can delay fail-over by approx. 60s.
    • next the newly active JOC Cockpit instance and the standby Controller instance agree on fail-over and the standby Controller instance becomes active within a few seconds.
  • If the standby Controller instance and standby JOC Cockpit instance become unavailable then this does not affect the active Controller instance and active JOC Cockpit instance.
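
The ordering described above can be illustrated with a short sketch. The function `failover_steps` and its role strings are hypothetical and serve illustration only. The sketch assumes that JOC Cockpit fail-over always precedes Controller fail-over, because the Cluster Watch role has to be taken over first.

```python
from typing import List


def failover_steps(lost_controller_role: str, lost_joc_role: str) -> List[str]:
    """Return the ordered fail-over steps; roles are "active" or "standby"."""
    steps = []
    # Step 1: the Cluster Watch role has to move before Controllers can fail over.
    if lost_joc_role == "active":
        steps.append("standby JOC Cockpit becomes active and takes the Cluster Watch role")
    # Step 2: the new Cluster Watch and the standby Controller agree on fail-over.
    if lost_controller_role == "active":
        steps.append("Cluster Watch and standby Controller agree on fail-over")
    # Losing standby instances requires no fail-over at all.
    return steps
```

Calling `failover_steps("standby", "standby")` returns an empty list, which matches the case in which the active instances are not affected.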

...

This scenario covers the situation in which all machines, each holding a Controller instance and a JOC Cockpit instance, become unavailable at the same point in time.

  • This situation can be considered a disaster as all redundant nodes are gone at the same point in time. This situation requires user intervention.
  • Reasons are as follows:
    • When the previously active Controller instance is started then it remembers having had this role and will ask the JOC Cockpit Cluster Watch for confirmation.
    • The newly started JOC Cockpit instance with the Cluster Watch role cannot confirm the Controller's request as it has no memory before the point in time of unavailability and does not know which Controller instance had the active role before the unavailability occurred. The Cluster Watch cannot simply confirm the claim of any Controller instance to become active as this claim can be wrong. For example, a Controller instance may have crashed some days earlier and fail-over to the then standby Controller instance occurred in the meantime. If both Controller instances are (re)started in this situation then the JOC Cockpit Cluster Watch can determine the Controller instance with the active role as it was a witness to the respective Controller instance's last crash or shutdown.
    • In this scenario, if all machines die at the same time, there is no fail-over between JOC Cockpit instances. This means that the Cluster Watch cannot act as an arbitrator and instead has to ask the user for confirmation. JOC Cockpit will show a red alarm bell to indicate that user intervention is required.
  • User confirmation includes consent that one of the Controller instances, as suggested by the JOC Cockpit Cluster Watch, should be considered lost. The remaining Controller instance will take the active role.
    • Before confirming, users should check that the Controller instance that is to be declared lost is in fact shut down.
    • If this check is missed and the lost Controller instance is in fact up and running and considers itself to have the active role then this can cause both Controller instances to become active and can result in double job execution. As a consequence the Controller Cluster has to be recreated and Agents have to be initialized.
  • Users confirm loss of the indicated Controller instance from the Dashboard view like this:

         [Screenshot: confirming loss of a Controller instance in the Dashboard view]
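
The reasoning about the Cluster Watch's memory can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: the class `ClusterWatch`, its methods and the Controller IDs are hypothetical. A Cluster Watch that has no memory of the active Controller instance cannot confirm any claim and has to ask the user.

```python
from enum import Enum
from typing import Optional


class Answer(Enum):
    CONFIRMED = "confirmed"
    REJECTED = "rejected"
    ASK_USER = "ask user"


class ClusterWatch:
    """Hypothetical model of the arbitrator role held by the active JOC Cockpit."""

    def __init__(self, last_known_active: Optional[str] = None):
        # None models a newly started JOC Cockpit instance without memory.
        self.last_known_active = last_known_active

    def confirm_active_claim(self, claimant: str) -> Answer:
        """Answer a Controller instance that claims the active role."""
        if self.last_known_active is None:
            # The claim might be wrong, so it cannot simply be confirmed.
            return Answer.ASK_USER
        if claimant == self.last_known_active:
            return Answer.CONFIRMED
        return Answer.REJECTED

    def user_confirms_loss(self, lost: str, remaining: str) -> None:
        """The user declares one instance lost; the remaining one takes the active role."""
        self.last_known_active = remaining
```

A usage example with the hypothetical IDs `controller-1` and `controller-2`:

```python
watch = ClusterWatch()  # restarted after the disaster, no memory
print(watch.confirm_active_claim("controller-1"))  # Answer.ASK_USER
watch.user_confirms_loss(lost="controller-1", remaining="controller-2")
print(watch.confirm_active_claim("controller-2"))  # Answer.CONFIRMED
```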

Further Resources

...