Recovery Strategies

Introduction

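Recovery involves the procedures required to re-establish the job scheduling service following a disaster. The main failure scenarios are failure of the database, of a server node, of the storage system and of the network. Each scenario is described below in terms of prevention, impact and recovery.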

Database Failure

Prevention
- Using a clustered database management system is an appropriate means of guaranteeing continuity of this service and preventing data loss.
- Setting up database replication allows the database service to be re-established more quickly and reduces data loss.
Impact
- JobScheduler stops working when its database is not available.
- JobScheduler can be configured either to terminate or to wait for the database to become available and then reconnect automatically.
- Individual transactions might be lost during failover in a clustered database environment.
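
The two behaviors described above (terminate, or wait and reconnect) can be illustrated with a minimal Python sketch. This is not JobScheduler's implementation or configuration. The function `wait_for_database`, its parameters and the `connect` callable are hypothetical names introduced for illustration only.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_database(
    connect: Callable[[], T],
    retry_delay: float = 5.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds and return its result.

    Hypothetical helper, not part of JobScheduler:
    - max_attempts=None models "wait until the database is available".
    - max_attempts=1 models "terminate": the first failure is re-raised.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            # Give up once the configured number of attempts is used up.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Otherwise pause before the next reconnect attempt.
            sleep(retry_delay)
```

A caller would pass its own connection function as `connect`. Whether an unlimited wait is acceptable depends on how long jobs may be delayed before an operator has to intervene.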
Recovery
- See "How does JobScheduler behave in the event of a database failure?"

Server Node Failure

Prevention
- JobScheduler supports Cluster Operation, either as a Passive Cluster or as an Active Cluster.
- In a clustered environment JobScheduler provides mechanisms for automated failover.
- More details are given in the section about Fault Tolerance.
Impact
- If a server node running a JobScheduler instance goes down in a clustered environment, other instances take over the load:
  - In a Passive Cluster a Backup JobScheduler continues to process jobs and job chains starting from the point where the Primary JobScheduler stopped.
  - In an Active Cluster other JobScheduler instances continue to process jobs and job chains in a similar way.
Recovery
- In a Passive Cluster the Backup JobScheduler has to be terminated normally and the Primary JobScheduler has to be started.
- In an Active Cluster a JobScheduler instance can be added at any point in time.

Storage Failure

Prevention
- Using a clustered storage system, e.g. network attached storage, with redundancy, e.g. redundant arrays of disks (RAID), prevents JobScheduler from being affected by such a failure.
Impact
- If storage is not available, e.g. due to disk space exhaustion, JobScheduler stops working.
- JobScheduler terminates if the storage system fails.
Recovery
- Re-establish the storage system. If data has been lost, restore it from backups that include the JobScheduler installation directory and its data directory.
  - The installation directory stores the files provided by the installer and a few configuration files that might have been edited manually, see the ./config directory with the files scheduler.xml, factory.ini and sos.ini.
  - The data directory stores the log files. Copies of the logs are available in the database if JobScheduler is configured accordingly.
- Restart JobScheduler.
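
The backup and restore of the installation and data directories mentioned above could be scripted along the following lines. This is a minimal sketch using Python's standard `tarfile` module. The function names `create_backup` and `restore_backup` are hypothetical, and the actual directory locations depend on the installation.

```python
import tarfile
from pathlib import Path


def create_backup(directories, archive_path):
    """Write the given directories into one gzip-compressed tar archive.

    Hypothetical helper. Each directory is stored under its own name,
    so the directories passed in must have distinct names.
    """
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as archive:
        for directory in directories:
            directory = Path(directory)
            archive.add(directory, arcname=directory.name)
    return archive_path


def restore_backup(archive_path, target_dir):
    """Extract a backup archive created by create_backup into target_dir."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Only extract archives from a trusted source.
        archive.extractall(target_dir)
    return target_dir
```

The archive should be written to storage that is independent of the system being protected. Otherwise a storage failure would affect the backup as well.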

Network Failure

Prevention
- Using redundant components for network operation (DHCP and DNS services, routers etc.) reduces the risk of network outages.
Impact
- JobScheduler continues to run in case of network failures. However, it might not be able to communicate:
  - with its database, which would cause the processing of jobs and job chains to stop.
  - with its Agents and with remote servers, which would prevent Cross-Platform Scheduling.
  - with other cluster members.
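
Before restarting, it can help to verify which of the communication partners listed above are reachable again. The following Python sketch only tests whether a TCP connection can be opened. It is not a JobScheduler tool, and the function names and the endpoint names in the usage note are hypothetical.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Check a mapping of name -> (host, port) and return name -> bool."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }
```

A call such as `check_endpoints({"database": ("db.example.com", 3306), "agent": ("agent.example.com", 4445)})` returns one boolean per endpoint. The hosts and ports here are placeholders, not defaults of any product. An open port only shows that the network path works, not that the service behind it is healthy.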
Recovery
- Restart JobScheduler.

For large organizations, planning for failover scenarios is more complex due to
- the involvement of business applications and
- the need for common failover of JobScheduler and other applications and services.
SOS provides Consulting Services to develop such plans together with customers.
