Introduction

Recovery involves the procedures required to re-establish the job scheduling service following a disaster. The main failure scenarios include

  • failure of the Database Management System,
  • failure of the Server Environment on which JobScheduler is operated,
  • failure of the Storage System,
  • failure of the Network System,
  • or any combination of the above.

Failure Scenarios

Failure of the Database Management System

  • Prevention
    • Using a clustered Database Management System is an appropriate means to guarantee continuity of this service and to prevent data loss.
    • Setting up database replication would allow the database service to be re-established in a more timely manner and would reduce data loss.
  • Impact
    • JobScheduler will stop working while its database is not available.
    • JobScheduler can be configured to either terminate or to wait for the database to become available and to automatically reconnect.
    • Individual transactions might be lost during failover in a clustered database environment.
  • Recovery
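
The wait-and-reconnect behavior mentioned under Impact above can be pictured as a simple retry loop. The following is a minimal sketch only and does not show JobScheduler's actual implementation: the function name `wait_for_database`, the `connect` callable, and the 30-second retry interval are assumptions made for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_database(
    connect: Callable[[], T],
    retry_interval: float = 30.0,      # assumed value, not a JobScheduler default
    max_attempts: Optional[int] = None,  # None means "wait indefinitely"
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds and return the connection.

    `connect` is a hypothetical callable that raises ConnectionError
    while the database is unavailable.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            # "Terminate" policy: give up once the attempt limit is reached.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # "Wait" policy: pause, then try to reconnect.
            sleep(retry_interval)
```

Passing `max_attempts=1` corresponds to the "terminate" option, while leaving it at `None` corresponds to waiting until the database becomes available again.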

Failure of the Server Environment

  • Prevention
  • Impact
    • If a server node with a JobScheduler instance is down then in a clustered environment other instances would take over the load:
      • In a Passive Cluster a Backup JobScheduler would continue to process jobs and job chains starting from the point where the Primary JobScheduler stopped.
      • In an Active Cluster other JobScheduler instances would continue to process jobs and job chains in a similar way.
  • Recovery
    • In a Passive Cluster the Backup JobScheduler has to be terminated normally and the Primary JobScheduler has to be started.
    • In an Active Cluster a JobScheduler instance can be added at any point in time.

Failure of the Storage System

  • Prevention
    • Use of a clustered storage system, e.g. network attached storage, with redundancy, e.g. redundant arrays of disks (RAID), would prevent JobScheduler from being affected by such a failure.
  • Impact
    • If storage is not available, e.g. due to disk space exhaustion, then JobScheduler will stop working.
    • JobScheduler will terminate in case of a failure of the storage system.
  • Recovery
    • Re-establish the storage system. Should data have been lost, restore it from backups that include the JobScheduler installation directory and its data directory.
      • The installation directory stores the files provided by the installer and a few configuration files that might have been edited manually, see the ./config directory with the files scheduler.xml, factory.ini and sos.ini.
      • The data directory stores the log files. Copies of the logs are available in the database if JobScheduler is configured accordingly.
    • Restart JobScheduler.
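
Because the manually edited configuration files are the part of the installation directory that cannot be recreated by re-running the installer, it can help to archive them separately. The following is a minimal sketch, assuming the three files named above live in a `config` subdirectory of the installation directory; the function name `backup_config` and the archive layout are illustrative assumptions, not a SOS-provided tool.

```python
import tarfile
from pathlib import Path
from typing import List, Union

# File names as listed in this article; adjust if an installation uses others.
CONFIG_FILES = ("scheduler.xml", "factory.ini", "sos.ini")


def backup_config(install_dir: Union[str, Path],
                  archive_path: Union[str, Path]) -> List[str]:
    """Write the known configuration files to a gzip-compressed tar archive.

    Returns the names of the files that were found and archived.
    Files that do not exist are skipped silently.
    """
    config_dir = Path(install_dir) / "config"
    archived: List[str] = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for name in CONFIG_FILES:
            source = config_dir / name
            if source.is_file():
                # Store under a relative path so the archive can be
                # unpacked into any installation directory.
                tar.add(str(source), arcname=f"config/{name}")
                archived.append(name)
    return archived
```

The archive should be written to a storage system other than the one holding the installation, otherwise it would be lost in the same failure.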

Failure of the Network System

  • Prevention
    • Use of redundant components for network operation (DHCP and DNS services, routers etc.) would reduce the risk of network outages.
  • Impact
    • JobScheduler will continue to run in case of network failures. However, it might not be able to communicate
      • with its database, which would cause the processing of jobs and job chains to stop,
      • with its Agents and with remote servers, which would prevent Cross-Platform Scheduling,
      • with other cluster members.
  • Recovery
    • Restart JobScheduler.
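
Before restarting, it can be useful to confirm that the database, the Agents and the other cluster members are reachable again. The following is a minimal sketch of a TCP reachability check; the function names and the host names and ports in the usage comment are hypothetical and must be replaced with the values of the actual environment.

```python
import socket
from typing import Dict, Tuple


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_endpoints(endpoints: Dict[str, Tuple[str, int]],
                    timeout: float = 3.0) -> Dict[str, bool]:
    """Check every named endpoint and return a name -> reachable mapping."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


# Example call (hypothetical hosts and ports):
# check_endpoints({
#     "database": ("db.example.com", 5432),
#     "agent": ("agent1.example.com", 4445),
# })
```

A successful TCP connection only shows that the port accepts connections; it does not prove that the service behind it is fully operational.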

Disaster Recovery Planning

  • For large organizations the planning for failover scenarios will be more complex due to
    • the involvement of business applications and
    • the need for common failover of JobScheduler and other applications and services.
  • SOS provides Consulting Services to develop such plans together with customers.

Resources

Frequently Asked Questions