Introduction

Recovery covers the procedures required to re-establish the job scheduling service following a disaster. The main failure scenarios are:

  • failure of the Database Management System,
  • failure of the Server Environment that JobScheduler is operated in,
  • failure of the Storage System,
  • failure of the Network System,
  • or any combination of the above.

Failure Scenarios

Failure of the Database Management System

  • Prevention
    • Using a clustered Database Management System is an appropriate means to guarantee continuity of this service and to prevent data loss.
    • Setting up database replication allows the database service to be re-established more quickly and reduces data loss.
  • Impact
    • JobScheduler will stop working if its database is not available.
    • JobScheduler can be configured to either terminate or to wait for the database to become available and to automatically reconnect.
    • Individual transactions might be lost during failover in a clustered database environment.
  • Recovery
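
The two behaviours listed under Impact (terminate, or wait for the database and reconnect automatically) can be pictured with a minimal Python sketch. This is an illustration of the general retry pattern only, not JobScheduler's implementation or configuration syntax; the function name connect_with_policy and all of its parameters are hypothetical.

```python
import time


def connect_with_policy(connect, wait_for_database, retry_interval=1.0,
                        max_attempts=None, sleep=time.sleep):
    """Call connect() and apply one of two policies when the database is down.

    All names here are hypothetical and only illustrate the pattern:
    - wait_for_database=False: give up at once (the "terminate" behaviour).
    - wait_for_database=True: keep retrying until the database is back
      (the "wait and reconnect" behaviour), optionally up to max_attempts.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return connect()
        except ConnectionError:
            if not wait_for_database:
                raise  # terminate policy: pass the failure on to the caller
            if max_attempts is not None and attempt >= max_attempts:
                raise  # retry budget exhausted
            sleep(retry_interval)  # wait before the next reconnect attempt
```

Here connect stands for any callable that opens a database connection and raises ConnectionError while the database is unavailable. With max_attempts left at None the loop waits indefinitely, which mirrors a scheduler that simply resumes once the database returns.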

Failure of the Server Environment

  • Prevention
  • Impact
    • If a server node running a JobScheduler instance goes down, then in a clustered environment other instances take over the load:
      • In a Passive Cluster a Backup JobScheduler continues to process jobs and job chains starting from the point where the Primary JobScheduler stopped.
      • In an Active Cluster the other JobScheduler instances continue to process jobs and job chains in a similar way.
  • Recovery
    • In a Passive Cluster the Backup JobScheduler has to be terminated normally and the Primary JobScheduler has to be started.
    • In an Active Cluster a JobScheduler instance can be added at any point in time.

Failure of the Storage System

  • Prevention
    • Using a clustered storage system (e.g. network attached storage) with redundancy (e.g. a redundant array of disks) prevents JobScheduler from being affected by such a failure.
  • Impact
    • If storage is not available, e.g. due to disk space exhaustion, then JobScheduler will stop working.
    • JobScheduler will terminate in case of failure of the storage system.
  • Recovery
    • Re-establish the storage system. Should data have been lost, restore it from backups that include the JobScheduler installation directory and its data directory.
      • The installation directory stores the files provided by the installer and a few configuration files that might have been edited manually, see the ./config directory with the files scheduler.xml, factory.ini and sos.ini.
      • The data directory stores the log files. Copies of the logs are available in the database if JobScheduler has been configured accordingly.
    • Restart JobScheduler.
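
Since disk space exhaustion is named above as one cause of storage becoming unavailable, a small monitoring check can give early warning. The following is a minimal sketch using Python's standard library; the function names, the 10% threshold and the example path are assumptions chosen for illustration and are not part of JobScheduler.

```python
import shutil


def free_space_ratio(path):
    """Return the fraction of free space (0.0 to 1.0) on the volume holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # treat an unreadable or empty volume as having no free space
    return usage.free / usage.total


def is_space_low(path, min_free_ratio=0.10):
    """Return True if less than min_free_ratio of the volume is free.

    The default of 10% is an arbitrary illustrative threshold.
    """
    return free_space_ratio(path) < min_free_ratio


# Example usage (the path is a placeholder for the JobScheduler data directory):
# if is_space_low("/path/to/jobscheduler/data"):
#     print("Warning: disk space is running low")
```

Such a check could be run periodically against both the installation directory and the data directory, so that space can be freed before JobScheduler is affected.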

Failure of the Network System

  • Prevention
    • Use of redundant components for network operation (DHCP and DNS services, routers etc.) would reduce the risk of network outages.
  • Impact
    • JobScheduler will continue to run in case of network failures. However, it might not be able to communicate
      • with its database, which would cause the processing of jobs and job chains to stop,
      • with its Agents and with remote servers, which would prevent Cross-Platform Scheduling,
      • with other cluster members.
  • Recovery
    • Restart JobScheduler.
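
Before restarting, it can help to confirm that the communication partners listed under Impact (database, Agents, other cluster members) are reachable again. The following is a minimal sketch of a TCP reachability check using Python's standard library; the function names are hypothetical, and the host names and ports in the usage comment are placeholders rather than values taken from any JobScheduler installation.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Check several named endpoints and return a dict of name -> reachable."""
    return {
        name: is_reachable(host, port, timeout)
        for name, (host, port) in endpoints.items()
    }


# Example usage (all hosts and ports below are placeholders):
# status = check_endpoints({
#     "database": ("db.example.com", 5432),
#     "agent": ("agent.example.com", 4445),
#     "cluster member": ("scheduler2.example.com", 4444),
# })
# for name, ok in status.items():
#     print(name, "reachable" if ok else "NOT reachable")
```

A successful TCP connection only shows that the port accepts connections; it does not prove that the service behind it is fully operational.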

Disaster Recovery Planning

  • For large organizations the planning for failover scenarios will be more complex due to
    • the involvement of business applications and
    • the need for common failover of JobScheduler and other applications and services.
  • SOS provides Consulting Services to develop this planning together with customers.
