Master/Agent Availability

Scope

JobScheduler includes a number of measures to improve Master/Agent availability.
- Agent Bundles can be used to compensate for the outage of a server that runs an Agent.
- Master/Agent Reconciliation allows continued execution of tasks in case of short-term connection loss.
- Master/Agent Recovery covers the measures that are supported after a Master Server Failure.

Agent Bundles

Feature

JobScheduler allows multiple Agents to be specified for a single Process Class.
- JobScheduler contacts Agents in round-robin mode (JS-1188):
  - The first Agent that is configured to execute jobs for the process class will be contacted.
  - If the first Agent is not available then the next Agent listed in the process class configuration will be contacted.
  - This procedure will be repeated until an Agent is found that can execute the job.
- Use cases for this scenario include:
  - all Agents running on different server nodes: the switch to the next available Agent implements a fail-over to the next server node.
  - a number of Agents running on the same server node: the switch to the next available Agent implements redundancy of Agents within a single server node.
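
The selection procedure described above can be sketched as follows. This is a minimal illustration, not JobScheduler code: the function select_agent, the availability check and the Agent URLs are hypothetical names introduced for this example only.

```python
from typing import Callable, Optional, Sequence


def select_agent(
    agents: Sequence[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first available Agent in configuration order, or None."""
    for agent in agents:
        if is_available(agent):
            return agent
    # No Agent is available right now; a caller would repeat the procedure later.
    return None


# Hypothetical Agent URLs, listed in the order of the process class configuration.
agents = ["http://agent-a:4445", "http://agent-b:4445", "http://agent-c:4445"]
down = {"http://agent-a:4445"}  # assumed outage of the first Agent

print(select_agent(agents, lambda url: url not in down))  # http://agent-b:4445
```

Because the list is always scanned from the start, the first available Agent receives every job, which is why this behavior provides fail-over and not load sharing.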
Delimitation
- This feature is not intended for load sharing as the JobScheduler will always use the first available Agent.
- This feature is not intended for scalability as it does not allow the execution of jobs in parallel on a number of Agents (clustering).
Feature Availability
- FEATURE AVAILABILITY STARTING FROM RELEASE 1.10

Issues

- JS-1188

Master / Agent Reconciliation

Scenario

Types of outages
- Connection Loss
  - a recoverable, temporary connection loss for a configurable period of time, e.g. 20s.
- Master Server Failure
  - an unrecoverable connection loss that lasts longer than the period specified for the Connection Loss scenario or
  - a JobScheduler Master restart or server restart.
Supported scenarios
- Master/Agent Reconciliation addresses the Connection Loss scenario, not the Master Server Failure scenario.

Feature

Reconciliation Scenario
- applies after a Connection Loss between Master and Agent.
- includes re-establishing the normal relationship between Master and Agent after an outage.
Agent Behavior
- By default an Agent will kill any running tasks immediately if the connection to the Master is lost, i.e. none of the above scenarios is supported (JS-1523). The reasons for this are:
  - If a Master were not available for a longer period then the Agent could not report back the execution history and log information for tasks. As a result, the Master would have no information on whether the job execution was successful.
  - The primary goal is to prevent duplicate execution of jobs. Without further information from a Master the respective Agent instance cannot know whether it will later be contacted for re-execution of the same job (which would allow a currently running task to continue on the Agent) or whether the Master will choose a different Agent (see Agent Bundles).
- If a Connection Loss setting is configured for the process class, the Agent behaves as follows (JS-1524):
  - During the period specified for the tolerated connection loss duration the Agent will assume the Connection Loss scenario.
  - The Agent will continue any running tasks up to the end of the tolerated connection loss period.
    - Reconciliation will take place if the connection between Master and Agent can be re-established during the connection loss period and if the Master has not been restarted.
    - Otherwise the Agent will assume the Master Server Failure scenario and kill any running tasks.
  - This behavior applies only to tasks that are executed for the specific Master to which the connection has been lost. Tasks for other JobScheduler Master instances will be continued.
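
The Agent behavior described above can be summarized as a small decision function. This is a sketch under stated assumptions: the names Action, agent_action and the parameters are hypothetical and do not correspond to JobScheduler settings or APIs. The function is meant to be evaluated per Master connection, so tasks of other Masters are not affected.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE_TASKS = "continue running tasks"
    RECONCILE = "reconcile with Master"
    KILL_TASKS = "kill running tasks"


def agent_action(
    tolerated_loss_s: Optional[float],  # None: no Connection Loss setting configured
    seconds_since_loss: float,
    reconnected: bool,
    master_restarted: bool,
) -> Action:
    """Decide what the Agent does with the tasks of one Master after a connection loss."""
    # Default behavior: without a tolerated period, tasks are killed immediately.
    if tolerated_loss_s is None:
        return Action.KILL_TASKS
    # Tolerated period exceeded: assume the Master Server Failure scenario.
    if seconds_since_loss > tolerated_loss_s:
        return Action.KILL_TASKS
    if reconnected:
        # Reconciliation requires that the Master has not been restarted.
        return Action.KILL_TASKS if master_restarted else Action.RECONCILE
    # Still within the tolerated period and not yet reconnected.
    return Action.CONTINUE_TASKS


print(agent_action(20, 5, reconnected=True, master_restarted=False).value)
# reconcile with Master
```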
Master/Agent Reconciliation
- After a connection loss the Master will regularly attempt to re-establish the HTTP connection to the Agent. This communication includes a "tunnel" that allows the Agent to report the execution status of running jobs to the Master.
- After a successful re-connect within the Connection Loss scenario the Master will repeat its request for execution of the respective jobs. Each new request includes an identifier for the previous execution request that allows the Agent to identify repeated requests:
  - for a job that has been completed within the tolerated connection loss period the Agent will report the execution result back to the Master.
  - for a job that is still running the Agent will report the appropriate information back to the Master, which will note the running tasks and update JOC accordingly.
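
How an Agent might use the identifier of the previous request to recognize a repeated request can be sketched as follows. The classes, method names and return strings are hypothetical and only illustrate the idea; they are not the actual Master/Agent protocol. Starting a new task when the previous identifier is unknown is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Task:
    job: str
    exit_code: Optional[int] = None  # None while the task is still running


@dataclass
class AgentState:
    tasks: Dict[str, Task] = field(default_factory=dict)

    def handle_request(
        self,
        job: str,
        request_id: str,
        previous_request_id: Optional[str] = None,
    ) -> str:
        """Start a task, or report on the task started by a previous request."""
        known = self.tasks.get(previous_request_id) if previous_request_id else None
        if known is None:
            # First request (or unknown identifier, an assumption): start a new task.
            self.tasks[request_id] = Task(job)
            return "started"
        if known.exit_code is None:
            return "running"  # the Master notes the task as still running
        return f"completed with exit code {known.exit_code}"


agent = AgentState()
print(agent.handle_request("job1", "req-1"))           # started
print(agent.handle_request("job1", "req-2", "req-1"))  # running
agent.tasks["req-1"].exit_code = 0                     # task finishes during the outage
print(agent.handle_request("job1", "req-3", "req-1"))  # completed with exit code 0
```

The point of the identifier is that a repeated request never starts a second task for the same execution, which is consistent with the goal of preventing duplicate execution of jobs.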
Delimitation
- This feature is not intended to support a Master Server Failure scenario.
Feature Availability
- FEATURE AVAILABILITY STARTING FROM RELEASE 1.10.2


Master / Agent Recovery

Scenario

Master Server Failure
- an unrecoverable connection loss that lasts longer than the period specified for the Connection Loss scenario (see Master / Agent Reconciliation) or
- a JobScheduler Master restart or server restart.

Feature

After a Master Server Failure the JobScheduler Master can be started in paused mode.
- This start mode prevents all jobs from being started. This applies to:
  - jobs that have previously been requested for execution with Agents,
  - jobs that have been enqueued and
  - jobs that are scheduled for execution using start time events.
- All job starts that are delayed due to paused mode will be executed after the JobScheduler is continued.
  - This also applies to jobs that are enqueued while paused mode is active.
  - The operation to continue JobScheduler is available with JOC.
- Paused mode allows users to manually check the job history and optionally remove enqueued tasks if Agent Reconciliation has not taken place.
  - The Agent stores log files of jobs during execution. If an execution result cannot be reported to the Master then the log file will be retained, otherwise it will be removed (JS-1521).
  - Paused mode can be configured to be applied automatically in case of restart of a JobScheduler Master after failure (JS-1522).
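
The effect of paused mode on job starts can be modeled with a simple queue. This is an illustrative sketch only: the class PausedScheduler and its methods are hypothetical and do not represent the JobScheduler Master or the JOC operation.

```python
from typing import List


class PausedScheduler:
    """Toy model: job starts are held back while paused and released on resume."""

    def __init__(self, paused: bool = True) -> None:
        self.paused = paused
        self.delayed: List[str] = []  # job starts held back by paused mode
        self.started: List[str] = []  # job starts that were actually executed

    def request_start(self, job: str) -> None:
        # While paused, every start is delayed, including newly enqueued jobs.
        if self.paused:
            self.delayed.append(job)
        else:
            self.started.append(job)

    def remove_delayed(self, job: str) -> None:
        # Manual intervention after checking the job history.
        self.delayed.remove(job)

    def resume(self) -> None:
        # Continuing the scheduler executes all delayed starts in order.
        self.paused = False
        while self.delayed:
            self.started.append(self.delayed.pop(0))


scheduler = PausedScheduler()
for job in ("job_a", "job_b", "job_c"):
    scheduler.request_start(job)
scheduler.remove_delayed("job_b")  # e.g. job_b already ran on an Agent
scheduler.resume()
print(scheduler.started)  # ['job_a', 'job_c']
```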
Delimitation
- The currently supported measures include manual checking after failure. Automated recovery of the Master/Agent execution status after a Master Server Failure is subject to future improvements.
Feature Availability
- FEATURE AVAILABILITY STARTING FROM RELEASE 1.10.2
