Scope

  • JobScheduler Master and Agents check availability of the communication partner by regularly sending heartbeats. 
  • Heartbeats are sent via the HTTP connection that is established by the Master to the Agent. Bi-directional heartbeats make use of this connection.
    • The Agent receives HTTP POST requests from the Master and responds within a short time, independently of the completion of the command requested by the Master.
    • The Master keeps sending HTTP POST requests and accepting acknowledgements until the Agent sends the final response, i.e. after completion of a task.
  • This allows Master and Agent to check if a connection has been lost and if it can be re-established.
  • FEATURE AVAILABILITY STARTING FROM RELEASE 1.10.2

Related Features

JS-1523

JS-1524

Concepts

  • Heartbeat Period: 
    • The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
    • Default: 10s
  • Heartbeat Timeout: 
    • The overall timeout that determines if a connection is considered to be lost permanently.
    • Includes the heartbeat period and the delay after which the Master will send its heartbeat.
    • Default: 60s
  • Heartbeat Delay:
    • The additional time that the Master waits beyond the Heartbeat Period for the Agent's heartbeat to arrive.
    • Value: 2s
    • This is a fixed parameter and cannot be customized.

Use Case

Kill Tasks in case of Connection Loss

  • If the Agent receives no heartbeats from the Master within 60 seconds then the Agent will 
    • assume the connection to be lost and
    • kill any running tasks that have been requested by that Master.
    • This behavior is intended to prevent simultaneous duplicate execution of tasks by an Agent. 
  • If the Master receives no heartbeats from the Agent within the interval between 50 and 60 seconds then it will 
    • consider the task to be lost, e.g. assume that its request for task execution has not been received by the Agent, and assign the task an error state,
    • try to re-establish the connection to the Agent,
    • repeat the request for task execution if the connection to the Agent can be established.
  • In this situation the Agent will 
    • within a configurable grace period
      • continue any running tasks.
      • try to identify duplicate requests for task execution from the Master and drop duplicate requests if the task is running.
    • kill the running tasks if the grace period is exceeded.

Continue Tasks in case of Reconciliation

  • If the Master successfully re-connects to the Agent within the grace period then
    • running tasks will be continued and completed by the Agent.
    • the task status and execution result will be reported to the Master.
  • In case of reconciliation the task status, log information and execution result are available to the Master and are visible in JOC.

Behavior

Consider a connection between a Master and an Agent. The Master and the Agent will behave as follows:

  • If there is no connection loss:
    • the Master sends an HTTP request to the Agent
    • the Agent sends
      • a heartbeat to the Master after 10s if no other HTTP operation is executed on behalf of the Master.
      • an HTTP response when an operation is executed on behalf of the Master.
  • If the connection is lost after the Master has sent an HTTP request:
    • the Master waits 12s for the heartbeat from the Agent to arrive.
      • The Agent should answer with a heartbeat after 10s.
      • The Master waits an additional 2s as a safety margin; this is the Heartbeat Delay mentioned above.
    • If a heartbeat from the Agent arrives within 12s, any running tasks will be continued and completed by the Agent.
    • Otherwise, the Master repeats the HTTP request sent 12s earlier and keeps repeating it until the Agent is able to answer.
      • If the Agent answers before the 60s Heartbeat Timeout has elapsed, i.e. within 48s after the first repeat of the HTTP request, any running tasks will be continued and completed by the Agent.
      • If the Agent does not answer before the 60s Heartbeat Timeout has elapsed, i.e. within 48s after the first repeat of the HTTP request, the Master will kill any running tasks on the Agent.
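
The timing described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (10s + 2s = 12s initial wait, 60s - 12s = 48s remaining for repeated requests), not JobScheduler code. The names HeartbeatSettings, first_wait, retry_window and master_phase are hypothetical, and the handling of the exact boundary values (12s and 60s) is an assumption.

```python
from dataclasses import dataclass

# Heartbeat Delay: fixed at 2 seconds according to the Concepts section.
HEARTBEAT_DELAY = 2


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for the two configurable values (in seconds)."""
    period: int = 10   # Heartbeat Period default
    timeout: int = 60  # Heartbeat Timeout default

    @property
    def first_wait(self) -> int:
        """Seconds the Master waits before it repeats the HTTP request."""
        return self.period + HEARTBEAT_DELAY

    @property
    def retry_window(self) -> int:
        """Seconds left for repeated requests before the timeout elapses."""
        return self.timeout - self.first_wait


def master_phase(elapsed: float,
                 settings: HeartbeatSettings = HeartbeatSettings()) -> str:
    """Classify the Master's state `elapsed` seconds after an unanswered request.

    Assumption: the boundaries are treated as half-open intervals.
    """
    if elapsed < settings.first_wait:
        return "waiting"      # still waiting for the Agent's heartbeat
    if elapsed < settings.timeout:
        return "repeating"    # repeating the HTTP request
    return "lost"             # connection considered lost


if __name__ == "__main__":
    defaults = HeartbeatSettings()
    print(defaults.first_wait, defaults.retry_window)
    for seconds in (5, 12, 59, 60):
        print(seconds, master_phase(seconds))
```

With the default values, first_wait evaluates to 12 and retry_window to 48, which matches the figures used in the list above.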

 

Configuration

  • The heartbeat settings can be configured with the Process Classes that specify the Agent connection. 
  • The configuration is stored with the Master; no configuration items are stored with the Agent.

Settings

  • Heartbeat Period: http_heartbeat_period
    • The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
    • Default: 10s
  • Heartbeat Timeout: http_heartbeat_timeout
    • The overall timeout that determines if a connection is considered to be lost permanently.
    • Includes the heartbeat period and the delay after which the Master will send its heartbeat.
    • Default: 60s

Example

keep-alive parameter
<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
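
As a hedged illustration, the following Python sketch reads the two heartbeat attributes from a Process Class like the one above and checks that the values are plausible. The function names read_heartbeat_settings and is_plausible are hypothetical. The check "timeout greater than period plus the 2s delay" is an assumption derived from the statement that the timeout includes the heartbeat period and the delay; it is not a rule documented for JobScheduler. Falling back to 10 and 60 for missing attributes mirrors the defaults listed under Settings.

```python
import xml.etree.ElementTree as ET

# Heartbeat Delay: fixed at 2 seconds according to the Concepts section.
HEARTBEAT_DELAY = 2

# The Process Class example from this page, as bytes so that the
# encoding declaration is handled by the XML parser.
PROCESS_CLASS_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
"""


def read_heartbeat_settings(xml_bytes):
    """Return a list of (agent_url, period, timeout) tuples, in seconds."""
    root = ET.fromstring(xml_bytes)
    settings = []
    for element in root.iter("remote_scheduler"):
        # Missing attributes fall back to the documented defaults.
        period = int(element.get("http_heartbeat_period", "10"))
        timeout = int(element.get("http_heartbeat_timeout", "60"))
        settings.append((element.get("remote_scheduler"), period, timeout))
    return settings


def is_plausible(period, timeout):
    """Assumed sanity check: the timeout must exceed period plus delay."""
    return timeout > period + HEARTBEAT_DELAY


if __name__ == "__main__":
    for url, period, timeout in read_heartbeat_settings(PROCESS_CLASS_XML):
        print(url, period, timeout, is_plausible(period, timeout))
```

A check of this kind could be run against Process Class files before they are deployed to the Master, since the configuration is kept on the Master side only.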

Delimitation

  • Connection heartbeats tend to render the use of keep-alive packets superfluous, see Connection Keep-Alive for Master and Agent.
  • Connection heartbeats are used to detect a connection loss and to re-establish a connection within a short time.
    • They are not intended to cover longer network outages.
    • They are not intended for recovery scenarios, i.e. both Master and Agent have to be up and running. If one of the components is restarted then this is considered a recovery scenario.
