...

  • If there is no connection loss:
    • the Master sends an HTTP request to the Agent
    • the Agent sends to the Master
      • a heartbeat after 10s if no other HTTP operation is executed on behalf of the Master in the meantime.
      • an HTTP response when an operation is executed on behalf of the Master.
  • If the connection is lost after the Master has sent a first HTTP request:
    • the Master waits 12s (= 10s Heartbeat Period + 2s Heartbeat Delay) for the heartbeat from the Agent to arrive.
    • If the heartbeat from the Agent arrives between 10s and 12s, any running tasks will be continued and completed by the Agent.
    • If the Master has not received the heartbeat from the Agent after 12s, the Master repeats the first HTTP request (sent 12s earlier) until the Agent is able to answer.
      • If the Agent answers before 60s have elapsed, that is, within 48s of the first repeated HTTP request, any running tasks will be continued and completed by the Agent. Even though the Master sent the HTTP request more than once, the tasks are executed just once.
      • If the Agent does not answer before 60s have elapsed, that is, within 48s of the first repeated HTTP request, the Master will kill any running tasks on the Agent.
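
The timing rules above can be sketched as a small decision function. This is only an illustration of the arithmetic (10s + 2s = 12s, 60s - 12s = 48s), not JobScheduler code: the names `HeartbeatSettings` and `master_reaction` are hypothetical, and treating an answer at exactly 12s as "in time" is an assumption, since the text does not define the boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatSettings:
    """Hypothetical container for the values quoted in the text."""
    period: int = 10   # Heartbeat Period in seconds
    delay: int = 2     # Heartbeat Delay in seconds
    timeout: int = 60  # Heartbeat Timeout in seconds

    @property
    def first_wait(self) -> int:
        """Seconds the Master waits before it repeats the HTTP request."""
        return self.period + self.delay

    @property
    def retry_window(self) -> int:
        """Seconds left for repeated requests before tasks are killed."""
        return self.timeout - self.first_wait


def master_reaction(answer_after: float,
                    settings: HeartbeatSettings = HeartbeatSettings()) -> str:
    """Classify the outcome by the time (in seconds since the first
    HTTP request) at which the Agent's answer reaches the Master."""
    if answer_after <= settings.first_wait:
        # Assumption: an answer at exactly first_wait still counts as in time.
        return "continue"
    if answer_after < settings.timeout:
        # The Master has repeated the request; tasks still run only once.
        return "repeat request, then continue"
    return "kill tasks"


if __name__ == "__main__":
    defaults = HeartbeatSettings()
    print(defaults.first_wait, defaults.retry_window)
    for seconds in (11, 30, 60):
        print(seconds, master_reaction(seconds))
```

With the default values the Master waits 12s and then has a 48s window for repeated requests, which matches the figures given in the list.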

...

  • If the Master successfully re-connects to the Agent within the grace period then
    • running tasks will be continued and completed by the Agent.
    • the task status and execution result will be reported to the Master.
  • In case of reconciliation the task status, log information and execution result are available for the Master and are visible with JOC.

Continue Tasks during longer Network Outages

  • Longer network outages are outages that last a few minutes or more.
  • Covering them requires increased values for Heartbeat Timeout and Heartbeat Period in the Process Class that is assigned to the job.
    • The Agent should get a long Heartbeat Timeout (longer than the expected network outage).
    • The Heartbeat Period should also be set higher than the default, since every failed connection attempt is logged by the Master and the log would otherwise grow unnecessarily large.
    • Example
      • To cover a network outage of 2 hours with the Master contacting the Agent every 2 minutes, specify: http_heartbeat_timeout="7200" http_heartbeat_period="120"
      • Higher values than this are accepted. Please consider that
        • the Heartbeat Period specifies the duration after which the result, e.g. the job execution history, is available with the Master. Higher values therefore delay delivery of the job execution history.
        • a high value of the Heartbeat Period must not cause your firewall to consider the connection between Master and Agent idle and to drop it.
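
The sizing of the two values can be sketched as follows. The helper names `long_outage_attributes` and `contact_attempts` are hypothetical, and the attempt count is a rough estimate that assumes one contact attempt per Heartbeat Period, as the 2-minute example suggests. The timeout is rounded up to a multiple of the period because the Configuration section requires that.

```python
def long_outage_attributes(outage_seconds: int, contact_every_seconds: int) -> dict:
    """Derive the two heartbeat attribute values for an expected outage."""
    if contact_every_seconds <= 0:
        raise ValueError("the Heartbeat Period must be positive")
    if outage_seconds < contact_every_seconds:
        raise ValueError("the outage should span at least one Heartbeat Period")
    # Ceiling division: round the timeout up to a whole number of periods.
    periods = -(-outage_seconds // contact_every_seconds)
    return {
        "http_heartbeat_timeout": str(periods * contact_every_seconds),
        "http_heartbeat_period": str(contact_every_seconds),
    }


def contact_attempts(outage_seconds: int, period_seconds: int) -> int:
    """Estimate how many contact attempts fall into the outage,
    assuming one attempt per Heartbeat Period."""
    return outage_seconds // period_seconds


if __name__ == "__main__":
    print(long_outage_attributes(7200, 120))
    print(contact_attempts(7200, 10))   # default period: 720 attempts
    print(contact_attempts(7200, 120))  # 2-minute period: 60 attempts
```

Under this assumption, a 2-hour outage produces about 720 logged attempts with the default period of 10s, but only about 60 with a period of 120s, which is why the longer period keeps the Master's log small.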

Configuration

  • The heartbeat settings can be configured with the Process Classes that specify the Agent connection. 
  • The configuration is located with the Master; no configuration items are stored with the Agent.

...

  • Heartbeat Period: http_heartbeat_period
    • The period after which the Agent sends a heartbeat to the Master if no other HTTP operation is executed on behalf of the Master.
    • Default: 10s
  • Heartbeat Timeout: http_heartbeat_timeout
    • The overall timeout that determines if a connection is considered to be lost permanently.
    • Includes the heartbeat period and the delay after which the Master will send its heartbeat.
    • The Heartbeat Timeout has to be a multiple of the Heartbeat Period
    • Default: 60s

Example

Heartbeat settings:
<?xml version="1.0" encoding="utf-8"?>
<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>
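
Before deploying a Process Class, the rule that the Heartbeat Timeout has to be a multiple of the Heartbeat Period can be checked with a short script. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The function name `check_heartbeat_settings` is hypothetical, and falling back to 10 and 60 for missing attributes is an assumption based on the defaults listed above.

```python
import xml.etree.ElementTree as ET

# The example configuration from this section (XML declaration omitted).
PROCESS_CLASS_XML = """<process_class>
    <remote_schedulers>
        <remote_scheduler remote_scheduler="http://127.0.0.2:4445" http_heartbeat_period="10" http_heartbeat_timeout="60"/>
    </remote_schedulers>
</process_class>"""


def check_heartbeat_settings(xml_text: str) -> list:
    """Return a list of problems found in the heartbeat attributes."""
    problems = []
    root = ET.fromstring(xml_text)
    for agent in root.iter("remote_scheduler"):
        url = agent.get("remote_scheduler", "<unknown>")
        # Assumption: missing attributes fall back to the documented defaults.
        period = int(agent.get("http_heartbeat_period", "10"))
        timeout = int(agent.get("http_heartbeat_timeout", "60"))
        if period <= 0:
            problems.append(f"{url}: Heartbeat Period must be positive")
        elif timeout % period != 0:
            problems.append(
                f"{url}: Heartbeat Timeout {timeout} is not a multiple "
                f"of Heartbeat Period {period}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_heartbeat_settings(PROCESS_CLASS_XML):
        print(problem)
```

For the example configuration the function returns an empty list, since 60 is a multiple of 10.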

...

  • Connection heartbeats tend to render the use of keep-alive packets superfluous, see Connection Keep-Alive for Master and Agent.
  • Connection heartbeats are used to detect a connection loss and to re-establish the connection within a short time.
    • They are not intended to cover longer network outages.
    • They are not intended for recovery scenarios, i.e. both Master and Agent have to be up and running. If one of the components is restarted then this is considered a recovery scenario.

...

Documentation

...