Introduction

There are a few well-known reasons, e.g. access to resources such as the file system and the database, that can cause JobScheduler to be blocked and to become unresponsive. In such a situation tasks might be stopped and it might not be possible to perform any operation on the JobScheduler Master from the JOC Cockpit. In some of these situations a restart is required.

How to detect that a Master is not responding?

One or more of the following observations indicate a blocked Master:

  • the Master is not accessible from the JOC Cockpit. The connection is reported as unreachable.
  • the Master does not perform any operation and does not write to its log file.
  • the Master itself detects that it is slow and logs such information to the ./logs/scheduler.log file:
    • Example:
      SCHEDULER-721 Scheduler is not responding quickly, a micro-step took 00:00:10.06s
  • in a JobScheduler Cluster the Master detects that its own heartbeats are late or that the heartbeats of a corresponding Master are late:
    • Example:
      TODO
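
The SCHEDULER-721 message shown above can be searched for automatically. The following is a minimal sketch that assumes log lines follow the format of the example message. The function name slow_micro_steps and the min_seconds parameter are hypothetical and not part of JobScheduler.

```python
import re

# Pattern derived from the single example message in this article;
# other variants of the message are not covered (assumption).
MICRO_STEP = re.compile(
    r"SCHEDULER-721\b.*?micro-step took (\d+):(\d{2}):(\d{2}(?:\.\d+)?)s"
)


def slow_micro_steps(lines, min_seconds=0.0):
    """Yield (line_number, seconds) for each SCHEDULER-721 message found."""
    for number, line in enumerate(lines, start=1):
        match = MICRO_STEP.search(line)
        if match is None:
            continue
        hours, minutes, seconds = match.groups()
        duration = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        if duration >= min_seconds:
            yield number, duration
```

To scan a log, pass an open file object for ./logs/scheduler.log as the lines argument. Each result gives the line number and the micro-step duration in seconds.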

What are the main reasons for this behavior?

Typical reasons why a JobScheduler Master is delayed include

  • a very slow database,
  • mount points in the file system that are blocked,
  • high server load, e.g. concerning memory and CPU.

Root Causes with Databases

The connection between JobScheduler and the database can be interrupted. An error such as "socket was closed by server" indicates that either the database server closed the connection or a network issue dropped it. A deadlock in the database can block the Master in a similar way.


You can use server monitoring to check the utilization of the machine and the continuous availability of the file system. Also, take a look at the database: if it has not been cleaned up for a long time, there could be a million records in the history tables that slow down database access.
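
One simple way to see whether database access is slow is to time a trivial query from the Master's host. The sketch below is an illustration only: it works with any Python DB-API connection and uses an in-memory SQLite database as a stand-in. The function name time_query is hypothetical, and the default statement "SELECT 1" is an assumption that does not work on every DBMS (Oracle, for example, requires "SELECT 1 FROM DUAL").

```python
import sqlite3
import time


def time_query(connection, sql="SELECT 1"):
    """Return the seconds needed to execute sql and fetch all rows."""
    started = time.monotonic()
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        cursor.fetchall()
    finally:
        cursor.close()
    return time.monotonic() - started


# Demonstration with an in-memory SQLite database (stand-in for the real DBMS).
demo = sqlite3.connect(":memory:")
elapsed = time_query(demo)
print(f"round trip took {elapsed * 1000:.1f} ms")
demo.close()
```

Running the same check with a row count on a history table would show whether large tables contribute to the delay. The table names depend on the installation and are not assumed here.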

Database blocking issues

Table Locks

The database is locked and cannot perform any operation on the affected tables. This occurs when processes, tasks or operations run for a long time, wait for data or resources, or are unable to add entries to the database. As a result the tables are blocked and not accessible.

Cluster Locks

If connectivity to the database is lost then the cluster is locked and cannot execute any operation. In addition, if the primary Master crashes then the backup Master is not started, as it cannot read the required information from the database due to the connectivity problem between Master and database.

Root Causes with the File System 

File Watching issues with failed UNC paths

  • This problem can be related to Windows shares using the CIFS protocol. If such shares are failing or require credentials to be accessed that are not available then a JobScheduler Master and Agent can become unavailable for the duration of the timeout that is active with the CIFS protocol, e.g. 60s and more.
  • For Linux environments the Master or Agent could be stuck in an “inotify” callback. The problem is not the fact that the resource (mount point) failed or is missing but that the Master or Agent is stuck in a callback and cannot return before some timeout (that is not controlled by the Master/Agent) is exceeded.
Workaround
Recommendation
  • Check shares and mount points for availability and accessibility.
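
The availability check recommended above can itself hang if a share is stuck, so it helps to run the probe with a timeout. The following is a minimal sketch, not a JobScheduler feature: the function name path_status and the default timeout of 5 seconds are assumptions. The probe runs in a daemon thread so that a blocked file system call cannot block the caller.

```python
import os
import threading


def path_status(path, timeout_s=5.0):
    """Return "ok", "error" or "timeout" for a share or mount point."""
    result = {}

    def probe():
        try:
            os.stat(path)  # may block for a long time on a failed share
            result["ok"] = True
        except OSError:
            result["ok"] = False

    # Daemon thread: a probe that never returns does not keep the program alive.
    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "timeout"
    return "ok" if result.get("ok") else "error"
```

A result of "timeout" corresponds to the situation described above in which the resource does not answer before the check gives up. A result of "error" means the path is missing or not accessible.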

Root Causes with Server Load

Memory

  • Server Memory
    • If server memory is low then a Master might not be able to extend its memory usage. If no memory is available then a Master might crash.
    • Additional memory might be required if a Master locally runs a high number of jobs in parallel. 
    • This behavior does not apply to Agents that will not increase memory consumption but will remain within 100MB if not otherwise configured.
  • Java Heap Space
    • TODO

CPU Load

  • High CPU load indicates a problem if it persists for a longer duration, i.e. > 120s.
  • Short periods (a few seconds) of high CPU usage are completely normal.
  • Each process tries to make full use of the CPU. In most cases processes start to wait for resources such as the file system or the network after a few milliseconds, which immediately reduces their CPU consumption. Therefore high CPU consumption is fine for a short duration and indicates a problem only for a longer duration.
Workaround
  • Identify processes with high CPU load and stop them.
Recommendation
  • Use server monitoring to check CPU utilization with respective thresholds.
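
The distinction between short spikes and sustained load can be expressed as a simple rule on monitoring samples. The sketch below is an illustration under stated assumptions: the function name sustained_high_load and the 90% threshold are hypothetical, while the 120s duration is taken from the text above. It expects samples that a monitoring tool has already collected.

```python
def sustained_high_load(samples, threshold_percent=90.0, min_duration_s=120.0):
    """Return True if CPU usage stayed at or above the threshold
    for longer than min_duration_s.

    samples: iterable of (timestamp_seconds, cpu_percent), ordered by time.
    """
    high_since = None  # timestamp at which the current high-load period began
    for timestamp, cpu_percent in samples:
        if cpu_percent >= threshold_percent:
            if high_since is None:
                high_since = timestamp
            if timestamp - high_since > min_duration_s:
                return True
        else:
            # Load dropped below the threshold: the period is over.
            high_since = None
    return False
```

A spike of a few seconds never accumulates enough duration to trigger the rule, while an uninterrupted period above the threshold does.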