
Introduction

A number of well-known causes, e.g. problems accessing resources such as the file system or the database, can block JobScheduler and make it unresponsive. In such a situation tasks might be stopped and it might not be possible to perform any operation on the JobScheduler Master from the JOC Cockpit. In some of these situations a restart is required.

How to detect that a Master is not responding?

One or more of the following observations indicate a blocked Master:

  • the Master is not accessible from the JOC Cockpit: the connection is reported as unreachable.
  • the Master does not perform any operation and does not write to its log file.
  • the Master itself detects that it is slow and logs such information to the ./logs/scheduler.log file:
    • Example:
      SCHEDULER-721 Scheduler is not responding quickly, a micro-step took 00:00:10.06s
  • in a JobScheduler Cluster a Master detects that its own heartbeats are late or that the heartbeats of a corresponding Master are late:
    • Example:
      TODO
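The log-based check above can be automated by scanning ./logs/scheduler.log for SCHEDULER-721 messages. A minimal sketch; the message format is taken from the example above, and the hh:mm:ss duration parsing is an assumption:

```python
import re

# SCHEDULER-721 lines report micro-steps that took too long, e.g.:
#   SCHEDULER-721 Scheduler is not responding quickly, a micro-step took 00:00:10.06s
SLOW_STEP = re.compile(r"SCHEDULER-721 .*micro-step took (\d+):(\d+):(\d+(?:\.\d+)?)s")

def slow_micro_steps(log_text: str):
    """Return the duration in seconds of every slow micro-step found in the log."""
    durations = []
    for match in SLOW_STEP.finditer(log_text):
        h, m, s = match.groups()
        durations.append(int(h) * 3600 + int(m) * 60 + float(s))
    return durations

sample = "2023-01-01 12:00:00 SCHEDULER-721 Scheduler is not responding quickly, a micro-step took 00:00:10.06s"
print(slow_micro_steps(sample))  # [10.06]
```

Repeated hits within a short period are a stronger signal than a single slow step.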

What are the main reasons for this behavior?

Typical reasons why the JobScheduler Master is delayed include:

  • a very slow database,
  • mount points in the file system that are blocked,
  • high server load, e.g. concerning memory, CPU.

Root Causes with Databases

The connection between JobScheduler and the database was interrupted: a "socket was closed by server" message indicates that either the database server closed the connection or a network issue dropped it. This situation is also referred to as a deadlock in the database.
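A quick way to tell a dropped network path from a database-side problem is to probe the database port with a short timeout. A minimal sketch; host and port are placeholders for your database server:

```python
import socket

def database_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Probe TCP reachability of the database server without blocking
    indefinitely. A refused or timed-out connection points at the kind
    of network problem behind "socket was closed by server" errors."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example; host and port are placeholders for your database server:
print(database_reachable("127.0.0.1", 3306, timeout=2.0))
```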


You can use server monitoring to check the utilization of the machine and the continuous availability of the file system. Also take a look at the database: if it has not been cleaned up for a long time, there could be millions of records in the history tables that slow down database access.
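To judge whether such a cleanup is overdue you can count the rows in the history tables. A minimal sketch; the table names are illustrative assumptions to be checked against your actual schema, and the demo uses an in-memory SQLite database standing in for the real one:

```python
import sqlite3

# The table names are illustrative; check your actual JobScheduler schema.
HISTORY_TABLES = ["SCHEDULER_HISTORY", "SCHEDULER_ORDER_HISTORY"]

def history_row_counts(conn):
    """Count the rows of each history table to judge whether a cleanup is overdue."""
    counts = {}
    for table in HISTORY_TABLES:
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts

# Demo against an in-memory SQLite database standing in for the real one:
conn = sqlite3.connect(":memory:")
for table in HISTORY_TABLES:
    conn.execute(f"CREATE TABLE {table} (id INTEGER)")
conn.execute("INSERT INTO SCHEDULER_HISTORY VALUES (1), (2), (3)")
print(history_row_counts(conn))  # {'SCHEDULER_HISTORY': 3, 'SCHEDULER_ORDER_HISTORY': 0}
```

Against the production database you would open a connection with your usual driver instead of SQLite.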

Table Locks

The database is locked and cannot perform any operation on the affected tables. This typically occurs when processes, tasks or operations run for a long time, wait for data or resources, or cannot add entries to the database. As a result the tables are blocked and not accessible.

Cluster Locks

When connectivity to the database is lost, the cluster lock cannot be updated and the cluster cannot execute any operation. In addition, if the primary Master crashes, the backup Master is not started, because it cannot acquire the cluster lock from the database due to the connectivity problems between Master and database.

Root Causes with the File System 

File Watching issues with failed UNC paths

  • This problem can be related to Windows shares using the CIFS protocol. If such shares fail, or require credentials that are not available, then a JobScheduler Master or Agent can become unresponsive for the duration of the timeout that applies to the CIFS protocol, e.g. 60s and more.
  • For Linux environments the Master or Agent could be stuck in an “inotify” callback. The problem is not the fact that the resource (mount point) failed or is missing but that the Master or Agent is stuck in a callback and cannot return before some timeout (that is not controlled by the Master/Agent) is exceeded.
Recommendation
  • Check shares and mount points for availability and accessibility.
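Because a hung CIFS or NFS mount blocks any process that touches it, the availability check itself should be protected by a timeout. A minimal sketch that stats a path in a worker thread so the caller cannot be blocked for the protocol's full timeout; the path and timeout values are assumptions:

```python
import os
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def mount_point_responsive(path: str, timeout: float = 5.0) -> bool:
    """Stat a path in a worker thread so that a hung share (e.g. a dead
    CIFS or NFS mount) cannot block the caller for the protocol's own,
    much longer timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(os.stat, path)
    try:
        future.result(timeout=timeout)
        responsive = True
    except (FutureTimeout, OSError):
        responsive = False
    pool.shutdown(wait=False)  # do not wait for a possibly hung stat call
    return responsive

# Example; the path is a placeholder for a watched directory:
print(mount_point_responsive("/tmp", timeout=2.0))
```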

Root Causes with Server Load

Memory

  • Server Memory
    • If server memory is low then a Master might not be able to extend its memory usage. If no memory is available then a Master might crash.
    • Additional memory might be required if a Master locally runs a high number of jobs in parallel. 
    • This behavior does not apply to Agents, which do not increase their memory consumption but remain within 100 MB if not otherwise configured.
  • Java Heap Space
    • TODO
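A monitoring script can warn before the Master runs out of server memory. A minimal sketch for Linux that reads MemAvailable from /proc/meminfo; the 500 MB threshold is an assumption to be tuned for your server:

```python
from typing import Optional

def available_memory_mb() -> Optional[float]:
    """Read MemAvailable from /proc/meminfo (Linux only).

    Returns None when the file or the field is missing, e.g. on
    other platforms, rather than failing."""
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    kb = int(line.split()[1])  # value is reported in kB
                    return kb / 1024.0
    except OSError:
        pass
    return None

mem = available_memory_mb()
# The 500 MB threshold below is an assumption; tune it for your server.
if mem is not None and mem < 500:
    print(f"Low memory: only {mem:.0f} MB available for the Master")
```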

CPU Load

  • High CPU load is an indicator that something is wrong, at least if it persists for a longer duration, i.e. > 120s.
  • You might see short periods (a few seconds) of high CPU usage that are completely normal. 
  • Consider that each process tries to make full use of the CPU. In most cases, after a few milliseconds, processes start waiting for resources such as the file system or the network, which immediately reduces their CPU consumption. Therefore high CPU consumption for a short duration is fine, while for a longer duration it indicates a problem.
Workaround
  • Identify processes with high CPU load and stop them.
Recommendation
  • Use server monitoring to check CPU utilization with respective thresholds.
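The distinction between a short spike and sustained load can be checked with the load averages that Unix systems maintain over 1, 5 and 15 minutes. A minimal sketch; treating any average above the CPU count as high is an assumption to be adjusted for your environment:

```python
import os

def load_status():
    """Compare the 1-, 5- and 15-minute load averages to the CPU count
    (Unix only). A 1-minute spike is usually harmless; a raised 5- or
    15-minute average points at a sustained problem."""
    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    return {
        "spike": load1 > cpus,                       # short peaks are normal
        "sustained": load5 > cpus or load15 > cpus,  # investigate these
    }

print(load_status())
```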



