What are the most prominent root causes that block JobScheduler?

Introduction

Several conditions can cause the JobScheduler Master to block and stop responding. While it is blocked, running processes are stopped, no operation can be performed on the JobScheduler, and the Master has to be restarted to become operational again.

How to detect that a Master is not responding?

The JobScheduler hangs and cannot perform any operation. All running processes are stopped or fail because the JobScheduler has crashed. After the JobScheduler timeout is exceeded, the following error is logged to scheduler.log:

SCHEDULER-721 Scheduler is not responding quickly, a micro-step took 00:00:10.06s
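To find such messages without reading the log by hand, a small script can scan scheduler.log for SCHEDULER-721 entries and extract the reported micro-step duration. The following is a minimal sketch, not an official tool: the regular expression is modeled only on the sample line above, and the function name `find_slow_micro_steps`, the threshold value, and the log path in the usage comment are assumptions made for illustration.

```python
import re

# Pattern modeled on the sample SCHEDULER-721 line shown above (assumption:
# the duration is always written as HH:MM:SS[.fraction] followed by "s").
MICRO_STEP = re.compile(
    r"SCHEDULER-721\b.*?micro-step took (\d+):(\d{2}):(\d{2}(?:\.\d+)?)s"
)


def find_slow_micro_steps(lines, threshold_seconds=10.0):
    """Return (duration_in_seconds, line) for each matching log line
    whose micro-step duration is at least threshold_seconds."""
    hits = []
    for line in lines:
        match = MICRO_STEP.search(line)
        if match is None:
            continue
        hours, minutes, seconds = match.groups()
        duration = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        if duration >= threshold_seconds:
            hits.append((duration, line.rstrip("\n")))
    return hits


# Usage (the path is a placeholder, adjust it to your installation):
#   with open("scheduler.log", encoding="utf-8", errors="replace") as log:
#       for duration, line in find_slow_micro_steps(log):
#           print(f"{duration:8.2f}s  {line}")
```

Running this regularly, for example from a monitoring job, gives an early hint that the Master is being slowed down before it stops responding completely.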

What are the main reasons for this behavior?

There are many reasons why the JobScheduler Master can be delayed: a (very) slow database, a mounted file system that has been blocked for a long time, load on the server (memory, CPU), etc.

JobScheduler tries to connect to an external program, but if the external program responds slowly, JobScheduler reaches its timeout and crashes. Increasing the timeout does not solve this: JobScheduler then waits for the connection to be re-established and blocks all processes until the timeout is reached.

The connection between JobScheduler and the database was interrupted. The message "socket was closed by server" indicates that either the database server closed the connection or a network issue dropped it. A deadlock in the database can show similar symptoms.
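A quick way to tell a network problem apart from a database problem is to test from the Master's host whether the database port accepts TCP connections at all. The sketch below uses only the Python standard library. The function name `can_connect` and the host and port in the usage comment are hypothetical, and a successful TCP connection only shows that the port is reachable, not that the database itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened
    within timeout seconds, otherwise False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


# Usage (host and port are placeholders for your database server):
#   if not can_connect("db.example.com", 3306):
#       print("database port is not reachable from this host")
```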

You can use server monitoring to check the utilization of the machine and the continuous availability of the file system. Also, take a look at the database: if it has not been cleaned up for a long time, there could be a million records in the history tables that slow down database access.
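If no monitoring tool is available, the availability of a mount point can be probed with a short script. The following sketch lists a directory in a separate thread and gives up after a timeout, so that a blocked share does not hang the probe itself. The function name `probe_path`, the timeout, and the path in the usage comment are assumptions for illustration only.

```python
import os
import threading
import time


def probe_path(path, timeout=5.0):
    """List a directory in a worker thread and report whether it answered
    within timeout seconds. Returns a dict with keys ok, elapsed and,
    on failure, error."""
    result = {}

    def worker():
        start = time.monotonic()
        try:
            os.listdir(path)
            result["ok"] = True
        except OSError as exc:
            result["ok"] = False
            result["error"] = str(exc)
        result["elapsed"] = time.monotonic() - start

    # A daemon thread is used so that a permanently blocked mount point
    # cannot keep the probing script alive.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return {"ok": False, "error": "timed out", "elapsed": timeout}
    return result


# Usage (the path is a placeholder for the watched mount point):
#   print(probe_path("/mnt/incoming", timeout=5.0))
```

A probe that times out while the share is nominally mounted points to the kind of blocked file system described above, rather than to a missing resource.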

File Watching issues with UNC paths

This problem is most probably related to Windows shares (CIFS) and to the fact that the Master is stuck in an “inotify” callback. The problem is not the fact that the resource (mount point) failed or is missing but that the Master is stuck in a callback and cannot return before some internal timeout (that is not controlled by the Master) is exceeded.

Workaround: Use Agents for file watching if mount points are not reliable.
For details, please refer to the knowledge base article JobScheduler Universal Agent - Remote File Watching.

Database blocking issues

Table Locks
The database is locked and no operations can be performed on its tables. This happens when processes, tasks, or operations run for a long time, wait for data or resources, or cannot add entries to the database. As a result, the tables are blocked and not accessible.

Cluster Locks
If connectivity to the database is lost, the cluster is locked and cannot execute any operation. In addition, after the primary Master has crashed, the backup Master is not started, because it cannot read from the database due to the connectivity problem between the Master and the database.
