How to handle errors in a job chain made up from shell jobs

Introduction

This example uses a simple job chain which starts shell jobs to demonstrate the different behaviors that can be configured for JobScheduler if an error occurs in one of the jobs.

In particular, the effect of the stop_on_error and on_error parameters is demonstrated along with the use of suspended orders and setbacks to retry running a job.

Downloads

shell_error.zip - configuration files

Instructions

Behavior with `stop_on_error="no"`

Unzip all files in the download into the ./config/live folder of your JobScheduler installation.
Open the JobScheduler Operating Center, JOC, in your browser using http://scheduler_host:scheduler_port
Open the JOB CHAINS tab and enable Show orders.
Find the job chain samples/shell_error/simple_error_chain.
Find the order simple_error_order, open the order menu and choose Start order now.

The order will now move through both nodes of the job chain. On the second node, an error will occur because the job's shell script contains exit 5. If the email settings of your JobScheduler are configured correctly, you will receive an error mail.
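
The configuration files from the download are not reproduced on this page, so the following is only a minimal sketch of what simple_chained_job2.job.xml might look like at this point. The exit 5 statement and stop_on_error="no" are taken from the description above; the echo text and the remaining layout are assumptions.

```xml
<!-- Hypothetical sketch of simple_chained_job2.job.xml; the file in shell_error.zip may differ. -->
<!-- stop_on_error="no": an error is blamed on the order, the job itself stays pending. -->
<job order="yes" stop_on_error="no">
    <script language="shell"><![CDATA[
echo illustrative message from the second job
exit 5
    ]]></script>
    <!-- The non-zero exit code above is what makes the step fail. -->
    <run_time/>
</job>
```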

Click on the second node's job (samples/shell_error/simple_chained_job2) to open the Job pane. You will see that the second job has the pending state. This means that the job can process further orders (although in this example they will all fail as long as exit 5 is specified). The error has been blamed on the order, and the order has been moved to the state configured as the error_state for the step in which the error happened. In the example, this is suspended. The error_state can also be used to configure error handling jobs; it need not point to a final state of the job chain.

If you change the exit code from exit 5 to exit 0 and click on the order menu, you will see that you can either resume the order or reset it:

resume will cause the order to rerun the second job,
reset will return the order to the start of the job chain so that it can be run again.

stop_on_error="no" is the default setting for jobs created with JOE. Its advantage is that a job is not blocked for all orders if a single order fails, for example because of a configuration error.

The error can also be blamed on the job, which will be described in the next section.

Behavior with `stop_on_error="yes"`

Edit the job configuration file simple_chained_job2.job.xml
If you have changed the exit code (which caused the error) to exit 0, change it back to exit 5 to simulate an error again
Change stop_on_error="no" to stop_on_error="yes" and save
Run the order again
Look at the order history

Note that the job state of the second job is now stopped. This means that the job will no longer process any orders. The order simple_error_order is now enqueued before the job. Other orders running into this job will also be enqueued.
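
With the change described above, the job file might look like the following sketch. Only the stop_on_error="yes" attribute and exit 5 come from the steps above; the echo text and layout are assumptions, and the file in the download may differ.

```xml
<!-- Hypothetical sketch of simple_chained_job2.job.xml after the edit. -->
<!-- stop_on_error="yes": an error is blamed on the job, which is then stopped. -->
<job order="yes" stop_on_error="yes">
    <script language="shell"><![CDATA[
echo illustrative message from the second job
exit 5
    ]]></script>
    <run_time/>
</job>
```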

(optional) open the job chain menu (right click over the 'simple_error_chain' chain) and choose add order (leave everything empty) to add another order to the job chain which will also be enqueued at the second job.
(optional) click on the job and choose unstop from the job menu which will appear on the right side of the interface. This will cause the job to retry processing the orders but as the error remains, the job will be stopped again.
Edit the job configuration file simple_chained_job2.job.xml
(Edit the file either in a code editor or with JOE, the JobScheduler Object Editor. Note that the job configuration can be read, but not edited, in JOC by opening the job context menu and selecting Show configuration.)
Change the exit 5 (which caused the error) to exit 0 and save the change.

Now choose unstop from the job menu in JOC's Jobs tab. The job will then take on the pending state. The next scheduled start for the order will be shown in green in the Job Chains tab.

Click on the job chain and then on Show order history on the right side of the interface.

You will see in the order history that processing of the order has ended.

This example used stop_on_error="yes" to blame the error on the job.

Suspending Orders

Another option in the event of an error is to suspend the order:

First of all, ensure that stop_on_error is set for both jobs to "no"
Then edit the job chain configuration file simple_error_chain.job_chain.xml:
- On the job_chain_node "next" (the node for the simple_chained_job2 job) add a new on_error="suspend" attribute and save
Run the order again
When the error now occurs, the order will be put back into the order queue of the second job but it will be suspended.
This means that the order will not run again, until somebody manually chooses "resume" from the order menu.
Fix the job - i.e. change exit 5 to exit 0
Choose "resume" from JOC's order menu
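
The edited job chain file could look roughly like the sketch below. The node state "next", the job name simple_chained_job2 and the on_error attribute are taken from this page; the other state names and the name of the first job are placeholders, since the downloaded file is not shown here.

```xml
<!-- Hypothetical sketch of simple_error_chain.job_chain.xml. -->
<!-- "first", "success", "error" and simple_chained_job1 are assumed names. -->
<job_chain>
    <job_chain_node state="first" job="simple_chained_job1"
                    next_state="next" error_state="error"/>
    <!-- on_error="suspend": a failed order stays at this node, suspended -->
    <job_chain_node state="next" job="simple_chained_job2"
                    next_state="success" error_state="error"
                    on_error="suspend"/>
    <job_chain_node state="success"/>
    <job_chain_node state="error"/>
</job_chain>
```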

Retry using "setback"

Alternative Example:

Note that there is also a dedicated example showing the use of setbacks: How to use setbacks to make a job retry in the event of an error

Another option is to configure automatic retries using "setback":

First of all, ensure that stop_on_error is set for both jobs to "no"
Then edit the simple_chained_job2.job.xml job configuration file
- Put exit 5 into the job again

Add the following lines after the script element:

<delay_order_after_setback setback_count="1" delay="20"/>
<delay_order_after_setback setback_count="3" delay="60"/>
<delay_order_after_setback setback_count="6" is_maximum="yes"/>

Save simple_chained_job2.job.xml
Edit the job chain configuration file simple_error_chain.job_chain.xml
On the job_chain_node "next" (the node for the simple_chained_job2 job) set the on_error attribute to "setback" and save
Run the order again

This time the order will run until the error occurs and will then be set back. The order is then enqueued at the second job with a new start time (20 seconds after the first error). Press update repeatedly to watch the time count down to the next start.

After the 6th time the order has encountered an error, it will be set to the error_state.

If the job is fixed during the retries, the order will go to the next_state.
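
Put together, the two edited files might look like the sketch below. The delay_order_after_setback elements, exit 5, the node state "next" and on_error="setback" are taken from the steps above; the echo text and the next_state and error_state names are assumptions.

```xml
<!-- File 1: hypothetical sketch of simple_chained_job2.job.xml -->
<job order="yes" stop_on_error="no">
    <script language="shell"><![CDATA[
echo illustrative message from the second job
exit 5
    ]]></script>
    <!-- from the 1st setback onwards: retry after 20 seconds -->
    <delay_order_after_setback setback_count="1" delay="20"/>
    <!-- from the 3rd setback onwards: retry after 60 seconds -->
    <delay_order_after_setback setback_count="3" delay="60"/>
    <!-- 6th setback: stop retrying, the order goes to the error_state -->
    <delay_order_after_setback setback_count="6" is_maximum="yes"/>
    <run_time/>
</job>

<!-- File 2: the node "next" in simple_error_chain.job_chain.xml -->
<!-- "success" and "error" are assumed state names. -->
<job_chain_node state="next" job="simple_chained_job2"
                next_state="success" error_state="error"
                on_error="setback"/>
```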

How it works

The main "switch" for controlling error handling of shell jobs is the stop_on_error attribute of a job. If stop_on_error is set to yes, the job is blamed for the error and is stopped. If stop_on_error is set to no, the order is blamed for the error. For more information on stop_on_error see http://www.sos-berlin.com/doc/en/scheduler.doc/xml/job.xml#attribute_stop_on_error

By default, if an order is blamed for an error, i.e. if stop_on_error is set to no, the order is moved to the error_state. This behavior can be changed at the job chain node with the on_error attribute. This can be set to "suspend" or "setback" and will cause the order to be either suspended or set back in the event of an error.

Note that this will only work for shell jobs when stop_on_error="no" is set for the job.

Jobs which use the JobScheduler API Interface may implement more sophisticated methods to choose whether an error is blamed on the job or on the order and how to handle errors that occur in orders.

Logging and Handling of Errors

Behavior up to and including Version 1.9

FEATURE AVAILABILITY ENDING WITH RELEASE 1.9

When a shell script that is executed within a job writes messages to the standard output and standard error channels (stdout and stderr), the JobScheduler treats both as info messages.

Logging:
- the JobScheduler writes these messages to the task's log with severity "info", e.g.
  2015-03-18 07:57:38.991+0100 [info] This message goes to stdout
  2015-03-18 07:57:38.993+0100 [info] This message goes to stderr
- The user cannot tell from the log whether output from the shell script was written to stdout or to stderr.

Error Handling:
- the JobScheduler handles the script execution as being successful if the exit code returned by the script is 0.
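
As an illustration, a job along the following lines would write the two messages shown in the log excerpt above. This is a sketch only: the job layout is assumed, and only the two message texts are taken from this page.

```xml
<!-- Hypothetical job that writes one line to stdout and one to stderr. -->
<job order="yes" stop_on_error="no">
    <script language="shell"><![CDATA[
echo This message goes to stdout
echo This message goes to stderr 1>&2
exit 0
    ]]></script>
    <run_time/>
</job>
```

Because the script ends with exit 0, the step counts as successful in these versions even though it wrote to stderr.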

Behavior with Version 1.10 and newer

FEATURE AVAILABILITY STARTING FROM RELEASE 1.10

Job error handling can optionally be extended to detect errors from output that is created by shell scripts.

Related issues: JS-1393, JS-1329

Behavior

Logging:
- Messages received via the standard error channel are added with error severity, e.g.
```
2015-03-18 07:57:38.993+0100 [error]  This message goes to stderr.
```
Error Handling:
- The JobScheduler raises an error for any output to stderr from the shell script. This behavior extends JS-1393, which identifies the output channel that has been used.
- Depending on the job settings and the job node settings the usual behavior for failed execution would apply, e.g. the job could be stopped, the order could be suspended, setback, etc.

Note that:

This option applies to shell jobs, not to API jobs.
It applies to jobs executed by JobScheduler instances, including clustered instances, and to JobScheduler Agents. An Agent forwards errors to the JobScheduler Master.

Configuration

Error handling for shell jobs is configured with the <job stderr_log_level="error|info"> job attribute.
- The value error causes JobScheduler to treat any output that a shell job writes to stderr as an error.
- The default value is info; with this value the JobScheduler does not raise an error for output to stderr.
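
A job using this attribute might be sketched as follows. The stderr_log_level attribute and its value are taken from the description above; the script content and the other attributes are assumptions made for illustration.

```xml
<!-- Hypothetical job for release 1.10 or newer. -->
<!-- stderr_log_level="error": output to stderr is logged as an error
     and causes the job to be treated as failed. -->
<job order="yes" stop_on_error="no" stderr_log_level="error">
    <script language="shell"><![CDATA[
echo This message goes to stdout
echo This message goes to stderr 1>&2
exit 0
    ]]></script>
    <run_time/>
</job>
```

With this setting the stderr line should appear in the task log with error severity and an error should be raised even though the exit code is 0. What happens next depends on stop_on_error and on the on_error setting of the job chain node, as described in the earlier sections.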

