How errors are handled in a job chain made up of shell jobs

Introduction

This example configures a job chain which starts shell jobs and shows what happens if an error occurs in one of the jobs.

Downloads

shell_error.zip - configuration files

Instructions

Unzip all files in the download into the ./config/live folder of your JobScheduler installation.
Open the JobScheduler Operating Center, JOC, in your browser using http://scheduler_host:scheduler_port
Open the JOB CHAINS tab and enable Show orders.
Find the job chain samples/shell_error/simple_error_chain.
Find the order simple_error_order, open the order menu and choose Start order now.

The order will now move through both nodes of the job chain. On the second node, an error will occur. If the email settings of your JobScheduler are configured correctly, you will receive an error mail.

Click on the second node job (samples/shell_error/simple_chained_job2, colored red) to open the Job pane. Note that the job state of the second job is now stopped. This means that the job will no longer process any orders. The order simple_error_order is now enqueued before the job. Other orders running into this job will also be enqueued.

(Optional) Open the job chain menu (right-click the simple_error_chain chain) and choose add order, leaving all fields empty. This adds another order to the job chain, which will also be enqueued at the second job.
(Optional) Click on the job and choose unstop from the job menu, which appears on the right side of the interface. The job will retry processing the orders, but as the error remains, it will be stopped again.
Edit the job configuration file simple_chained_job2.job.xml
(Edit the file either in a code editor or with JOE, the JobScheduler Object Editor. The job configuration can be read, but not edited, in JOC by opening the job context menu and selecting Show configuration.)
Change the exit 5 (which caused the error) to exit 0 and save the change.

Once the JobScheduler has noticed the change in the configuration file, it will update the job definition and unstop the job automatically. The order(s) will then be able to run successfully through the job.
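For orientation, the corrected job file might look roughly like the sketch below. This is an assumption about the layout, not a copy of the file in shell_error.zip: only the stop_on_error attribute and the exit statement are taken from the steps above, and the echo line is purely illustrative.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch of simple_chained_job2.job.xml after the fix; the file in the download may differ -->
<!-- stop_on_error="yes": a non-zero exit code is blamed on the job, which is then stopped -->
<job order="yes" stop_on_error="yes">
    <script language="shell"><![CDATA[
echo "second job in the chain"
exit 0
    ]]></script>
    <run_time/>
</job>
```

With exit 5 in place of exit 0, the same file reproduces the stopped job described above.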

Click on the job chain and enable Show order history on the right side of the interface.

In the order history, you will see that the order has ended in the success state.

In this example we blamed the error on the job. The error can also be blamed on the order:

Edit the job configuration file simple_chained_job2.job.xml
Change exit 0 back to exit 5 to simulate the error again
Change stop_on_error="yes" to stop_on_error="no" and save
On non-Windows systems, wait 60 seconds for the JobScheduler to notice the change (or, if your version of the interface supports this feature, check incl. hot folders and press update)
Run the order again
Look at the order history

This time, the order has ended in the error state and the job has not been stopped. The job can process further orders (although they will all fail in this example). The error has been blamed on the order, and the order has been moved to the state configured as error_state for the node at which the error happened. The error_state need not point to a final state of the job chain: it can also point to a node that runs an error handling job.
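An error_state that points to an error handling job can be sketched as follows. Here the node "next" sends failed orders to an additional node that runs a handler job before the chain ends. The state names first, handle_error, success and error, and the jobs simple_chained_job1 and error_handler_job, are hypothetical and chosen for illustration; only simple_chained_job2 and the node "next" come from this example.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch only: state names and error_handler_job are assumptions -->
<job_chain>
    <job_chain_node state="first" job="simple_chained_job1"
                    next_state="next" error_state="error"/>
    <!-- On error, the order goes to the handler node instead of a final state -->
    <job_chain_node state="next" job="simple_chained_job2"
                    next_state="success" error_state="handle_error"/>
    <!-- The handler job runs, then the order ends in the final error state -->
    <job_chain_node state="handle_error" job="error_handler_job"
                    next_state="error" error_state="error"/>
    <!-- Final states: nodes without a job -->
    <job_chain_node state="success"/>
    <job_chain_node state="error"/>
</job_chain>
```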

Another option is to suspend the order:

Edit the job chain configuration file simple_error_chain.job_chain.xml
On the job_chain_node "next", add the new attribute on_error="suspend" and save
Run the order again

When the error occurs now, the order is put back into the queue of the second job, but it is suspended. This means that the order will not run again until somebody manually chooses resume from the order menu.
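The changed node might then look like the sketch below. The values of next_state and error_state are assumptions, since the contents of simple_error_chain.job_chain.xml are not reproduced on this page; only the node "next", the job name and on_error="suspend" are taken from the steps above.

```xml
<job_chain>
    <!-- other nodes omitted in this sketch -->
    <job_chain_node state="next"
                    job="simple_chained_job2"
                    next_state="success"
                    error_state="error"
                    on_error="suspend"/>
</job_chain>
```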

Fix the job
Choose resume from the order menu

Another option is to configure automatic retries using "setback":

Edit the job configuration file simple_chained_job2.job.xml
Put exit 5 into the job again

Add the following lines after the script element:

<delay_order_after_setback setback_count="1" delay="20"/>
<delay_order_after_setback setback_count="3" delay="60"/>
<delay_order_after_setback setback_count="6" is_maximum="yes"/>

Save simple_chained_job2.job.xml
Edit the job chain configuration file simple_error_chain.job_chain.xml
On the job_chain_node "next" set the on_error attribute to "setback" and save
Run the order again

This time the order will run until the error occurs and will then be set back. The order is then enqueued at the second job with a new start time (20 seconds after the first error). Press update repeatedly to watch the order count down the time to its next start.

After the 6th time the order has encountered an error, it will be set to the error_state.

If the job is fixed during the retries, the order will go to the next_state.
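Put together, the job file for the setback variant might look like the sketch below. The three delay_order_after_setback lines are the ones given above; their position between the script and run_time elements, the order="yes" attribute and the empty run_time element are assumptions about the file in the download. stop_on_error is set to "no" because on_error only takes effect for shell jobs when the error is blamed on the order.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch of simple_chained_job2.job.xml for the setback variant; the real file may differ -->
<job order="yes" stop_on_error="no">
    <script language="shell"><![CDATA[
exit 5
    ]]></script>
    <!-- Retry schedule, copied from the instructions above -->
    <delay_order_after_setback setback_count="1" delay="20"/>
    <delay_order_after_setback setback_count="3" delay="60"/>
    <delay_order_after_setback setback_count="6" is_maximum="yes"/>
    <run_time/>
</job>
```

The matching change in the job chain file is the single attribute on_error="setback" on the node "next".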

How it works

The main "switch" for controlling error handling of shell jobs is the stop_on_error attribute of a job. If stop_on_error is set to yes, the job is blamed for the error and is stopped. If stop_on_error is set to no, the order is blamed for the error. For more information on stop_on_error see http://www.sos-berlin.com/doc/en/scheduler.doc/xml/job.xml#attribute_stop_on_error

By default, if an order is blamed for an error, it is moved to the error_state. This behavior can be changed at the job chain node with the on_error attribute, which can be set to "suspend" or "setback" to suspend or set back the order in case of errors. When using shell jobs, this only works if the job is set to stop_on_error="no".

Jobs which use the JobScheduler API may implement more sophisticated methods to choose whether an error is blamed on the job or on the order and how to handle an erroneous order.
