
Introduction

This example uses a simple job chain which starts shell jobs to demonstrate the different behaviors that can be configured for JobScheduler if an error occurs in one of the jobs.

In particular, the effect of the stop_on_error and on_error parameters is demonstrated along with the use of suspended orders and setbacks to retry running a job.

Downloads

Instructions

Behavior with stop_on_error="no"

  • Unzip all files in the download into the ./config/live folder of your JobScheduler installation.
  • Open the JobScheduler Operating Center, JOC, in your browser using http://scheduler_host:scheduler_port
  • Open the JOB CHAINS tab and enable Show orders.
  • Find the job chain samples/shell_error/simple_error_chain.
  • Find the order simple_error_order, open the order menu and choose Start order now.

The order will now move through both nodes of the job chain. On the second node, an error will occur because exit 5 is included in the job's shell script. If the email settings of your JobScheduler are configured correctly, you will now receive an error mail.
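As a rough illustration, the configuration of the second job might look something like the following. This is a sketch, not the file shipped in the download: only the stop_on_error attribute and the exit 5 statement come from the text above, while the echo line, order="yes" and the empty run_time element are assumptions.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch of simple_chained_job2.job.xml; the file in the download may differ -->
<job order="yes" stop_on_error="no">
    <script language="shell"><![CDATA[
echo "simple_chained_job2 is running"
exit 5
    ]]></script>
    <run_time/>
</job>
```

Any non-zero exit code is treated as an error here; 5 is simply the value used in this example.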

Click on the second node job (samples/shell_error/simple_chained_job2) to open the Job pane. You will see that the second job has the pending state. This means that the job can process further orders (although in this example, they will all fail as long as exit 5 is specified). The error has been blamed on the order, and the order has been moved to the state that was configured as error_state for the step in which the error happened. In the example, this is suspended. The error_state can also be used to configure error handling jobs; it need not point to a final state of the job chain.
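The relationship between a node and its error_state can be sketched as a job chain configuration along the following lines. This is an assumed layout rather than the file from the download: the node name next and the job name simple_chained_job2 appear in this article, whereas the first job's name and the state names first, success and error are placeholders.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch of simple_error_chain.job_chain.xml; state names other than "next" are placeholders -->
<job_chain>
    <!-- First node: on success the order moves on to the node "next" -->
    <job_chain_node state="first" job="simple_chained_job1"
                    next_state="next" error_state="error"/>
    <!-- Second node: runs the job that fails with exit 5 -->
    <job_chain_node state="next" job="simple_chained_job2"
                    next_state="success" error_state="error"/>
    <!-- End nodes carry no job -->
    <job_chain_node state="success"/>
    <job_chain_node state="error"/>
</job_chain>
```

If error_state pointed to a node that has a job attribute, that job would act as an error handler for the step.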

If you change the exit code from exit 5 to exit 0 and click on the order menu, you will see that you can either resume the order or reset it:

  • resume will cause the order to rerun the second job,
  • reset will return the order to its initial state so that it can be re-run from the start of the job chain.

stop_on_error="no" is the default setting for jobs created with JOE. It has the advantage that a job is not blocked for all orders if a single order fails, for example because of a configuration error.

The error can also be blamed on the job, which will be described in the next section.

Behavior with stop_on_error="yes"

  • Edit the job configuration file simple_chained_job2.job.xml
  • If you have changed the exit code (which caused the error) to exit 0, change it back to exit 5 to simulate an error again
  • Change stop_on_error="no" to stop_on_error="yes" and save
  • Run the order again
  • Look at the order history
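After these steps, the job element would differ from the default only in the value of stop_on_error. A minimal sketch, in which order="yes" and the empty run_time element are assumptions:

```xml
<!-- Sketch: stop_on_error="yes" blames the job, so the job is stopped when the script fails -->
<job order="yes" stop_on_error="yes">
    <script language="shell"><![CDATA[
exit 5
    ]]></script>
    <run_time/>
</job>
```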

Note that the job state of the second job is now stopped. This means that the job will no longer process any orders. The order simple_error_order is now enqueued before the job. Other orders running into this job will also be enqueued.

  • (optional) open the job chain menu (right click over the 'simple_error_chain' chain) and choose add order (leave everything empty) to add another order to the job chain which will also be enqueued at the second job.
  • (optional) click on the job and choose unstop from the job menu which will appear on the right side of the interface. This will cause the job to retry processing the orders but as the error remains, the job will be stopped again.
  • Edit the job configuration file simple_chained_job2.job.xml
    (Edit the file either in a code editor or with JOE, the JobScheduler Object Editor. Note that the job configuration can be read, but not edited, in JOC by opening the job context menu and selecting Show configuration.)
  • Change the exit 5 (which caused the error) to exit 0 and save the change.

Now use the Job Menu in JOC's Job Tab to unstop the job, which will then take on the status pending. The next scheduled start for the order will be shown in green in the Job Chain tab.

  • Click on the job chain and then Show order history on the right side of the interface

You will see in the order history that processing of the order has ended.

This example has used stop_on_error="yes" to blame the error on the job.

Suspending Orders

Another option in the event of an error is to suspend the order:

  • First of all, ensure that stop_on_error is set to "no" for both jobs
  • Then edit the job chain configuration file simple_error_chain.job_chain.xml:
    • On the job_chain_node "next" (the node for the simple_chained_job2 job), add a new on_error="suspend" attribute and save
  • Run the order again
  • When the error now occurs, the order will be put back into the order queue of the second job but it will be suspended.
    This means that the order will not run again until somebody manually chooses "resume" from the order menu.
  • Fix the job - i.e. change exit 5 to exit 0
  • Choose "resume" from JOC's order menu
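The change to the job chain described in the steps above might look roughly like this. Only the node name next, the job name and the on_error attribute are taken from this article; the next_state and error_state values are placeholders.

```xml
<!-- Sketch of the node "next" in simple_error_chain.job_chain.xml -->
<!-- on_error="suspend" keeps a failed order at this node in the suspended state -->
<job_chain_node state="next" job="simple_chained_job2"
                next_state="success" error_state="error"
                on_error="suspend"/>
```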

Retry using "setback"

Alternative Example:

Note that we also have a dedicated example showing the use of setbacks: How to use setbacks to make a job retry in the event of an error

Another option is to configure automatic retries using "setback":

  • First of all, ensure that stop_on_error is set to "no" for both jobs
  • Then edit the simple_chained_job2.job.xml job configuration file
    • Put exit 5 into the job again
  • Add the following lines after the script element:

  • Save simple_chained_job2.job.xml
  • Edit the job chain configuration file simple_error_chain.job_chain.xml
  • On the job_chain_node "next" (the node for the simple_chained_job2 job) set the on_error attribute to "setback" and save
  • Run the order again

This time the order will run until the error occurs and will then be set back. The order is then enqueued at the second job with a new start time (just 20 seconds after the error). Press update repeatedly to see the order count down the time to the next start.

After the 6th time the order has encountered an error, it will be set to the error_state.

If the job is fixed during the retries, the order will go to the next_state.
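The lines to be added to the job are not reproduced in this article. A sketch that is consistent with the behavior described (a delay of 20 seconds, and the error_state being reached on the 6th error) could use delay_order_after_setback elements as shown below. The exact values, in particular setback_count="5" for the maximum, are assumptions and should be checked against the download.

```xml
<!-- Sketch for simple_chained_job2.job.xml: place after the script element -->
<!-- From the first setback onwards, wait 20 seconds before retrying -->
<delay_order_after_setback setback_count="1" is_maximum="no" delay="20"/>
<!-- Allow at most 5 setbacks; a further error moves the order to the error_state -->
<delay_order_after_setback setback_count="5" is_maximum="yes"/>
```

The matching change in simple_error_chain.job_chain.xml would then be, with next_state and error_state values again being placeholders:

```xml
<!-- Sketch of the node "next": on_error="setback" triggers the retries configured in the job -->
<job_chain_node state="next" job="simple_chained_job2"
                next_state="success" error_state="error"
                on_error="setback"/>
```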

How it works

The main "switch" for controlling error handling of shell jobs is the stop_on_error attribute of a job. If stop_on_error is set to yes, the job is blamed for the error and is stopped. If stop_on_error is set to no, the order is blamed for the error. For more information on stop_on_error see http://www.sos-berlin.com/doc/en/scheduler.doc/xml/job.xml#attribute_stop_on_error

By default, if an order is blamed for an error, i.e. if stop_on_error is set to no, the order is moved to the error_state. This behavior can be changed at the job chain node with the on_error attribute. This can be set to "suspend" or "setback" and will cause the order to be either suspended or set back in the event of an error.

  • Note that this will only work for shell jobs when stop_on_error="no" is set for the job.

Jobs which use the JobScheduler API Interface may implement more sophisticated methods to choose whether an error is blamed on the job or on the order and how to handle errors that occur in orders.

Logging and Handling of Errors

Behavior up to and including Version 1.9

FEATURE AVAILABILITY ENDING WITH RELEASE 1.9

When a shell script is executed within a job and when this script writes messages to the standard output and error channels (stdout and stderr) then the JobScheduler treats these as info messages.

  • Logging:
    • the JobScheduler writes these messages to the task's log with severity "info", e.g.
      2015-03-18 07:57:38.991+0100 [info] This message goes to stdout
      2015-03-18 07:57:38.993+0100 [info] This message goes to stderr
    • The user cannot tell from the log whether output from the shell script has been written to stdout or to stderr.
  • Error Handling:
    • the JobScheduler handles the script execution as being successful if the exit code returned by the script is 0.

Behavior with Version 1.10 and newer

FEATURE AVAILABILITY STARTING FROM RELEASE 1.10

Job error handling can optionally be extended to detect errors from output that is created by shell scripts.

JS-1393 - Identify output channel in JobScheduler logs Released

JS-1329 - Check stderr for errors in shell script execution Released

Behavior

  • Logging:
    • Messages received via the standard error channel are added with error severity, e.g.
      2015-03-18 07:57:38.993+0100 [error]  This message goes to stderr.
  • Error Handling:
    • The JobScheduler raises an error for any output to stderr from the shell script. This behavior extends JS-1393, which identifies the output channel that has been used.
    • Depending on the job settings and the job node settings the usual behavior for failed execution would apply, e.g. the job could be stopped, the order could be suspended, setback, etc.

Note that:

  • This option applies to shell jobs, not to API jobs.
  • It applies to jobs executed by JobScheduler instances, including clustered instances, and to JobScheduler Agents. An Agent forwards errors to the JobScheduler Master.

Configuration

  • Error handling for shell jobs is configured with the <job stderr_log_level="error|info"> job attribute.
    • The value error causes any output that a shell job writes to stderr to be treated by JobScheduler as an error.
    • The default value is info and causes the JobScheduler not to raise an error.
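A job using this setting might be sketched as follows. The stderr_log_level attribute and the two messages are taken from this article; the other attributes and the Unix-style redirection >&2 are assumptions.

```xml
<!-- Sketch: with stderr_log_level="error" the second echo raises an error although the exit code is 0 -->
<job order="yes" stop_on_error="no" stderr_log_level="error">
    <script language="shell"><![CDATA[
echo "This message goes to stdout"
echo "This message goes to stderr" >&2
exit 0
    ]]></script>
    <run_time/>
</job>
```

With stderr_log_level="info", or with the attribute omitted, the same script would be treated as successful.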

Change Management References

  • JS-1393 (Feature, Major): Identify output channel in JobScheduler logs. Linked issues: JS-1329. Fix version: 1.10. Status: Released. Updated: Dec 16, 2015.
  • JS-1329 (Feature, Major): Check stderr for errors in shell script execution. Linked issues: JS-1393, JS-1615, JOE-166. Fix version: 1.10. Status: Released. Updated: Apr 12, 2016.