IBM Platform LSF 9.1.2 Readme File

Abstract:

When one job depends on the result of another job and the dependency condition is never satisfied, the dependent job never runs and remains in the system.

Readme file for: IBM® Platform LSF

Product/Component Release: 9.1.2

Update Name: Fix Pack

Fix ID: lsf-9.1.2-build229753

Publication date: 2014-02-07

Last modified date: 2014-02-07

Contents:

1.     Description

2.     Download location

3.     Products or components affected

4.     System requirements

5.     Known issues

6.     Known limitations

7.     Installation prerequisites

8.     Installation and configuration

9.     List of fixes

10.     List of files

11.  Copyright and trademark information

1.   Description

Often, complex workflows are required with job dependencies for proper job sequencing as well as job failure handling. For a given job, called the parent job, there can be child jobs which depend on its state before they can start. If one or more conditions are not satisfied, a child job remains pending. However, if the parent job is in a state such that the event on which the child depends will never occur, the child becomes an orphan job. For example, if a child job has a DONE dependency on the parent job but the parent ends abnormally, the child will never run as a result of the parent's completion and becomes an orphan job.
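
The rule described above can be sketched as a small Python model. This is an illustrative simplification, not LSF's implementation: the JobState enum and the can_never_be_satisfied helper are hypothetical names introduced here, and only single-parent done, exit, and ended conditions are covered.

```python
from enum import Enum


class JobState(Enum):
    """Simplified job states (hypothetical model, not an LSF API)."""
    PEND = "PEND"
    RUN = "RUN"
    DONE = "DONE"   # parent finished normally
    EXIT = "EXIT"   # parent ended abnormally or was killed


def can_never_be_satisfied(condition: str, parent_state: JobState) -> bool:
    """Return True if a child waiting on `condition` of its parent is an orphan."""
    if condition == "done":
        # A DONE dependency is impossible once the parent has exited abnormally.
        return parent_state is JobState.EXIT
    if condition == "exit":
        # An EXIT dependency is impossible once the parent has finished normally.
        return parent_state is JobState.DONE
    if condition == "ended":
        # ended() is met by either final state, so it never orphans the child.
        return False
    raise ValueError(f"condition not covered by this sketch: {condition}")


# A child with done(JobA) becomes an orphan when JobA ends abnormally.
print(can_never_be_satisfied("done", JobState.EXIT))
```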

 

In some cases, a large number of jobs may be submitted, but many never run because their dependency conditions are never satisfied. For example, you may submit job A to do some precalculation, and job B may consist of hundreds of analysis jobs that depend on the inputs that job A generates. If job A fails, hundreds of jobs wait for a condition that will never be true. They become orphan jobs and remain pending in the LSF system.

 

Keeping orphan jobs in the system can degrade performance. Pending orphan jobs consume system resources unnecessarily and add load to the daemons, which can impact their ability to do useful work. You could use external scripts to monitor and terminate orphan jobs, but that would add more work to mbatchd.

 

This feature gives LSF the capability to automatically terminate orphan jobs. There are two ways to use the automatic orphan termination feature:

 

Define a cluster-wide termination grace period

To avoid prematurely killing dependent jobs that users may still want to keep, LSF terminates a dependent job only after a configurable grace period has elapsed. The orphan termination grace period is the minimum amount of time that a child job must wait, starting from the point when its dependency becomes invalid, before it is eligible for automatic orphan termination.

 

mbatchd periodically scans the job list and identifies jobs whose dependencies can never be met. The number of job dependencies to evaluate per session is controlled by the cluster-wide parameter EVALUATE_JOB_DEPENDENCY. If an orphan is detected and it meets the grace period criteria, mbatchd kills the orphan as part of dependency evaluation processing.

 

Due to various implementation and run-time factors (such as how busy mbatchd is serving other requests), the actual elapsed time prior to automatically killing dependent jobs can be longer than the specified grace period. But LSF ensures the dependent jobs are terminated only after at least the grace period has elapsed.
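
The gap between the grace period and the actual termination time can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that scans happen at a fixed interval; the function name and the first_scan and scan_interval parameters are hypothetical and do not correspond to LSF settings.

```python
import math


def actual_termination_time(orphaned_at: int, grace_period: int,
                            first_scan: int, scan_interval: int) -> int:
    """Return the time of the first scan at or after the grace period expires.

    All values are in seconds. The fixed scan interval is an assumption of
    this sketch; a real mbatchd may scan at irregular times.
    """
    deadline = orphaned_at + grace_period
    if deadline <= first_scan:
        return first_scan
    # Number of whole intervals needed to reach or pass the deadline.
    scans_needed = math.ceil((deadline - first_scan) / scan_interval)
    return first_scan + scans_needed * scan_interval


# Orphaned at t=0 with a 90-second grace period and scans every 60 seconds:
# the job is eligible at t=90 but is only seen by the scan at t=120.
print(actual_termination_time(0, 90, 0, 60))
```

The result is never earlier than orphaned_at + grace_period, which matches the guarantee that jobs are terminated only after at least the grace period has elapsed.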

 

For multiple dependent jobs in a dependency tree, the grace period is not repeated at each dependency level. This avoids taking an extremely long time to terminate all dependent jobs in a large dependency tree. When a job is killed, its entire sub-tree of orphaned dependents can be killed after the grace period has expired.
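
As a rough illustration of this behavior, the following Python sketch assigns one shared deadline to every job in the sub-tree of a killed parent. The children mapping and both function names are hypothetical, and the sketch assumes every descendant becomes an orphan when its ancestor is killed.

```python
from collections import deque
from typing import Dict, List


def orphan_subtree(children: Dict[str, List[str]], killed_job: str) -> List[str]:
    """Return all direct and indirect dependents of killed_job (breadth-first)."""
    found: List[str] = []
    visited = set()
    queue = deque(children.get(killed_job, []))
    while queue:
        job = queue.popleft()
        if job in visited:
            continue
        visited.add(job)
        found.append(job)
        queue.extend(children.get(job, []))
    return found


def earliest_termination(children: Dict[str, List[str]], killed_job: str,
                         killed_at: int, grace_period: int) -> Dict[str, int]:
    """Map each orphan to the earliest time it may be terminated.

    The grace period is applied once, from the time the parent was killed,
    rather than once per dependency level.
    """
    deadline = killed_at + grace_period
    return {job: deadline for job in orphan_subtree(children, killed_job)}


# JobA -> JobB, JobC; JobB -> JobD. All three orphans share one deadline.
tree = {"JobA": ["JobB", "JobC"], "JobB": ["JobD"]}
print(earliest_termination(tree, "JobA", killed_at=100, grace_period=90))
```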

 

The elapsed time for ORPHAN_JOB_TERM_GRACE_PERIOD is carried over after a restart, so the grace period does not start over when LSF restarts.

 

For example, to use a cluster-wide termination grace period:

  1. Set ORPHAN_JOB_TERM_GRACE_PERIOD=90 in lsb.params.
  2. Run badmin reconfig to have the changes take effect.
  3. Submit a parent job. For example:
    bsub -J "JobA" sleep 100
  4. Submit child jobs. For example:
    bsub -w "done(JobA)" sleep 100
  5. (Optional) Use commands such as bjobs -l, bhist -l or bparams -l to query orphan termination settings. For example:
    bparams -l
    Grace period for the automatic termination of orphan jobs:
             ORPHAN_JOB_TERM_GRACE_PERIOD = 90 (seconds)
  6. The parent job is killed. Some orphan jobs must wait for the grace period to expire before they can be terminated by LSF.
  7. Use commands such as bjobs -l, bhist -l or bacct -l to query orphaned jobs terminated by LSF. For example:
    bacct -l <dependent job ID/name>:
    Job <job ID>, User <user1>, Project <default>, Status <EXIT>, Queue <normal>,
    Command <sleep 100>
    Thu Jan 23 14:26:27: Submitted from host <hostA>, CWD <$HOME/lsfcluster/conf>;
    Thu Jan 23 14:26:56: Completed <exit>; TERM_ORPHAN_SYSTEM: orphaned job terminated automatically by LSF.
    Accounting information about this job:
    CPU_T  WAIT TURNAROUND STATUS HOG_FACTOR MEM SWAP
     0.00    29         29   exit     0.0000  0M  0M

Enforce automatic orphan termination on a per-job basis

The -ti sub-option of -w for bsub (that is, bsub -w 'dependency_expression' [-ti]) lets users mark a job as eligible for automatic, immediate termination by the system as soon as the job is found to be an orphan, without waiting for the grace period to expire. The behavior is enforced even if automatic orphan termination is not enabled at the cluster level. This is useful if a user does not want to use the grace period set by the administrator, or if automatic orphan termination is not enabled in the cluster.

 

Note that for bmod, -ti is a command option, not a sub-option, and you do not need to re-specify the original dependency expression from the -w option submitted with bsub.

 

This is also useful in the design of experimental scenarios where a job spawns additional jobs to self-propagate through a problem, similar to solving a maze. When a junction is reached, new jobs are spawned to search each possible direction, and the process repeats at each junction. From one initial job, you can get a complex tree structure until one job reaches a solution. At that point, all the other jobs are no longer needed. If you kill the other running jobs, all their dependent jobs are orphaned and should be terminated.

 

With the -ti option, LSF terminates a job only once mbatchd has detected it, evaluated its dependency, and determined it to be an orphan. This means you may not see the job terminate immediately.
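
How the per-job -ti setting combines with the cluster-wide grace period can be summarized in a short Python sketch. The function name and its arguments are hypothetical; None stands for "ORPHAN_JOB_TERM_GRACE_PERIOD is not defined", and the sketch models only the rules stated in this readme.

```python
from typing import Optional


def effective_grace_period(cluster_grace: Optional[int], ti: bool) -> Optional[int]:
    """Return the grace period in seconds that applies to a dependent job.

    Returns None when the job is not subject to automatic orphan termination.
    """
    if ti:
        # -ti: eligible as soon as the job is found to be an orphan,
        # even if the feature is not enabled at the cluster level.
        return 0
    # Without -ti, the cluster-wide setting decides (None means disabled).
    return cluster_grace


print(effective_grace_period(90, ti=False))    # cluster grace period applies
print(effective_grace_period(90, ti=True))     # -ti skips the grace period
print(effective_grace_period(None, ti=False))  # no automatic termination
```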

 

For example, to enforce automatic orphan termination on a per-job basis:

  1. Submit a parent job. For example:
    bsub -J "JobA" sleep 100
  2. Submit child jobs with the -ti option to ignore the grace period. For example:
    bsub -w "done(JobA)" -J "JobB" -ti sleep 100
  3. (Optional) Use commands such as bjobs -l, bhist -l or bparams -l to query
    orphan termination settings. For example:
    bhist -l <dependent job ID/name>:
    Job <135>, Job Name <JobB>, User <user1>, Project <default>, Command <sleep 100>
    Thu Jan 23 13:25:35: Submitted from host <hostA>, to Queue <normal>, CWD
    <$HOME/lsfcluster/conf>, Dependency Condition <done(JobA)>
    - immediate orphan termination for job <Y>;
  4. The parent job is killed. LSF immediately and automatically kills the orphan jobs submitted with the -ti sub-option.
  5. Use commands such as bjobs -l, bhist -l or bacct -l to query orphaned jobs terminated by LSF. For example:
    bjobs -l <dependent job ID/name>:
    Job <135>, Job Name <JobB>, User <user1>, Project <default>, Status <EXIT>,
    Queue <normal>, Command <sleep 100>
    Thu Jan 23 13:25:42: Submitted from host <hostA>, CWD <$HOME/lsfcluster/conf/
    sbatch/lsfcluster/configdir>, Dependency Condition
    <done(JobA)> - immediate orphan termination for job <Y>;
    Thu Jan 23 13:25:49: Exited
    Thu Jan 23 13:25:49: Completed <exit>; TERM_ORPHAN_SYSTEM:
    orphaned job terminated automatically by LSF.
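
If you capture this command output in a script, a minimal Python check for the termination reason might look like the sketch below. The function name is hypothetical, the sample text is copied from the output shown above, and the sketch assumes the TERM_ORPHAN_SYSTEM token is never split across wrapped lines.

```python
def terminated_as_orphan(command_output: str) -> bool:
    """Return True if the captured output reports automatic orphan termination."""
    return "TERM_ORPHAN_SYSTEM" in command_output


# Sample text taken from the bjobs -l output in this readme.
SAMPLE = (
    "Thu Jan 23 13:25:49: Exited\n"
    "Thu Jan 23 13:25:49: Completed <exit>; TERM_ORPHAN_SYSTEM:\n"
    "orphaned job terminated automatically by LSF.\n"
)

print(terminated_as_orphan(SAMPLE))
```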

 

How LSF uses automatic orphan job termination

·         Orphan jobs terminated automatically by LSF are logged in lsb.events and lsb.acct. For example, you may see the following in lsb.events:
"JOB_SIGNAL" "9.12" 1390855455 9431 -1 1 "KILL" 0 "system" "" -1 "" -1

o    If it is set to 1, a child job's dependency is evaluated based on the most recently submitted parent job with that name. So killing an older parent with that job name does not affect the child and does not cause it to become an orphan.

o    If it is not set, a child job's dependency is evaluated based on all previous parent jobs with that name. So killing any previous parent with that job name impacts the child job and causes it to become an orphan.

 

ORPHAN_JOB_TERM_GRACE_PERIOD

 

Syntax

ORPHAN_JOB_TERM_GRACE_PERIOD=seconds

 

Description

If defined, enables automatic orphan job termination at the cluster level, which applies to all dependent jobs in the cluster; otherwise, it is disabled. This parameter also defines the cluster-wide termination grace period, which tells LSF how long to wait before killing orphan jobs.

·   ORPHAN_JOB_TERM_GRACE_PERIOD = 0: Automatic orphan job termination is enabled in the cluster but no termination grace period is defined. A dependent job can be terminated as soon as it is found to be an orphan.

·   ORPHAN_JOB_TERM_GRACE_PERIOD > 0: Automatic orphan job termination is enabled and the termination grace period is set to the specified number of seconds. This is the minimum time LSF will wait before terminating an orphan job. In a multi-level job dependency tree, the grace period is not repeated at each level, and all direct and indirect orphans of the parent job can be terminated by LSF automatically after the grace period has expired.

The valid range of values is any integer greater than or equal to 0 and less than 2147483647.
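
The documented range can be checked before a value is put into the configuration. The following Python sketch is a hypothetical helper, not part of LSF; it parses a line in the Syntax form shown above and applies the stated bounds.

```python
# Upper bound stated in this readme (exclusive).
MAX_EXCLUSIVE = 2147483647


def parse_grace_period(line: str) -> int:
    """Parse 'ORPHAN_JOB_TERM_GRACE_PERIOD=seconds' and validate the range."""
    name, sep, value = line.partition("=")
    if not sep or name.strip() != "ORPHAN_JOB_TERM_GRACE_PERIOD":
        raise ValueError(f"not an ORPHAN_JOB_TERM_GRACE_PERIOD setting: {line!r}")
    seconds = int(value.strip())  # raises ValueError for non-integer input
    if not 0 <= seconds < MAX_EXCLUSIVE:
        raise ValueError(f"value out of range: {seconds}")
    return seconds


print(parse_grace_period("ORPHAN_JOB_TERM_GRACE_PERIOD=90"))
```

A value of 0 is accepted, which corresponds to enabling the feature without a grace period.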

 

Default

Not defined

 

Commands affected

2.   Download location

http://www.ibm.com/eserver/support/fixes/

 

3.   Products or components affected

Product/Component Name, Platform, Fix ID:

Platform LSF v9.1.2, Linux x86-64, lsf-9.1.2-build229753

4.   System requirements

Linux x86-64

5.   Known issues

None

6.   Known limitations

LSF takes a best-effort approach to discovering orphaned jobs in a cluster, meaning that there is no guarantee that all jobs whose dependencies can never be satisfied are identified and reported as orphans.

For jobs that have invalid dependencies, bjdepinfo shows the dependency as not satisfied.

7.   Installation prerequisites

LSF 9.1.2 already installed

8.   Installation and configuration

8.1  Before installation

1)     Log on to the LSF master host as root (LSF_TOP: Full path to the top-level installation directory of LSF.)

2)      Set your environment:

·         For csh or tcsh:  % source LSF_TOP/conf/cshrc.lsf

·         For sh, ksh, or bash:  $ . LSF_TOP/conf/profile.lsf

8.2  Installation steps

1)     Go to the patch install directory:
cd $LSF_ENVDIR/../9.1/install/

2)     Copy the patch file to install directory $LSF_ENVDIR/../9.1/install/

3)     Run patchinstall: ./patchinstall <patch>

4)     Run badmin mbdrestart

8.3  After installation

None

8.4  Uninstalling

To roll back a patch:

1) Run as root

2) Run ./patchinstall -r <patch>

3) Run badmin mbdrestart.

 

9.   List of fixes

N/A

 

10.   List of files

mbatchd, bacct, badmin, bhist, bjobs, bmod, bparams, brestart, bsub

 

11.       Copyright and trademark information

© Copyright IBM Corporation 2014

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.