IBM Spectrum LSF 10.1 Fix Pack 2 (442293) Readme File
Abstract
LSF Version 10.1 Fix Pack 2. This Fix Pack includes the issues resolved and the solutions delivered between 21 October 2016 and 15 February 2017. For detailed descriptions of the issues and solutions in this Fix Pack, refer to the LSF 10.1 Fix Pack 2 Fixed Bugs List (lsf10.1.0.2_fixed_bugs.pdf), which can be downloaded from Fix Central via fix ID lsf-10.1.0.2-spk-2017-Feb-build442293.
Description
Readme documentation for IBM Spectrum LSF 10.1 Fix Pack 2 (442293), including installation-related instructions, prerequisites and co-requisites, and a list of fixes.
The new issues addressed in LSF Version 10.1 Fix Pack 2:
ID Fixed Date Description
P102110
2017/03/31
This fix enhances LSF security of authorizing user credentials for the data stream between LSF clients and servers. It addresses CVE-2017-1205.
P102095
2017/02/10
If the parameter JOB_INCLUDE_POSTPROC is set to Y in lsb.params, a host will enter an "unreach" status when multiple jobs are dispatched to it simultaneously.
P102084
2017/02/15
When mbatchd exits abnormally, either by itself or if it is killed by a signal, the sbatchd on the master host cannot start a new mbatchd and keeps logging the following message:
P102083
2017/01/27
Interactive jobs cannot run when bsub gets input from a pipe or file. For example:
P102081
2017/01/24
The mbatchd might core dump if the job has dependency conditions that are changed using the bmod command.
P102075
2017/01/11
Job report output does not show start time stamp or report time stamp.
P102064
2017/01/18
When the total memory of all hosts in the cluster exceeds 2147483647 MB there is a warning message in mbatchd log:
P102058
2017/01/12
The child mbatchd core dumps when the job priority changes.
P102053
2016/12/23
For interactive jobs, when the job queue-level pre-execution function is enabled, the runtime limit does not work.
P102052
2016/12/23
If RESOURCE_RESERVE_PER_TASK is configured in the lsb.params file and mem is not defined as a reservation resource, a job with the requirement "rusage[mem=X] span[block=Y]" cannot run on a host even if the resource requirement is satisfied.
P102034
2016/12/18
When LSB_BJOBS_CHUNK_JOB_START_TIME is set to y or Y, bjobs -l displays the job start time for chunk jobs (instead of the chunk start time).
P102027
2016/12/09
When the queue is configured with SLOT_RESERVE or RESOURCE_RESERVE and there are many pending jobs in the queue, the scheduler takes a long time to make the reservation and the mbatchd daemon automatically restarts the scheduler.
P102026
2016/12/09
When the "-env" option is specified in the bsub command, the LSF_SUB4_SUB_ENV_VARS environment variable cannot be found in the $LSB_SUB_PARM_FILE.
P101991
2016/12/11
When LSB_EXTENDED_RSRCREQ_STR=Y is configured in the lsf.conf file, the LSF system rejects jobs with resource requirement strings greater than 511 characters in length.
P101990
2016/11/23
If some LSF job processes have the same pgid as the job RES (for sequential jobs) or task RES (for parallel jobs), but these processes are not in the cgroup that LSF created for the job, when all processes in the cgroup are done or exited, the job cannot be finished. The bjobs command always shows the job as running.
P101984
2016/11/28
When a job is finished, and is still in the memory of the mbatchd daemon, the lsb_readjobinfo API returns a value of zero for the job's memory and swap usage.
P101976
2017/01/20
If the version of Kerberos is 1.12 and above, LSF fails to renew a ticket when using the LSF Ticket Granting Ticket forwarding feature.
P101974
2016/11/15
This fix is for two issues:
P101973
2016/11/09
In MultiCluster lease mode, if a job used a leased-in resource with a duration in rusage, subsequent jobs cannot use this resource until the previous job is finished.
P101972
2016/11/09
Suspended jobs do not show the suspend reason in the IBM Spectrum LSF Application Center pending reason section.
P101959
2016/11/03
The previous way for LSF to calculate a host’s effective run queue length was complicated and hard to use. This made it difficult to correctly set the host’s load-based threshold. This fix provides a simpler way to calculate run queue length. The new behavior is parameter controlled, enabled by defining the parameter LSF_LIM_MULTICORE_ADJUST in lsf.conf.
P101957
2016/10/24
When one job fails to do a checkpoint, and exits due to other reasons, LSF cannot set the LSB_JOBEXIT_INFO environment variable value.
P101956
2016/11/03
When daemons are started on slave servers, LSB_SHAREDIR is automatically mounted on slave servers.
136800
2017/01/11
If the LSF external scheduler API is used to make an external scheduler plugin, and if the callback routine that adjusts the allocation decision returns SCH_MOD_DECISION_PENDJOB directly, then after submitting some jobs the memory image size of the mbschd process will continue to increase.
123437
2016/10/26
When the if/else branches of a time window are defined with overlapping times, mbatchd will perform more reconfig actions.
The new solutions in LSF Version 10.1 Fix Pack 2:
ID Fixed Date Description
117458 2017/02/10
This enhancement adds three new features to LSF Resource Connector:
RFE#77671 2017/02/10
This enhancement to the LSF command UI adds new options and output modifications to the bjobs command as follows:
RFE#83517 2017/02/10
This enhancement to the LSF command UI adds new options and output modifications to the bacct, bclusters, bhist, bhosts, bqueues, and lsinfo commands as follows:
RFE#85943 2017/01/26
This solution adds useful information to the preemption suspend reason by adding the job ID of the preemptive job to the suspend reason of the preempted job. This allows users to determine who preempted the job by running the bjobs -l and bjobs -s commands.
RFE#98596 2017/01/25
This solution introduces a new parameter LSB_JOB_REPORT_FILTER in lsf.conf, and a corresponding environment variable, to allow users to select the information added to a job's report. Valid values are: all, jobinfo, rusage, stdout, stderr.
RFE#83087 2017/01/18
This solution allows certain users (root, the primary LSF administrator, the parent group owner, and the current group owner) to change the job group owner in LSF by using the bgmod command.
RFE#77813 2017/01/17
This solution enables LSF_UNIT_FOR_LIMITS to define a unit for "tmp", and also supports the specified unit in resource requirements and limits in configuration files and on the command line. A new parameter LSF_ENABLE_TMP_UNIT in lsf.conf enables LSF_UNIT_FOR_LIMITS to support limits on "tmp".
127009 2017/01/10
This enhancement adds a new parameter EVALUATE_JOB_DEPENDENCY_TIMEOUT in lsb.params. It limits the amount of time mbatchd spends on evaluating job dependencies in an mbatchd session.
127013 2017/01/10
The default value of the LSF_HOST_CACHE_NTTL parameter in the lsf.conf file is 20 seconds, which is too small in most cases. This fix increases the default value from 20 seconds to 60 seconds.
127010 2017/01/10
This fix includes the intelligent CPU binding feature, which enables LSF to automatically bind critical LSF master daemons (lim, mbatchd, mbschd) to the CPU cores based on the detected hardware information on the LSF master or master candidate host.
P102016 2017/01/10
This solution allows LSF to run jobs in Singularity or Shifter containers on demand. LSF manages the entire life cycle of jobs running in these containers as common jobs.
114275 2017/01/08
This enhancement introduces the new API lsb_wait and the new command bwait to get notifications from LSF when the wait condition changes. The wait condition that is specified in the new API and command uses the same syntax as the existing job dependency syntax.
RFE#81801 2017/01/05
This enhancement allows LSF to accurately account the number of slots consumed by jobs with affinity requirements. LSF automatically adjusts the number of slots based on the number of affinity CPUs that are allocated for the job.
P102015 2016/12/20
This enhancement introduces a new keyword "$LSB_CONTAINER_IMAGE" that is defined in the CONTAINER parameter of the lsb.applications file. When this keyword is defined, an LSF user replaces this value by defining the LSB_CONTAINER_IMAGE environment variable at job submission time. The job fails if this environment variable is not defined.
127012 2016/12/06
This enhancement makes mbatchd parallel restart the default mbatchd restart behavior (instead of serial restart). When using badmin mbdrestart, LSF uses parallel mbatchd restart. This fix adds a new option -s (badmin mbdrestart -s) to use mbatchd serial restart instead.
RFE#92854 2016/12/01
In the current LSF job mail report for job start mail and job done mails, some action times (such as submission time and terminate time) are missing.
112279 2016/11/30
This solution integrates DCGM (Data Center GPU Manager) with LSF to report a job's GPU utilization and to check the DCGM status before the job starts.
Readme file for: IBM® Spectrum LSF
Product/Component Release: 10.1
Update Name: Fix 442293
Fix ID: lsf-10.1.0.2-spk-2017-Feb-build442293
Publication date: 14 April 2017
Last modified date: 14 April 2017
Contents:
1. List of fixes
2. Download location
3. Products or components affected
4. System requirements
5. Installation and configuration
6. List of files
7. Product notifications
8. Copyright and trademark information
1. List of fixes
P102110, P102095, P102084, P102083, P102081, P102075, P102064, P102058, P102053, P102052, P102034, P102027,
2. Download location
Download Fix 442293 from the following location: http://www.ibm.com/eserver/support/fixes/
3. Products or components affected
Components affected by the new issues addressed in LSF Version 10.1 Fix Pack 2 include:
4. System requirements
Linux2.6-glibc2.3-x86_64
5. Installation and configuration
5.1 Before installation
LSF_TOP=Full path to the top-level installation directory of LSF.
1) Log on to the LSF master host as root
2) Set your environment:
- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf
5.2 Installation steps
1) Go to the patch install directory: cd $LSF_ENVDIR/../10.1/install/
2) Copy the patch file to the install directory $LSF_ENVDIR/../10.1/install/
3) Run
4) Run patchinstall: ./patchinstall <patch>
5.3 After installation
1) Run
2) Run
3) Run
5.4 Uninstallation
To roll back a patch:
1) Log on to the LSF master host as root
2) Set your environment:
- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf
3) Run
4) Run ./patchinstall -r <patch>
5) Run
6) Run
7) Run
6. List of files in package
filelist.txt
7. Product notifications
To receive information about product solution and patch updates automatically, subscribe to product notifications on the My notifications page (www.ibm.com/support/mynotifications) on the IBM Support website (support.ibm.com). You can edit your subscription settings to choose the types of information you want to get notification about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.
8. Copyright and trademark information
© Copyright IBM Corporation 2017
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM®, the IBM logo and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.
Message logged for P102084:
houseKeeping: kill mbatchd
Examples for P102083:
Case 1: echo sh | bsub -I (Note: Only -I is supported for this fix; -Is, -Ip, -ISs, and -ISp are not supported.)
Case 2: bsub -I < file (Note: "bash", "csh", "ksh", or "sh" is the command in the file.)
Warning message for P102064:
"The total memory in the cluster used for calculating the fairshare adjustment exceeds the maximum and has been reset to 2147483647."
Additional note for P101976:
Kerberos provides two pre-authentication mechanisms. If the preauth attribute has been configured for the krbtgt principal on Kerberos 1.12 and above, the administrator should configure the pre-authentication mechanism in kdc.conf and krb5.conf.
The two issues fixed by P101974:
1. In a multi-cluster environment, when LC_TRACE is set for LSB_DEBUG_MBD in lsf.conf, a child mbatchd that sends jobs to the scheduler may crash if there are jobs forwarded to a remote cluster. This causes the scheduler to keep exiting and not schedule jobs.
2. The mbatchd daemon crashes after a job clean action on a job that is terminated after triggering the hung job removal policy, defined through the REMOVE_HUNG_JOBS_FOR parameter.
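The value 2147483647 in the P102064 warning message is the largest value that a signed 32-bit integer can hold (2^31 - 1). The following is a minimal sketch of how a cluster-wide memory total can pass that value and what a reset to the maximum looks like. It is an illustration only, not LSF code; the function name and the host sizes are hypothetical.

```python
# Largest value of a signed 32-bit integer; equal to the cap named in the warning.
INT32_MAX = 2**31 - 1


def capped_total_mem_mb(host_mem_mb):
    """Sum per-host memory in MB and cap the result at INT32_MAX.

    Hypothetical helper for illustration; not how LSF is implemented.
    """
    total = sum(host_mem_mb)
    return min(total, INT32_MAX)


# Hypothetical cluster: 3000 hosts with 1 TB (1048576 MB) of memory each.
hosts = [1048576] * 3000
print("total exceeds cap:", sum(hosts) > INT32_MAX)
print("capped total (MB):", capped_total_mem_mb(hosts))
```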
The three features added by 117458:
1. The cluster administrator can configure LSF Resource Connector to run a custom pre-script before a new cloud instance is created and a post-script after the instance has been terminated. The scripts are executed on the LSF master host. Refer to https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_connector/lsf_rc_pre_post_prov.html.
2. Additional configuration options are provided to control the rate at which new cloud instances are created. Refer to https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_connector/lsf_rc_custom_policy.html.
3. Removal of cloud instances will align with billing periods. For example, AWS instances will be kept for the full billing hour.
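To make the billing-period alignment concrete, the following sketch computes the end of the billing hour that is in progress for an instance. This readme does not describe how LSF Resource Connector computes the boundary, so the sketch assumes that billing periods are counted from the instance launch time. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def end_of_billing_period(launched_at, now, period=timedelta(hours=1)):
    """Return the end of the billing period that contains `now`.

    Assumption (not from the readme): periods are counted from launch time.
    """
    if now < launched_at:
        raise ValueError("now must not be earlier than launched_at")
    elapsed = now - launched_at
    # Whole periods already completed, plus the one currently in progress.
    periods_started = elapsed // period + 1
    return launched_at + periods_started * period


launched = datetime(2017, 2, 10, 10, 15)
idle_since = datetime(2017, 2, 10, 10, 50)
# An instance that becomes idle at 10:50 would be kept until this time.
print(end_of_billing_period(launched, idle_since))
```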
Additional IDs for the RFE#77671 entry: RFE#96098, RFE#87525, RFE#98593, RFE#99791, RFE#100341
Changes to the bjobs command (RFE#77671):
- Adds support for more fields in the customized output format.
- Adds a PSUSP column to the -sum option output.
- Adds a new option -json for use with the -o option to display the customized output in JSON format.
- Adds a new option -ignorebadjobid to remove unmatched jobs from the output.
- Adds a new option -hms to format display time in the customized output as hh:mm:ss.
- Adds LSB_HMS_TIME_FORMAT in lsf.conf and an environment variable to format display times in the customized output as hh:mm:ss. This affects fields that display the time, including the following: plimit_remain, eplimit_remain, action_warning_time, pend_time, ependtime, ipendtime, estimated_run_time, ru_utime, ru_stime, run_time, runtimelimit, effective_plimit, effective_eplimit, plimit, eplimit, cpu_used, time_left
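As an illustration of the hh:mm:ss display format, the following sketch converts a number of seconds to that form. It is not taken from LSF. The zero padding and the handling of values of 100 hours or more are assumptions, because this readme does not specify them.

```python
def format_hms(total_seconds):
    """Format a non-negative number of seconds as hh:mm:ss.

    Assumption: each part is zero-padded to at least two digits.
    """
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical run time of 3661 seconds.
print(format_hms(3661))
```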
Additional IDs for the RFE#83517 entry: RFE#82474, RFE#66061, RFE#73061, RFE#86691
Changes to the bacct, bclusters, bhist, bhosts, bqueues, and lsinfo commands (RFE#83517):
- Adds a new option -o to bqueues and bhosts to support a customized output format.
- Adds a new option -json to bqueues and bhosts for use with the -o option to display the customized output in JSON format.
- Adds a new option -noheader to bqueues and bhosts to remove column headings from the output.
- Adds LSB_BQUEUES_FORMAT and LSB_BHOSTS_FORMAT in lsf.conf and an environment variable to support customized output format.
- Adds a new option -UF to bhist and bacct to display non-formatted job detail information.
- Adds a new option -w to lsinfo and bclusters to display the information in wide format.
Additional note for RFE#83087:
To change the job group owner, use the new -u option with bgmod.
Additional ID for the RFE#77813 entry: RFE#93489
Additional note for 127009:
If you define both EVALUATE_JOB_DEPENDENCY and EVALUATE_JOB_DEPENDENCY_TIMEOUT, only EVALUATE_JOB_DEPENDENCY_TIMEOUT takes effect and EVALUATE_JOB_DEPENDENCY is ignored.
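The precedence between EVALUATE_JOB_DEPENDENCY and EVALUATE_JOB_DEPENDENCY_TIMEOUT can be sketched as a small lookup. This is an illustration of the stated rule only. The function name and the dictionary that stands in for lsb.params are hypothetical, and the values shown are placeholders.

```python
def effective_dependency_parameter(params):
    """Return the (name, value) pair that takes effect, or None if neither is set.

    EVALUATE_JOB_DEPENDENCY_TIMEOUT is checked first, so it wins when both
    parameters are defined.
    """
    for name in ("EVALUATE_JOB_DEPENDENCY_TIMEOUT", "EVALUATE_JOB_DEPENDENCY"):
        if name in params:
            return name, params[name]
    return None


# Placeholder values; both parameters are defined, so only the timeout applies.
lsb_params = {"EVALUATE_JOB_DEPENDENCY": "100", "EVALUATE_JOB_DEPENDENCY_TIMEOUT": "5"}
print(effective_dependency_parameter(lsb_params))
```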
Additional note for 127010:
Enable this feature by setting the LSF_INTELLIGENT_CPU_BIND parameter to Y in the lsf.conf file.
Additional details for RFE#92854:
This enhancement adds the submission time, start time, and terminate time for job start/done mail and supports the epoch second format display in start/done/post-exec mail.
The parameter LSB_MAIL_TIMESTAMP_EPOCH_SECONDS is added to lsf.conf. When the value is set to Y or y, the epoch second time is used.
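Epoch seconds are the number of seconds since 1970-01-01 00:00:00 UTC. The following sketch shows the conversion in both directions, which can help when reading time stamps in that format. It uses only the Python standard library; the helper names are hypothetical and the exact layout of the LSF mail report is not shown here.

```python
from datetime import datetime, timezone


def to_epoch_seconds(dt):
    """Return whole seconds since 1970-01-01 00:00:00 UTC for an aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("dt must be timezone-aware")
    return int(dt.timestamp())


def from_epoch_seconds(seconds):
    """Return the UTC datetime for a number of epoch seconds."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hypothetical submission time.
submitted = datetime(2017, 2, 10, 0, 0, 0, tzinfo=timezone.utc)
print(to_epoch_seconds(submitted))
print(from_epoch_seconds(to_epoch_seconds(submitted)).isoformat())
```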
Additional details for 112279:
This solution introduces a new parameter LSF_DCGM_PORT to the lsf.conf file. LSF uses this port to communicate with the DCGM daemon. Define this parameter to enable DCGM features with LSF.
For further details on these solutions, refer to https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_release_notes/lsf_relnotes_whatsnew10.1.0.2.html
List of fixes (continued from section 1):
P102026, P101991, P101990, P101984, P101976, P101974, P101973, P101972, P101959, P101957, P101956, 136800 (No APAR),
123437 (No APAR), 117458 (No APAR), RFE#77671, RFE#96098, RFE#87525, RFE#98593, RFE#99791, RFE#100341, RFE#83517,
RFE#82474, RFE#66061, RFE#73061, RFE#86691, RFE#85943, RFE#98596, RFE#83087, RFE#77813, RFE#93489, 127009 (No APAR),
127013 (No APAR), 127010 (No APAR), P102016, 114275 (No APAR), RFE#81801, P102015, 127012 (No APAR), RFE#92854, 112279 (No APAR)
Components affected (list for section 3):
LSF/lsbatch.h
LSF/lsf.h
LSF/lssched.h
LSF/bacct
LSF/badmin
LSF/bapp
LSF/bclusters
LSF/bconf
LSF/bgadd
LSF/bgmod
LSF/bhist
LSF/bhosts
LSF/bjobs
LSF/blaunch
LSF/blimits
LSF/bmod
LSF/bparams
LSF/bqueues
LSF/bresources
LSF/brestart
LSF/bresume
LSF/brsvadd
LSF/brsvmod
LSF/bslots
LSF/bsub
LSF/lsadmin
LSF/lsgrun
LSF/lshosts
LSF/lsinfo
LSF/lsload
LSF/lsloadadj
LSF/lslogin
LSF/lsmake
LSF/lsmake4
LSF/lsmon
LSF/lsplace
LSF/lsreghost
LSF/lsrun
LSF/bwait
LSF/lsid
LSF/cal_jobweight.so
LSF/libbat.a
LSF/libbat.so
LSF/liblsbstream.so
LSF/liblsf.a
LSF/liblsf.so
LSF/libptmalloc3.so
LSF/schmod_advrsv.so
LSF/schmod_affinity.so
LSF/schmod_aps.so
LSF/schmod_bluegene.so
LSF/schmod_cpuset.so
LSF/schmod_craylinux.so
LSF/schmod_crayx1.so
LSF/schmod_dc.so
LSF/schmod_default.so
LSF/schmod_demand.so
LSF/schmod_dist.so
LSF/schmod_fairshare.so
LSF/schmod_fcfs.so
LSF/schmod_jobweight.so
LSF/schmod_limit.so
LSF/schmod_mc.so
LSF/schmod_parallel.so
LSF/schmod_preemption.so
LSF/schmod_ps.so
LSF/schmod_pset.so
LSF/schmod_reserve.so
LSF/schmod_rms.so
LSF/schmod_xl.so
LSF/eauth.krb5
LSF/ebrokerd
LSF/ego_client
LSF/egosc
LSF/krbrenewd
LSF/lim
LSF/mbatchd
LSF/mbschd
LSF/mesub
LSF/nios
LSF/pim
LSF/res
LSF/rla
LSF/sbatchd
LSF/eauth.cve
LSF/libsec_ego_default.so
LSF/misc/examples/external_plugin/allocexample.c
LSF/misc/examples/external_plugin/Makefile
LSF/misc/examples/external_plugin/matchexample.c
LSF/misc/examples/external_plugin/myplugin.c
LSF/misc/examples/external_plugin/README
LSF/misc/examples/external_plugin/sch.mod.fcfs.c
LSF/resource_connector/aws/conf/awsprov_templates.json
LSF/resource_connector/aws/conf/awsprov_config.json
LSF/resource_connector/aws/conf/credentials
LSF/resource_connector/aws/lib/AwsTool.jar
LSF/resource_connector/aws/scripts/user_data.sh
LSF/resource_connector/aws/scripts/getAvailableMachines.sh
LSF/resource_connector/aws/scripts/getAvailableTemplates.sh
LSF/resource_connector/aws/scripts/getRequestStatus.sh
LSF/resource_connector/aws/scripts/getReturnRequests.sh
LSF/resource_connector/aws/scripts/requestMachines.sh
LSF/resource_connector/aws/scripts/requestReturnMachines.sh
LSF/resource_connector/openstack/scripts/Main.py
LSF/resource_connector/openstack/scripts/OpenStackClient.py
LSF/resource_connector/openstack/scripts/userscript.sh
LSF/resource_connector/openstack/scripts/MachineFile.py
LSF/resource_connector/ego/scripts/Main.py
LSF/resource_connector/policy/Main.py
LSF/resource_connector/policy/Log.py
LSF/resource_connector/policy/PolicyFile.py
LSF/util/elim.mic.ext/README
LSF/esub.p8aff (only needed on lnx3.10-glibc2.17-ppc64le)
System requirements (continued from section 4):
Lnx310-lib217-x86_64
Lnx3.10-glibc2.17-ppc64le
Commands for the "Run" steps in sections 5.2 and 5.3 (installation), in order:
badmin hclose all
badmin qinact all
badmin hshutdown all
lsadmin resshutdown all
lsadmin limshutdown all
lsadmin limstartup all
lsadmin resstartup all
badmin hstartup all
badmin mbdrestart
badmin hopen all
badmin qact all
Commands for the "Run" steps in section 5.4 (uninstallation), in order:
badmin hclose all
badmin qinact all
badmin hshutdown all
lsadmin resshutdown all
lsadmin limshutdown all
lsadmin limstartup all
lsadmin resstartup all
badmin hstartup all
badmin mbdrestart
badmin hopen all
badmin qact all
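The command sequence above can be turned into a printed checklist. The following sketch only prints the commands; it does not run them. It assumes that the patchinstall step sits between the last shutdown command (lsadmin limshutdown all) and the first startup command (lsadmin limstartup all). That placement is inferred from the order of the steps in section 5, because this readme does not state which commands belong to which "Run" step. The function name is hypothetical.

```python
# Commands that close and shut down the cluster, in the order listed in this readme.
SHUTDOWN = [
    "badmin hclose all",
    "badmin qinact all",
    "badmin hshutdown all",
    "lsadmin resshutdown all",
    "lsadmin limshutdown all",
]

# Commands that restart and reopen the cluster, in the order listed in this readme.
STARTUP = [
    "lsadmin limstartup all",
    "lsadmin resstartup all",
    "badmin hstartup all",
    "badmin mbdrestart",
    "badmin hopen all",
    "badmin qact all",
]


def patch_plan(patch, rollback=False):
    """Return the ordered command list for installing or rolling back a patch.

    Assumption: patchinstall runs after the shutdown commands and before
    the startup commands.
    """
    flag = "-r " if rollback else ""
    return SHUTDOWN + [f"./patchinstall {flag}{patch}"] + STARTUP


# Print a numbered checklist; "<patch>" is the placeholder used in section 5.
for number, command in enumerate(patch_plan("<patch>"), start=1):
    print(f"{number:2d}. {command}")
```

Calling patch_plan("<patch>", rollback=True) produces the same sequence with ./patchinstall -r <patch>, which matches the rollback command in section 5.4.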
List of files in package (continued from section 6):
fixlist.txt
lsbatch.h
lsf.h
lssched.h
bacct
badmin
bapp
bclusters
bconf
bgadd
bgmod
bhist
bhosts
bjobs
blaunch
blimits
bmod
bparams
bqueues
bresources
brestart
bresume
brsvadd
brsvmod
bslots
bsub
lsadmin
lsgrun
lshosts
lsinfo
lsload
lsloadadj
lslogin
lsmake
lsmake4
lsmon
lsplace
lsreghost
lsrun
bwait
lsid
cal_jobweight.so
libbat.a
libbat.so
liblsbstream.so
liblsf.a
liblsf.so
libptmalloc3.so
schmod_advrsv.so
schmod_affinity.so
schmod_aps.so
schmod_bluegene.so
schmod_cpuset.so
schmod_craylinux.so
schmod_crayx1.so
schmod_dc.so
schmod_default.so
schmod_demand.so
schmod_dist.so
schmod_fairshare.so
schmod_fcfs.so
schmod_jobweight.so
schmod_limit.so
schmod_mc.so
schmod_parallel.so
schmod_preemption.so
schmod_ps.so
schmod_pset.so
schmod_reserve.so
schmod_rms.so
schmod_xl.so
eauth.krb5
ebrokerd
ego_client
egosc
krbrenewd
lim
mbatchd
mbschd
mesub
nios
pim
res
rla
sbatchd
eauth.cve
libsec_ego_default.so
misc/examples/external_plugin/allocexample.c
misc/examples/external_plugin/Makefile
misc/examples/external_plugin/matchexample.c
misc/examples/external_plugin/myplugin.c
misc/examples/external_plugin/README
misc/examples/external_plugin/sch.mod.fcfs.c
packagedef.txt
resource_connector/aws/conf/awsprov_templates.json
resource_connector/aws/conf/awsprov_config.json
resource_connector/aws/conf/credentials
resource_connector/aws/lib/AwsTool.jar
resource_connector/aws/scripts/user_data.sh
resource_connector/aws/scripts/getAvailableMachines.sh
resource_connector/aws/scripts/getAvailableTemplates.sh
resource_connector/aws/scripts/getRequestStatus.sh
resource_connector/aws/scripts/getReturnRequests.sh
resource_connector/aws/scripts/requestMachines.sh
resource_connector/aws/scripts/requestReturnMachines.sh
resource_connector/openstack/scripts/Main.py
resource_connector/openstack/scripts/OpenStackClient.py
resource_connector/openstack/scripts/userscript.sh
resource_connector/openstack/scripts/MachineFile.py
resource_connector/ego/scripts/Main.py
resource_connector/policy/Main.py
resource_connector/policy/Log.py
resource_connector/policy/PolicyFile.py
util/elim.mic.ext/README
esub.p8aff (only needed on lnx3.10-glibc2.17-ppc64le)