Readme file for IBM® Spectrum LSF 10.1 fix 601504


Abstract

P104844. This fix supports the following enhancements to LSF:

·        LSF rate limiter

This fix is applicable for LSF Fix Pack 12 or later.

If you want to apply LSF 10.1 fix 601284, do so before applying this fix (601504). Otherwise, you will need to uninstall this fix and revert any configuration changes, and then install both fixes in order.

Description

Readme documentation for IBM® Spectrum LSF 10.1 fix 601504 including installation-related instructions, prerequisites and co-requisites, and list of fixes.

This fix provides the following new features and enhancements:

ID          Description

P104844     LSF now supports a new component, called the rate limiter, which prevents excessive requests to the mbatchd daemon.

Notes and restrictions for the rate limiter

Applications that use the LSF batch library should be relinked with the updated library when the rate limiter is enabled.


Context for the rate limiter

By default, all LSF batch commands contact the mbatchd daemon (or the mbatchd query child, if configured). Excessive requests, such as scripts that run bjobs commands in a tight loop, can overload mbatchd and degrade cluster performance. To protect mbatchd from heavy request loads, a new rate limiter component, lsfproxyd, acts as a gatekeeper between the commands and the mbatchd daemon.

If rate limiting is configured, the lsfproxyd daemon controls the number of requests that can reach the mbatchd daemon. A request must first obtain a token from lsfproxyd before it can contact mbatchd. After the request finishes, the token is returned to lsfproxyd. The lsfproxyd daemon distributes tokens among users in round-robin fashion, so that each user has a fair chance of being served, even under heavy request loads.
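The round-robin distribution described above can be pictured with a small Python sketch. This is an illustration only, not the lsfproxyd implementation: the function name, the per-user queues, and the request labels are all hypothetical.

```python
from collections import OrderedDict, deque


def round_robin_grants(pending, tokens_available):
    """Return the order in which pending requests would receive tokens.

    pending: list of (user, request_label) tuples in arrival order.
    Hypothetical model: one FIFO queue per user, visited in turn.
    """
    queues = OrderedDict()
    for user, label in pending:
        queues.setdefault(user, deque()).append(label)

    granted = []
    while tokens_available > 0 and queues:
        # Visit each user once per pass so that no single user drains the pool.
        for user in list(queues):
            if tokens_available == 0:
                break
            granted.append((user, queues[user].popleft()))
            tokens_available -= 1
            if not queues[user]:
                del queues[user]
    return granted


# A script looping on bjobs (alice) does not starve a single request from bob.
pending = [("alice", f"bjobs-{i}") for i in range(5)] + [("bob", "bhosts-0")]
print(round_robin_grants(pending, 3))
```

With three tokens available, the second token goes to bob even though five requests from alice arrived first.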

The lsfproxyd policy is governed by three attributes: max, nominal, and throttle. See the new PROXYD_POLICIES parameter in the lsb.params configuration file for details. The max attribute is the maximum number of tokens available for distribution. The nominal and throttle attributes work together to throttle client requests. The nominal attribute is a threshold value: if the number of tokens currently in use is below this value, lsfproxyd distributes tokens as fast as possible; otherwise, lsfproxyd waits for the throttle value, in milliseconds, before granting each new token. Policies are configured individually for three categories: query, submission, and other. The query category includes all query-related requests. The submission category includes the submission-related requests bsub, brestart, bmod, and bswitch. The other category includes all other requests.

Once the number of tokens in use for a category reaches its max value, a request in that category from an ordinary user (that is, a user who is neither root nor an LSF cluster administrator) is not granted a token. After a certain number of retries (configured by LSF_PROXYD_NOTOKEN_NTRIES in the lsf.conf configuration file), the command fails with an error message without contacting mbatchd. Requests from root or LSF cluster administrators are exempt from this policy: they are always granted a token, although they still count towards the tokens currently in use.

When a token is granted for a particular category, the in-use count is incremented; when a token is returned, the in-use count is decremented. It is possible for the in-use count to exceed the max value because no constraints are placed on root and LSF cluster administrators.
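The interaction of max, nominal, and throttle with the in-use count can be sketched in Python as follows. This is a simplified model for one category, not the actual lsfproxyd logic. The class and method names are hypothetical, and applying the throttle delay to privileged requests is an assumption that the text above does not confirm.

```python
from dataclasses import dataclass


@dataclass
class CategoryPolicy:
    """Hypothetical model of the token policy for a single request category."""
    max_tokens: int = 100
    nominal: int = 50
    throttle_ms: int = 1
    in_use: int = 0

    def request(self, privileged=False):
        """Return (granted, delay_ms) for one token request."""
        # Ordinary users are refused once the in-use count reaches max.
        if not privileged and self.in_use >= self.max_tokens:
            return (False, 0)
        # Below nominal: grant immediately. At or above nominal: wait throttle_ms.
        # Assumption: privileged requests are throttled like any other request.
        delay_ms = 0 if self.in_use < self.nominal else self.throttle_ms
        self.in_use += 1
        return (True, delay_ms)

    def release(self):
        """Return a token when the request finishes."""
        self.in_use = max(0, self.in_use - 1)


query = CategoryPolicy(max_tokens=3, nominal=2, throttle_ms=5)
print([query.request() for _ in range(4)])
print(query.request(privileged=True), query.in_use)
```

In this model the fourth ordinary request is refused, while the privileged request is still granted and pushes the in-use count above max, as the paragraph above describes.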

Configuring the rate limiter

The rate limiter component is enabled by configuring the two new LSF_PROXYD_HOSTS and LSF_PROXYD_PORT parameters in the lsf.conf configuration file:

·        LSF_PROXYD_HOSTS - Defines hosts to run lsfproxyd.
When multiple lsfproxyd hosts are defined, they work together to balance workload and provide high availability. The hosts must be LSF server hosts in the cluster.

Default value: undefined

Example: LSF_PROXYD_HOSTS="hostA hostB"

Valid values: Any valid LSF server host name.

·        LSF_PROXYD_PORT - Port number for the lsfproxyd daemon.

Default value: undefined

Example: LSF_PROXYD_PORT=1234

Valid values: Any positive integer.

The lsfproxyd daemon is started by the lim daemon. Restart lim on all the hosts mentioned in your old and new LSF_PROXYD_HOSTS parameters (by running lsadmin limrestart) for the configuration to take effect. Also, when changing the LSF_PROXYD_HOSTS parameter, the mbatchd daemon must be restarted (by running badmin mbdrestart).
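As a quick sanity check before restarting the daemons, a script can confirm that both enabling parameters are present. The following Python sketch is only an illustration: the function name is hypothetical, and it handles plain NAME=value lines only, which is an assumption about how your lsf.conf file is written.

```python
import shlex


def read_proxyd_settings(conf_text):
    """Return (hosts, port, enabled) from the text of an lsf.conf file.

    Hypothetical helper: only plain NAME=value lines are understood.
    """
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key in ("LSF_PROXYD_HOSTS", "LSF_PROXYD_PORT"):
            # shlex removes the surrounding quotes from "hostA hostB".
            settings[key] = " ".join(shlex.split(value))

    hosts = settings.get("LSF_PROXYD_HOSTS", "").split()
    port_text = settings.get("LSF_PROXYD_PORT", "")
    port = int(port_text) if port_text.isdigit() and int(port_text) > 0 else None
    # The rate limiter needs both parameters to be configured.
    return hosts, port, bool(hosts) and port is not None


sample = 'LSF_PROXYD_HOSTS="hostA hostB"\nLSF_PROXYD_PORT=1234\n'
print(read_proxyd_settings(sample))
```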


Enhancements to the badmin command

The badmin command now recognizes the new lsfproxyd keyword:

badmin lsfproxyd [ enable | disable ] [ all | query | sub | other ]
badmin lsfproxyd status
badmin lsfproxyd [ [ block | unblock ] [ all | [ -u "user1 user2 ..." ] [ -m "host1 host2 ..." ] ] ] | blocklist

Running badmin lsfproxyd enable|disable all|query|sub|other enables or disables the limiter functionality while the lsfproxyd daemon continues to run. If all is specified, the rate limiter is enabled or disabled for all request types. If query is specified, then only query requests will be affected. If sub is specified, then only submission requests will be affected. Finally, if other is specified, requests not considered a query or submission request will be affected.  If a request type is disabled, lsfproxyd will not distribute tokens for that request type, and the requests will proceed to mbatchd instead. If lsfproxyd is disabled (or the lsfproxyd daemons are all down), queries without tokens will be accepted by mbatchd:

$ badmin lsfproxyd disable query
lsfproxyd service status:
  QUERY:DISABLED
  SUBMISSION:ENABLED
  OTHER:ENABLED

Running badmin lsfproxyd status displays whether the different request types are enabled. The status information also shows if the lsfproxyd is connected to the mbatchd, its share of the token limit, how many tokens are in use, and how many tokens are in use by privileged users (root or cluster administrators). Also, metrics are reported showing requests where tokens were granted, rejected, blocked, or an error occurred. Metrics are displayed for the last (60 second) reporting period, the maximum seen during any reporting period, and the average since the lsfproxyd started:

$ badmin lsfproxyd status

lsfproxyd service status:
  QUERY: ENABLED
  SUBMISSION: ENABLED
  OTHER: ENABLED

lsfproxyd host status:
  HOSTNAME: revues1
  STATUS: CONNECTED
  PID: 3803586
             TOKEN_LIMIT  TOKENS_IN_USE_TOTAL  TOKENS_IN_USE_PRIVILEGED
  QUERY      100          20                   0
  SUBMISSION 100          0                    0
  OTHER      100          0                    0

lsfproxyd started:              Fri Mar 10 07:38:56
End time of last sample period: Mon Mar 13 06:26:56
Sample period:                  60 Seconds
------------------------------------------------------------------------------
Metrics                                           Last         Max         Avg
------------------------------------------------------------------------------
Requests
     Query                                          15          25          15
     Submission                                      1           1           1
     Other                                           1           1           1
Rejected
     Query                                           0           0           0
     Submission                                      0           0           0
     Other                                           0           0           0
Blocked                                              0           0           0
Error                                                0           0           0

When using the rate limiter, an administrator can run badmin lsfproxyd block to temporarily block ordinary users (users other than root and LSF administrators), hosts, or both, from sending requests to the mbatchd daemon. Use this command to temporarily stop abusive or misbehaving users from interacting with the LSF cluster and to avoid a performance impact on other users.

Example of blocking all users and hosts:
$ badmin lsfproxyd block all
<all> added to the blocklist on lsfproxyd host <lsfproxydhost1>

Example of removing all from a blocklist:
$ badmin lsfproxyd unblock all
<all> removed from the blocklist on lsfproxyd host <lsfproxydhost1>

Example of blocking user1 and user2:
$ badmin lsfproxyd block -u "user1 user2"
Users <user1 user2> added to the blocklist on lsfproxyd host <lsfproxydhost1>

Example of blocking hostA and hostB:

$ badmin lsfproxyd block -m "hostA hostB"
Hosts <hostA hostB> added to the blocklist on lsfproxyd host <lsfproxydhost1>

Example of blocking user1 at hostA:

$ badmin lsfproxyd block -u "user1" -m "hostA"
<user1@hostA> added to the blocklist on lsfproxyd host <lsfproxydhost1>
<user1@hostA> added to the blocklist on lsfproxyd host <lsfproxydhost2>

Example of blocking user1 and user2, at hostA and hostB:

$ badmin lsfproxyd block -u "user1 user2" -m "hostA hostB"
<user1@hostA user1@hostB user2@hostA user2@hostB> added to the blocklist on lsfproxyd host <lsfproxydhost1>
<user1@hostA user1@hostB user2@hostA user2@hostB> added to the blocklist on lsfproxyd host <lsfproxydhost2>

Example of unblocking user1:

$ badmin lsfproxyd unblock -u user1
Users <user1> removed from the blocklist on lsfproxyd host <lsfproxydhost1>
Users <user1> removed from the blocklist on lsfproxyd host <lsfproxydhost2>

Example of unblocking user1 at hostA and hostB:

$ badmin lsfproxyd unblock -u "user1" -m "hostA hostB"
<user1@hostA user1@hostB> removed from the blocklist on lsfproxyd host <lsfproxydhost1>
<user1@hostA user1@hostB> removed from the blocklist on lsfproxyd host <lsfproxydhost2>

Example to see a summary of who is currently blocked:

$ badmin lsfproxyd blocklist
lsfproxyd host - host1
  Blocked all:           No
  Blocked users:         user1 user2
  Blocked hosts:         -
  Blocked users@hosts:   user4@exechost1 user3@exechost2

lsfproxyd host - host2
  Unable to contact <host2>

lsfproxyd host - host3
  Blocked all:           No
  Blocked users:         user1 user2
  Blocked hosts:         -
  Blocked users@hosts:   user4@exechost1 user3@exechost2
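In the examples above, specifying both -u and -m produces one user@host entry for every combination of user and host. The following Python sketch reproduces that expansion as seen in the sample output. The function name is hypothetical, and the sketch does not call badmin.

```python
from itertools import product


def blocklist_entries(users=(), hosts=()):
    """Return the blocklist entries implied by the -u and -m lists."""
    if users and hosts:
        # Every user is paired with every host, users varying slowest.
        return [f"{user}@{host}" for user, host in product(users, hosts)]
    # Only one of the two lists was given: the entries are plain users or hosts.
    return list(users) or list(hosts)


print(blocklist_entries(["user1", "user2"], ["hostA", "hostB"]))
print(blocklist_entries(users=["user1", "user2"]))
```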

The blocklist is not persisted: it is cleared when lsfproxyd restarts. An lsfproxyd daemon that does not receive a blocklist update (because it was down at the time, or because it restarted after the update) will not have that blocklist.

Each block command is treated as a rule. An unblock command removes a rule only if it matches one of the existing block rules. For example, in this sequence of commands:

  1. Block user1@host1, user2@host1
  2. Block user1
  3. Unblock user1

Step 3 removes only the rule from step 2, not the rule from step 1, so user1@host1 is still blocked.

By the same logic, unblock all removes only a block all rule and leaves every other rule in place.
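The rule-matching behavior can be modeled in a few lines of Python. This is a hedged sketch of the semantics described above, not the lsfproxyd data structure. The class is hypothetical, and storing each user@host pair as its own rule is an assumption based on how entries are listed by badmin lsfproxyd blocklist.

```python
class Blocklist:
    """Hypothetical model: each rule is a (user, host) pair, None = any."""

    def __init__(self):
        self.rules = set()

    def block(self, user=None, host=None):
        # block() with no arguments models "block all".
        self.rules.add((user, host))

    def unblock(self, user=None, host=None):
        # Only an exactly matching rule is removed.
        self.rules.discard((user, host))

    def is_blocked(self, user, host):
        return any(
            (rule_user is None or rule_user == user)
            and (rule_host is None or rule_host == host)
            for rule_user, rule_host in self.rules
        )


blocklist = Blocklist()
blocklist.block("user1", "host1")   # step 1
blocklist.block("user2", "host1")   # step 1
blocklist.block("user1")            # step 2
blocklist.unblock("user1")          # step 3
print(blocklist.is_blocked("user1", "host1"))  # still blocked by step 1
print(blocklist.is_blocked("user1", "host2"))  # no longer blocked
```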


Diagnostic logging for the rate limiter

If the rate limiter is enabled, lsfproxyd logs all requests to a diagnostic log file when ENABLE_DIAGNOSE=lsfproxyd is configured in the lsb.params configuration file, or when lsfproxyd diagnosis is enabled by running the badmin diagnose -c lsfproxyd command with its options. The default file name is info.lsfproxydlog.<hostname> (similar to the diagnostic log file generated by mbatchd). The content of the lsfproxyd diagnostic log, however, differs slightly from that of the mbatchd diagnostic log.

Here is an example of an lsfproxyd diagnostic log file:

Feb 8 07:57:12 2023 QUERY_REQ,BATCH_JOB_QUERY,user1,host1,184,0x1A,1,075423,075426
Feb 8 07:57:13 2023 QUERY_REQ,BATCH_HOST_INFO,user1,host1,6624,-,1,075424,075450
Feb 8 07:57:15 2023 QUERY_REQ,BATCH_JOB_QUERY,user1,host1,184,0x1A,1,075530,075535
Feb 8 07:57:16 2023 QUERY_REQ,BATCH_HOST_INFO,user1,host1,6624,-,0,075520,075620
Feb 8 07:57:36 2023 QUERY_REQ,BATCH_HOST_INFO,user1,host1,6624,-,0,075540,075640
Feb 8 07:57:41 2023 SUBMISSION_REQ,BATCH_JOB_SUB,user1,host1,36,1,075541,075542
Feb 8 07:58:08 2023 OTHER_REQ,BATCH_REMOVE_RSV,user1,host1,88,0,075618,075618

 

Compare mbatchd’s diagnostic logging format:

DATE TIME YEAR COMMAND,USER,HOSTNAME,SIZE,OPTION

with lsfproxyd’s diagnostic logging format, which captures more information:

DATE TIME YEAR CATEGORY,BATCH_OPCODE,USER,HOSTNAME,SIZE,ACCEPT,RECEIVE TIME,PROCESS TIME

where:

·        CATEGORY: The category (QUERY_REQ, SUBMISSION_REQ, or OTHER_REQ).

·        BATCH_OPCODE: The batch operation code of the request (for example, BATCH_JOB_QUERY).

·        ACCEPT: Whether the token request was accepted or rejected: 1 for accept, 0 for reject.

·        RECEIVE TIME: Time in HHMMSS format, when the request was received by lsfproxyd.

·        PROCESS TIME: Time in HHMMSS format, when the request was processed by lsfproxyd.
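For scripts that summarize the diagnostic log, a line can be split into the fields listed above. The following Python sketch is one possible approach and the function name is hypothetical. Note that the QUERY_REQ lines in the sample carry one more comma-separated value between SIZE and ACCEPT than the documented format shows. The sketch keeps any such values in an "extra" list without interpreting them, and reads ACCEPT, RECEIVE TIME, and PROCESS TIME from the end of the line.

```python
def parse_lsfproxyd_log_line(line):
    """Split one lsfproxyd diagnostic log line into named fields."""
    month, day, clock, year, record = line.split(None, 4)
    fields = [field.strip() for field in record.split(",")]
    if len(fields) < 8:
        raise ValueError(f"unexpected record: {record!r}")

    category, opcode, user, host, size = fields[:5]
    accept, received, processed = fields[-3:]
    return {
        "timestamp": f"{month} {day} {clock} {year}",
        "category": category,
        "opcode": opcode,
        "user": user,
        "host": host,
        "size": int(size),
        "extra": fields[5:-3],      # values not covered by the documented format
        "accepted": accept == "1",  # 1 for accept, 0 for reject
        "received": received,       # HHMMSS
        "processed": processed,     # HHMMSS
    }


entry = parse_lsfproxyd_log_line(
    "Feb 8 07:57:16 2023 QUERY_REQ,BATCH_HOST_INFO,user1,host1,6624,-,0,075520,075620"
)
print(entry["category"], entry["accepted"], entry["received"], entry["processed"])
```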


Additional new parameters in the lsf.conf file for the rate limiter

In addition to the required configuration for rate limiter enablement and logging, you can also set several optional parameters in the lsf.conf configuration file:

·       LSF_DEBUG_PROXYD - Sets the log class for debugging lsfproxyd.

Default value: undefined

Example: LSF_DEBUG_PROXYD="LC_TRACE"

Restart lim where the lsfproxyd is located (by running lsadmin limrestart) for the configuration to take effect.

·       LSF_PROXYD_BYPASS - Determines how mbatchd responds to query requests without tokens.

If this parameter is not configured or set to N, mbatchd will reject query requests without a token.

When it is set to Y, mbatchd accepts requests without a token. This parameter supports a transition period during which users update their applications that are linked with the LSF library. LSF commands from older versions, or user applications linked with old LSF libraries, do not contact lsfproxyd; they contact mbatchd directly without a token. When the rate limiter is enabled and LSF_PROXYD_BYPASS is set to Y, these commands and applications still work, but warning messages in the mbatchd log identify the requests that arrived without a token. See the LSB_DEBUG_MBD parameter for details. After the transition period is over, an LSF administrator should remove this parameter or set it to N to fully enforce the rate limiter.

Default value: N|n

Example: LSF_PROXYD_BYPASS=Y

Valid values: Y|y|N|n

Restart mbatchd (by running badmin mbdrestart) for the configuration to take effect.

·        LSF_PROXYD_HEARTBEAT_INTERVAL - Sets the frequency that lsfproxyd sends a heartbeat message to mbatchd.

Default value: 30 seconds

Example: LSF_PROXYD_HEARTBEAT_INTERVAL=25

Valid values: Any positive integer

Restart lim where the lsfproxyd is located (by running lsadmin limrestart), and restart the mbatchd (by running badmin mbdrestart) for the configuration to take effect.

·       LSF_PROXYD_TOKEN_WAIT_TIME - The amount of time that a token request can age before it is considered too old.

When there are heavy requests, ordinary user requests that reach this age are explicitly denied a token. They will not contact the mbatchd daemon.

Default value: 15 seconds

Example: LSF_PROXYD_TOKEN_WAIT_TIME=10

Valid values: Any positive integer

Restart lim where the lsfproxyd is located (by running lsadmin limrestart) for the configuration to take effect.

·        LSF_ADDON_HOSTS - Any requests for tokens received from the specified hosts will be treated as privileged requests. A request for a token will be granted regardless of the current token in use count unless the user or host has been explicitly blocked.

Default value: undefined

Example: LSF_ADDON_HOSTS="hosta hostb"

Valid Values: A space separated list of hostnames

·        LSF_PROXYD_NOTOKEN_NTRIES - Number of times that an LSF command attempts to contact the lsfproxyd daemon after it is not granted a token. If this parameter is set to a value, the command attempts to contact lsfproxyd only the defined number of times and then quits.

Default value: infinite

Example: LSF_PROXYD_NOTOKEN_NTRIES=3

Valid values: Any positive integer.
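The effect of LSF_PROXYD_NOTOKEN_NTRIES on a command can be sketched as a bounded retry loop. This Python sketch is an illustration only: the function and its callback are hypothetical, and it treats the configured value as the total number of attempts, which is one reading of the description above. Any delay between attempts is not modeled.

```python
def run_with_token(request_token, ntries=None):
    """Try to obtain a token, giving up after ntries attempts.

    request_token: hypothetical callable that returns True when a token is granted.
    ntries=None models the default, where the command keeps retrying.
    Returns (granted, attempts_made).
    """
    attempts = 0
    while ntries is None or attempts < ntries:
        attempts += 1
        if request_token():
            return True, attempts
    # No token after the configured number of attempts: the command
    # fails without contacting mbatchd.
    return False, attempts


answers = iter([False, False, True])
print(run_with_token(lambda: next(answers), ntries=3))
print(run_with_token(lambda: False, ntries=3))
```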

 

Existing parameters in the lsf.conf file for the rate limiter
To use the new rate limiter component, understand how the LSB_DEBUG_MBD parameter interacts with the rate limiter:

·        LSB_DEBUG_MBD - Sets the debugging log class for mbatchd.

Specifies the log class filtering to be applied to mbatchd. Only messages belonging to the specified log class are recorded. If set to LSB_DEBUG_MBD="LC2_LSF_PROXYD", mbatchd logs information about query requests that it receives without a token. This information is logged at log level LOG_WARNING and includes the user, the request opcode, the LSF header version, and the host from which the request originated. This can be useful for identifying the use of old binaries. Restart mbatchd for the parameter to take effect.


New parameters in the lsb.params file for the rate limiter
To use the new rate limiter component, set the new PROXYD_POLICIES parameter in the lsb.params file:

·        PROXYD_POLICIES - Specifies the max and nominal tokens, and the throttle value when lsfproxyd is configured.

Syntax: PROXYD_POLICIES="QUERY[ max=maximum_query_tokens nominal=nominal_query_tokens throttle=query_throttle_time ] SUBMISSION[ max=maximum_submission_tokens nominal=nominal_submission_tokens throttle=submission_throttle_time ] OTHER[ max=maximum_other_tokens nominal=nominal_other_tokens throttle=other_throttle_time ]"

o   max - The maximum number of tokens used for specified requests.

Default value: 100 (if max is not defined and nominal is defined to more than 100, then the max value will be set to the nominal value).

Valid values: For QUERY, any positive integer less than or equal to the MAX_CONCURRENT_QUERY value; otherwise, any positive integer.

o   nominal - If the number of in-use tokens is below this threshold value, tokens are granted as quickly as possible.

Default value: 50 (if max is defined to be less than 50 and nominal is not defined, then the nominal value will be set to the max value).

Valid values: Any positive integer less than or equal to the max value.

o   throttle - If the number of in-use tokens has reached the nominal value, lsfproxyd waits for this throttle value, in milliseconds, before granting another token.

Default value: 1 millisecond

Valid values: Any positive integer in milliseconds.

The max and nominal values will be divided evenly among the running lsfproxyd daemons in the cluster.

After changing PROXYD_POLICIES, restart mbatchd (by running badmin mbdrestart) or reconfigure mbatchd (by running badmin reconfig).
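The default rules for max and nominal interact, so it can help to work through them in code. The following Python sketch parses a PROXYD_POLICIES value, applies the documented defaults, and computes a per-daemon share. It is an illustration, not the LSF parser: the function names are hypothetical, categories that are not listed are simply omitted from the result, and the use of integer division for the even split is an assumption because the handling of remainders is not described.

```python
import re

DEFAULT_MAX, DEFAULT_NOMINAL, DEFAULT_THROTTLE_MS = 100, 50, 1


def resolve_policy(attrs):
    """Apply the documented defaults to one category's attributes."""
    max_tokens = attrs.get("max")
    nominal = attrs.get("nominal")
    if max_tokens is None:
        # Default 100, raised to nominal when nominal is defined above 100.
        max_tokens = DEFAULT_MAX if nominal is None else max(DEFAULT_MAX, nominal)
    if nominal is None:
        # Default 50, lowered to max when max is defined below 50.
        nominal = min(DEFAULT_NOMINAL, max_tokens)
    if nominal > max_tokens:
        raise ValueError("nominal must be less than or equal to max")
    throttle = attrs.get("throttle", DEFAULT_THROTTLE_MS)
    return {"max": max_tokens, "nominal": nominal, "throttle": throttle}


def parse_proxyd_policies(value):
    """Parse the categories that appear in a PROXYD_POLICIES value."""
    policies = {}
    for category, body in re.findall(r"(QUERY|SUBMISSION|OTHER)\[([^\]]*)\]", value):
        attrs = {k: int(v) for k, v in re.findall(r"(max|nominal|throttle)=(\d+)", body)}
        policies[category] = resolve_policy(attrs)
    return policies


def per_daemon_share(policy, daemon_count):
    """Split max and nominal among the running lsfproxyd daemons.

    Assumption: integer division; remainder handling is not documented.
    """
    return {
        "max": policy["max"] // daemon_count,
        "nominal": policy["nominal"] // daemon_count,
    }


policies = parse_proxyd_policies("QUERY[ max=30 ] SUBMISSION[ nominal=150 throttle=5 ]")
print(policies)
print(per_daemon_share(policies["SUBMISSION"], 2))
```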

 

 

Readme file for: IBM® Spectrum LSF

Product or component release: 10.1

Update name: Fix 601504

Fix ID: LSF-10.1-build601504

Publication date: 26 March 2023


Contents

1. List of fixes

2. Download location

3. Product or components affected

4. System requirements

5. Installation and configuration

6. List of files

7. Product notifications

8. Copyright and trademark information

 

1. List of fixes

P104844

 

2. Download location

Download fix 601504 from the following location: https://www.ibm.com/support/fixcentral

 

3. Product or components affected

Affected product or components in this fix reflect the list of enhancements previously described:

lim 
lsfproxyd 
mbatchd 
mbschd 
nios 
res 
sbatchd
ebrokerd 
bacct 
badmin 
bapp 
battach 
battr 
bbot 
bchkpnt 
bclusters 
bconf 
bctrld 
bdc 
bentags 
bgadd 
bgdel 
bgmod 
bgpinfo 
bhist 
bhosts 
bhpart 
bimages 
bjdepinfo 
bjgroup 
bjobs 
bkill 
blimits 
bmg 
bmgroup 
bmig 
bmod 
bparams 
bpeek 
bpost 
bqc 
bqueues 
bread 
breboot 
breconfig 
brequeue 
bresize 
bresources 
brestart 
bresume 
brsvadd 
brsvdel 
brsvjob 
brsvmod 
brsvs 
brsvsub 
brun 
bsla 
bslots 
bstatus 
bstop 
bsub 
bswitch 
btop 
bugroup 
busers 
bwait
lshosts
lsload
liblsf.a
liblsf.so
libbat.a 
libbat.so
liblsbstream.a  
liblsbstream.so
lsbatch.h
lsf.h

 

 

4. System requirements

linux2.6-glibc2.3-x86_64

linux3.10-glibc2.17-x86_64

 

5. Installation and configuration


Before you install

LSF_TOP is the full path to the top-level installation directory of LSF.

1.      You must have LSF 10.1 Fix Pack 12 or later installed before you install this fix. Download this fix pack from IBM Fix Central (https://www.ibm.com/support/fixcentral) and search for build600488 (Fix Pack 12) or build601088 (Fix Pack 13).

2.      Starting in LSF 10.1 Fix Pack 13, the default values of the following three GPU parameters are changed to:
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSF_GPU_RESOURCE_IGNORE=Y

If you have Fix Pack 13 installed, and these three GPU parameters are not configured in the lsf.conf configuration file, they will take the default values, and the parameters already configured in the lsf.conf file will not be affected.

If you want to keep the former GPU behavior, and if any of the three parameters are missing in the lsf.conf configuration file, you must explicitly configure the following default settings that are defined in Fix Pack 12 or earlier:
LSF_GPU_AUTOCONFIG=N
LSB_GPU_NEW_SYNTAX=N
LSF_GPU_RESOURCE_IGNORE=N

3.      Log on to the LSF management host as the LSF primary administrator.

4.      Set the LSF cluster environment:
- For csh or tcsh: % source LSF_TOP/conf/cshrc.lsf
- For sh, ksh, or bash: $ . LSF_TOP/conf/profile.lsf

5.      Run badmin hclose all

6.      Run badmin qinact all

Installation steps

1.      Log on to the LSF management host as root and set the LSF cluster environment.

2.      Go to the installation directory for the fix: cd $LSF_ENVDIR/../10.1/install/

3.      Copy the fix file to the installation directory $LSF_ENVDIR/../10.1/install/

4.      Run the patchinstall command: ./patchinstall <fix>


After you install

1.      Log on to the LSF management host as the LSF primary cluster administrator and set the LSF cluster environment.

2.      Run lsadmin limrestart all

3.      Run lsadmin resrestart all

4.      Run badmin hrestart all

5.      Run badmin mbdrestart

6.      Run badmin hopen all

7.      Run badmin qact all


Uninstallation

1.      Log on to the LSF management host as the LSF primary cluster administrator and set the LSF cluster environment.

2.      Run badmin hclose all

3.      Run badmin qinact all

4.      Log on to the LSF management host as root and set the LSF cluster environment.

5.      Go to install directory of the fix: cd $LSF_ENVDIR/../10.1/install/

6.      Run the patchinstall command: ./patchinstall -r <fix>

7.      Log on to the LSF management host as the LSF primary cluster administrator and set the LSF cluster environment.

8.      Run lsadmin limrestart all

9.      Run lsadmin resrestart all

10.   Run badmin hrestart all

11.   Run badmin mbdrestart

12.   Run badmin hopen all

13.   Run badmin qact all

 

6. List of files

lim 
lsfproxyd 
mbatchd 
mbschd 
nios 
res 
sbatchd
ebrokerd 
bacct 
badmin 
bapp 
battach 
battr 
bbot 
bchkpnt 
bclusters 
bconf 
bctrld 
bdc 
bentags 
bgadd 
bgdel 
bgmod 
bgpinfo 
bhist 
bhosts 
bhpart 
bimages 
bjdepinfo 
bjgroup 
bjobs 
bkill 
blimits 
bmg 
bmgroup 
bmig 
bmod 
bparams 
bpeek 
bpost 
bqc 
bqueues 
bread 
breboot 
breconfig 
brequeue 
bresize 
bresources 
brestart 
bresume 
brsvadd 
brsvdel 
brsvjob 
brsvmod 
brsvs 
brsvsub 
brun 
bsla 
bslots 
bstatus 
bstop 
bsub 
bswitch 
btop 
bugroup 
busers 
bwait
lshosts
lsload
liblsf.a
liblsf.so
libbat.a 
libbat.so
liblsbstream.a  
liblsbstream.so
lsbatch.h
lsf.h

 

7. Product notifications

To receive information about product solution and fix updates automatically, subscribe to product notifications on the My notifications page (www.ibm.com/support/mynotifications) on the IBM Support website (support.ibm.com). You can edit your subscription settings to choose the types of information that you want to be notified about, for example, security bulletins, fixes, troubleshooting, and product enhancements or documentation changes.

8. Copyright and trademark information

©Copyright IBM Corporation 2023

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM®, the IBM logo, and ibm.com® are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.