- LoadLeveler 5.1.0.x for AIX 7
For LL 5.1.0:
- If the scheduler and resource manager components on the same machine are not at the same level, the daemons will not start up.
- Please refer to the "Known Limitations" section under the fix pack README for more limitation information for this release.
Problems fixed in LoadLeveler 5.1.0.8 [November 1, 2012]
- LoadLeveler now takes the expected amount of time to start the alternate central manager daemon.
- LoadLeveler now shows the correct value of ConsumableCpus when a machine group is configured.
- The central manager log no longer shows an unnecessary message when a job is queried with the "llq -x" command and a machine group is defined.
- LoadLeveler job query commands will now return the correct "Step Cpus" value for the running job that requires ConsumableCpus in the node_resources keyword.
- The llqres command will now display the bg_block information for a Blue Gene reservation.
- LoadLeveler will not core dump when the user issues the command "llstatus -L machine".
- Resource Manager only:
- The handling of hierarchical communication errors is restored to the prior release behavior. Such failures will cause jobs to be "rejected", allowing the jobs to be controlled by the MAX_JOB_REJECTS and MAX_JOB_ACTION configurables.
- The Region Manager has been modified to ignore all adapters on the same subnet as the adapter that was filtered out with adapter_list. Instead of the Region Manager marking those adapters down, those adapters will remain in an HB_UNKNOWN state.
- The LOADL_HOSTFILE environment variable will be set in the environment of the job prolog and the user environment prolog.
- Obsolete code that attempted to terminate leftover job processes is removed.
- Scheduler only:
- LoadLeveler scheduler will ignore any floating resource requirement with a 0 value.
- The accounting record which has a negative wall clock value is now skipped by the llsummary command.
- The problem of the central manager hanging soon after being restarted has been fixed.
- The central manager will not crash with SIGABRT when removing a step.
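The llsummary fix in this list (skipping accounting records that have a negative wall clock value) can be illustrated with a minimal Python sketch. This is not LoadLeveler code: the AccountingRecord type, its field names, and both helper functions are hypothetical and exist only to show the filtering idea.

```python
from dataclasses import dataclass


@dataclass
class AccountingRecord:
    # Hypothetical record layout, not the LoadLeveler history file format.
    job_id: str
    wall_clock_seconds: float


def usable_records(records):
    """Return only the records whose wall clock value is not negative."""
    return [r for r in records if r.wall_clock_seconds >= 0]


def total_wall_clock(records):
    """Sum wall clock time, skipping records with a negative value."""
    return sum(r.wall_clock_seconds for r in usable_records(records))
```

Skipping the bad record, rather than aborting or clamping it to zero, keeps the remaining totals usable without hiding that the record was invalid.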
Problems fixed in LoadLeveler 5.1.0.7 [July 20, 2012]
- The problem that prevented the schedd from determining the protocol versions for the nodes allocated to a job step it is trying to dispatch has been fixed.
- The region manager failover and recovery code is changed to ensure that the resource manager is notified when a region manager becomes active, which makes all active nodes and adapters available for scheduling.
- The resource manager daemon will no longer crash when LoadLeveler starts up with D_FULLDEBUG set for RESOURCE_MGR_DEBUG in the LoadL_config file.
- Scheduler only:
- The LoadLeveler negotiator daemon will not core dump if an llmovespool command is run when there is a multistep job which has some steps completed and others still running.
- LoadLeveler will no longer display the misleading message about the image_size check in the "llq -s" output and in the Negotiator log when a reason that the machine could not be used has already been found.
Problems fixed in LoadLeveler 5.1.0.6 [June 28, 2012]
- The problem that the CAU value was not always allocated correctly for all the nodes on which the job runs is now fixed.
- Resource Manager only:
- If the cluster does not have ethernet configured, then the job will stay in the "ST" state and not run. The default network support will now use the adapter associated with the hostname with which the machine is configured in the administration file. If there is no network statement in the job command file, then the default network is used, which assumes ethernet.
- The llstatus command shows the startds to be up even though the llrstatus command shows that the startd and the region manager they report to are actually down. The central manager will now be notified by the resource manager when the startd is marked as down by the resource manager, so the llstatus command will now show the same state as the llrstatus command.
- Fixed memory leaks in the startd and schedd daemons.
- Scheduler only:
- A reservation that reserves a 0 count of some floating resource can lead to a central manager core dump. The list of reserved resources was not being updated properly when the reservation requesting a 0 count ended, leading to the core dump. That reservation list is now being updated correctly in all cases.
- The central manager will core dump when a job is bound to a reservation with floating resources. An internal vector is fixed so that the central manager daemon will not core dump when a job is bound to a reservation with floating resources.
Problems fixed in LoadLeveler 5.1.0.5 [April 5, 2012]
- Resource Manager only:
- A problem in pe_rm_connect() that caused read() to be called on a socket that was not ready to be read has been corrected, allowing pe_rm_connect() to continue to retry the connection for the specified rm_timeout amount of time.
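The general pattern behind the pe_rm_connect() fix (check that a socket is readable before reading, and keep retrying until a timeout expires) can be sketched in Python. This is an illustration only and is not the pe_rm_connect() implementation: the function name read_with_retry and its parameters are hypothetical, with timeout_seconds standing in for the rm_timeout value mentioned in this section.

```python
import select
import socket
import time


def read_with_retry(sock, timeout_seconds, poll_seconds=0.1):
    """Read from sock only once it is readable, retrying until the timeout.

    Returns the bytes received, or None if the socket never became
    readable within timeout_seconds.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return None  # timeout expired without the socket becoming ready
        # Wait for readability instead of calling recv() on an unready socket.
        readable, _, _ = select.select([sock], [], [], min(poll_seconds, remaining))
        if readable:
            return sock.recv(4096)
```

A caller could exercise the sketch with socket.socketpair(): reading from one end returns None until the other end sends data.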
Problems fixed in LoadLeveler 5.1.0.4 *Superseded by LoadLeveler 5.1.0.5*
- LoadLeveler can now display the host name correctly based on the name_server configuration. The previous limitation of the name_server keyword being ignored is now lifted.
- The LoadLeveler negotiator and resource manager now update the number of ConsumableCpus resources correctly based on the actual number of cpus reported from the compute node, so that jobs will not be prevented from being scheduled due to a ConsumableCpus count of zero.
- Rsets are assigned to job processes only if rsets or affinity is requested, so that the bindprocess function will not fail for jobs that do not request affinity or rsets.
- Fixed a potential negotiator daemon hang caused by a deadlock, in which the user might see the llstatus command return the 2512-301 message.
- Resource Manager only:
- Locking is added to the LoadLeveler schedd daemon to serialize threads receiving multi-cluster jobs from threads processing llq -x requests to prevent the daemon from core dumping.
- The LoadLeveler schedd daemon will now write the host smt status to the accounting history file before the job gets terminated so that all the host smt status will be shown in the llsummary -l output.
- The LoadLeveler method for reporting job step status has been corrected to report R state, even for parallel jobs which do not invoke an mpi run time manager (e.g. poe). Otherwise the job step will be shown as stuck in ST state even though it is actually running.
- LoadLeveler was modified to set the correct userid to prevent checkpoint files from being deleted and to ensure that the correct checkpoint file is read by the starter.
- Locking is added to the LoadLeveler schedd daemon to serialize threads receiving multi-cluster jobs from threads processing llq -x requests.
Problems fixed in LoadLeveler 5.1.0.3 [January 25, 2012]
- Enhanced llconfig to add command line support for stanzas.
- Enabled support for checkpoint/restart with APARs IV11747 and IV11748. For more information, see the PPE.RTE APAR IV11749 documentation.
- The environment variable LOADL_JOB_TRACE can be used to trace the job life cycle for a job step.
- LoadLeveler will ignore the machine_list keyword if the syntax is not defined correctly.
- Fixed a region manager crash when trying to remove invalid MCM entries.
- LoadLeveler has been changed to prevent unnecessary logging of multi-cluster messages to the Schedd log.
- The llstatus -L machine command can now report the correct number of cpus for active compute nodes once the machine information has been correctly forwarded to the daemons.
- Recovery from network table load failures has been restored to LoadLeveler. LoadLeveler will attempt to clean adapter windows when it fails to load a network table. If the clean operations succeed, LoadLeveler will attempt a second network table load.
- LoadLeveler will fence off failed adapters so that the scheduler will get the correct number of usable adapters for the node.
- Resource Manager only:
- The LOADL_HOSTFILE environment variable is now set in the environment for the user prolog when the job type is set to mpich.
- Scheduler only:
- The step count limitation is now calculated correctly for each of the user's classes when a job step is modified by the llmodify command.
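The network table recovery sequence described in this list (try the load, clean the adapter windows if it fails, and attempt a second load only if the clean succeeds) can be sketched as follows. This is a minimal illustration, not LoadLeveler code: load_with_recovery and the two callables it takes are hypothetical stand-ins for the real load and clean operations.

```python
def load_with_recovery(load_table, clean_windows):
    """Attempt a network table load with one clean-and-retry recovery step.

    load_table and clean_windows are hypothetical callables that return
    True on success and False on failure.
    """
    if load_table():
        return True  # first load succeeded, no recovery needed
    if not clean_windows():
        return False  # cleaning failed, so a second load is not attempted
    return load_table()  # second and final load attempt
```

Limiting recovery to a single retry keeps the behavior bounded: a node whose windows cannot be cleaned fails quickly instead of looping.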
Problems fixed in LoadLeveler 5.1.0.2 [December 6, 2011]
- Collective Acceleration Unit (CAU) groups are now supported.
- Fixed the llstatus -l command to show the correct status for tasks running on the startd column.
- Fixed the resource manager API to include the LLR_StepGetStepResourceRequirementList specification for getting Step Resource Requirement values.
- LoadLeveler will now select and hold cpus that are already in use for top dog usage; therefore, other jobs can now run with cpus that are currently available.
- The llsubmit command will fail if the smt and rset keywords are used together.
- The processing of the preempt_class configuration keywords has been fixed so that changes will take effect after the llctl reconfig command is issued.
- The Negotiator has been changed so that it no longer depends on processor core 0 having CPUs configured. The Negotiator will no longer core dump if it encounters such a configuration.
- The memory error in the LoadLeveler String library is corrected to prevent daemon crashes if the function is used.
- The LoadLeveler commands will not generate the 2512-030 error message when there is no /etc/LoadL.cfg file on the system.
- The informational message written to stderr when llsetpenv cannot chdir to the user's home directory will be removed so POE will not hang after reading the unexpected data from the stderr socket.
- Fixed the llconfig -i failure that occurred if more than 80 characters are set for the include_users or the exclude_users field in the database.
- Fixed lldbupdate failing with the 2544-010 error message under database configuration if there are missing xCAT tables.
- When using database configuration, fixed potential core dump in LoadLeveler if the connection pool exceeds 10 connections.
- LoadLeveler is now able to support ETHoIB using the bond0 interface mapped to an IB User Space device on Linux systems if the fileset rsct.lapi.rte APAR IV06393 is also applied.
- In a multi-cluster environment with database configuration, LoadLeveler fails to start with a 2544-004 Fetching TLL_Cluster was not successful message. LoadLeveler will now check for the clusterID instead of default_cluster since multi-cluster does not have a default_cluster.
- Fixed a hang when executing llq -x -l on a job with the step_resources keyword on the lower release in a multi-cluster environment.
- Resource Manager only:
- The LoadL_startd daemon may leave behind job status and job usage files in the execute directory after the job step has terminated. The files were left behind because in some situations the Startd was not using the correct effective user ID when trying to remove the files. The Startd has been fixed to ensure that the correct effective user ID is used when cleaning up job status and job usage files in the execute directory during job termination.
- A job step might become stuck in pending state after it encountered a failure to load the network table on a number of nodes due to the timing of events where the startd lost at least one hierarchical status update. The solution is to process any pending status at the time the job step is removed from the starter table.
- The startd daemon abort is now prevented by correcting the startd daemon locking when processing files in the execute directory during startup.
Problems fixed in LoadLeveler 5.1.0.1 [August 26, 2011]
- LoadLeveler will not submit the job if there is no class in the default class list that can satisfy the job requirements.
- The unthread_open() error in the Schedd Log will no longer be printed when querying the remote cluster job since LoadLeveler will no longer try to route a nonexistent remote submit job command file in a multi cluster environment.
- LoadLeveler can now handle jobs from users who belong to more than 64 system groups.
- Fixed a LoadL_schedd core dump during the spool file compression function by verifying that the data read from the spool is valid before attempting to write to the new spool file.
- Fixed a LoadL_startd core dump caused by a memory leak.
- Resource Manager only:
- Fixed a resource manager crash that occurred if a machine group is configured and the scheduler fileset is not installed.
- LoadLeveler daemons will now be able to find out where the alternate Resource Manager is running when they are started after the failover has taken place.
- Scheduler only:
- LoadLeveler "llq -s" command will provide information about why a step is in Deferred state.
- The llsummary command and API will no longer core dump if the number of history files is greater than or equal to the PTHREAD_DATAKEYS_MAX constant value.
Copyright and trademark information
http://www.ibm.com/legal/copytrade.shtml
Notices
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE. Some jurisdictions do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this
statement may not apply to you.
This information could include technical inaccuracies or
typographical errors. Changes are periodically made to the
information herein; these changes will be incorporated in new
editions of the publication. IBM may make improvements and/or
changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Microsoft, Windows, and Windows Server are trademarks of Microsoft
Corporation in the United States, other countries, or both.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino,
Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and
Pentium are trademarks or registered trademarks of Intel Corporation or its
subsidiaries in the United States and other countries.
Other company, product, or service names may be trademarks or
service marks of others.
THIRD-PARTY LICENSE TERMS AND CONDITIONS, NOTICES AND INFORMATION
The license agreement for this product refers you to this file for
details concerning terms and conditions applicable to third party
software code included in this product, and for certain notices
and other information IBM must provide to you under its license
to certain software code. The relevant terms and conditions,
notices and other information are provided or referenced below.
Please note that any non-English version of the licenses below is
unofficial and is provided to you for your convenience only. The
English version of the licenses below, provided as part of the
English version of this file, is the official version.
Notwithstanding the terms and conditions of any other agreement
you may have with IBM or any of its related or affiliated entities
(collectively "IBM"), the third party software code identified
below are "Excluded Components" and are subject to the following
terms and conditions:
- the Excluded Components are provided on an "AS IS" basis
- IBM DISCLAIMS ANY AND ALL EXPRESS AND IMPLIED WARRANTIES AND CONDITIONS WITH RESPECT TO THE EXCLUDED COMPONENTS, INCLUDING, BUT NOT LIMITED TO, THE WARRANTY OF NON-INFRINGEMENT OR INTERFERENCE AND THE IMPLIED WARRANTIES AND CONDITIONS OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
- IBM will not be liable to you or indemnify you for any claims related to the Excluded Components
- IBM will not be liable for any direct, indirect, incidental, special, exemplary, punitive or consequential damages with respect to the Excluded Components.