- LoadLeveler 3.5.1.x for AIX 6
- LoadLeveler 3.5.1.x for AIX 5
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 11 (SLES11) on POWER servers
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 10 (SLES10) on POWER servers
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 9 (SLES9) on POWER servers
- LoadLeveler 3.5.1.x for Red Hat Enterprise Linux 5 (RHEL5) on POWER servers
- LoadLeveler 3.5.1.x for Red Hat Enterprise Linux 4 (RHEL4) on POWER servers
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 11 (SLES11) on Intel based servers
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 10 (SLES10) on Intel based servers
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 9 (SLES9) on Intel based servers
- LoadLeveler 3.5.1.x for Red Hat Enterprise Linux 5 (RHEL5) on Intel based servers
- LoadLeveler 3.5.1.x for Red Hat Enterprise Linux 4 (RHEL4) on Intel based servers
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 11 (SLES11) on servers with 64-bit Opteron or EM64T processors
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 10 (SLES10) on servers with 64-bit Opteron or EM64T processors
- LoadLeveler 3.5.1.x for SUSE LINUX Enterprise Server 9 (SLES9) on servers with 64-bit Opteron or EM64T processors
- LoadLeveler 3.5.1.x for Red Hat Enterprise Linux 5 (RHEL5) on servers with 64-bit Opteron or EM64T processors
- LoadLeveler 3.5.1.x for Red Hat Enterprise Linux 4 (RHEL4) on servers with 64-bit Opteron or EM64T processors
A coexistence problem that cannot be corrected was introduced in TWS LoadLeveler 3.5.0.5 and TWS LoadLeveler 3.5.1.1. The entire cluster must be migrated to either TWS LoadLeveler 3.5.0.5 or TWS LoadLeveler 3.5.1.1 at the same time.
- TWS LoadLeveler 3.5 does not support checkpointing for data staging jobs.
Problems fixed in LoadLeveler 3.5.1.19 [October 19, 2012]
- LoadLeveler now sets the correct user ID so that checkpoint files are not deleted and the starter reads the correct checkpoint file.
- Fixed a Schedd daemon crash that occurred when fairshare data was accessed while the job was terminating.
- The performance of the central manager has been optimized.
- The central manager will not schedule other steps of a co-scheduled job if one of them fails to schedule.
- Fixed a central manager crash with a SIGABRT signal that occurred when a step was removed.
Problems fixed in LoadLeveler 3.5.1.18 [July 12, 2012]
- LoadLeveler will no longer display the misleading message about the image_size check in the "llq -s" command output and in the Negotiator log when a reason that a machine could not be used has already been found.
- LoadLeveler will now assign the correct number of cpus for blocking steps requesting cpu affinity.
- The LoadLeveler negotiator daemon will not core dump if an llmovespool command is run when there is a multistep job which has some steps completed and others still running.
Problems fixed in LoadLeveler 3.5.1.17 [May 3, 2012]
- Under some rare conditions, the LoadL_schedd daemon can core dump when a job is rejected multiple times. The core dump was the result of an array index not being reset properly upon a 2nd dispatch of the same job step. This problem has been corrected by setting that array index back to -1 when a job step is redispatched.
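
The kind of fault described above can be illustrated with a minimal Python sketch. This is not LoadLeveler source code: the class `JobStep`, its fields, and the machine names used with it are hypothetical. It only shows why an index left over from a first dispatch can point past the end of the data built for a second dispatch, and how resetting it to -1 on redispatch avoids that.

```python
class JobStep:
    """Hypothetical job step that walks through its assigned machines."""

    def __init__(self):
        self.machines = []
        self.current = -1  # -1 means "no machine selected yet"

    def dispatch(self, machines, reset_index=True):
        """(Re)dispatch the step onto a new list of machines."""
        self.machines = list(machines)
        if reset_index:
            # The corrected behavior: forget the position from the
            # previous dispatch before using the new list.
            self.current = -1

    def next_machine(self):
        """Advance to the next machine; fails if the index is stale."""
        self.current += 1
        return self.machines[self.current]
```

With `reset_index=False`, a step that consumed three machines and is then redispatched onto a single machine raises `IndexError` on the next call to `next_machine()`. With the reset in place, the same call returns the first machine of the new list.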
Problems fixed in LoadLeveler 3.5.1.16 [February 23, 2012]
- The step count limitation is now calculated correctly for each of the user's classes if the step was modified by the llmodify command.
Problems fixed in LoadLeveler 3.5.1.15 [December 15, 2011]
- The LOADL_HOSTFILE environment variable is now set in the environment for the user prolog when the job type is set to mpich.
- LoadLeveler now prevents a potential negotiator core dump caused by a race condition when querying a terminating job.
- A startd daemon abort is now prevented by correcting the daemon's locking when processing files in the execute directory during startup.
Problems fixed in LoadLeveler 3.5.1.14 [October 13, 2011]
- The llsubmit command will fail if the smt and rset keywords are used together.
- LoadLeveler can now handle jobs from users who belong to more than 64 system groups. The jobs submitted will not be rejected during launching on the compute node.
- The Negotiator has been changed so that it no longer depends on processor core 0 having CPUs configured. The Negotiator will no longer core dump if it encounters such a configuration.
Problems fixed in LoadLeveler 3.5.1.13 [August 4, 2011]
- LoadLeveler will not submit the job if there is no class in the default class list that can satisfy the job requirements.
- The "llq -s" command will now provide information about why a step is in Deferred state.
- The unthread_open() error in the Schedd Log will no longer be printed when querying the remote cluster job since LoadLeveler will no longer try to route a nonexistent remote submit job command file in a multi cluster environment.
- LoadLeveler will pick up the default values for undefined keywords (for example, job_user_prolog) after a reconfig.
Problems fixed in LoadLeveler 3.5.1.12 [June 17, 2011]
- The llctl command will now check to make sure the Schedd daemon's port is available to be used before starting up LoadLeveler.
- A new field, Eligibility Time, is added to the llq and llsummary long listing output which records the last time the job became eligible for dispatch.
- LoadLeveler now creates cpuset files with permissions that are searchable by non-root users under the /dev/cpuset directory.
- The llmkres command should now be able to create the reservations consistently without hitting the timing error message 2512-856.
- The LoadL_negotiator daemon will no longer core dump when processing a multi-step job which contains a long dependency statement.
- A unique security issue has been identified for TWS LoadLeveler Web User Interface that could potentially compromise your system. It is recommended that you apply this update to protect your system.
Problems fixed in LoadLeveler 3.5.1.11 [April 7, 2011]
- Modifications to a recurring reservation's attributes are now reflected in the first occurrence's attribute values shown by the llqres -l command.
- On Linux/P nodes, a job requesting memory affinity with MCM_MEM_NONE would always consume memory from the local MCM and would start paging once memory on the local MCM was overconsumed, even though memory was available on other MCMs on the node. Now, if a job is submitted with the memory affinity option MCM_MEM_NONE, the task is bound to all the MCMs on the node and memory is consumed from all the MCMs on the node.
- Dependent steps are not given a new qdate when they are put onto the idle queue, while steps at the maxidle limit for a given user within a class are given a new qdate and a new sysprio based on that qdate. A change was made so that dependent steps are also counted as "queued" steps for the purposes of enforcing maxqueued and maxidle limits, and so a dependent step which is at the maxidle limit will get a new qdate.
Problems fixed in LoadLeveler 3.5.1.10 [February 10, 2011]
- LoadLeveler affinity RSET_MCM_AFFINITY did not work on AIX because the vmo command output had changed. The vmo handler code has been enhanced so that RSET_MCM_AFFINITY can now be enabled.
- The llsummary command might crash if the default class requirement value doesn't match the job requirement value. Fixed the llsummary command to select the correct requirement value from the default class list if there is no job class specified in job command file.
- The llsummary command will fail when it tries to access invalid data memory in the job history file. Fixed the llsummary command to be able to ignore the bad data areas and just report the valid data in the job history file.
- If the class-user sub-stanzas in the "default" class stanza are not defined in alphabetical order, the class-user sub-stanzas might incorrectly inherit the wrong values from the default class. LoadLeveler will now inherit the default values for the class-user sub-stanzas from the "default" class correctly.
- LoadLeveler did not set the environment variable LOADL_JOB_STEP_EXIT_CODE when executing the user epilog script. LoadLeveler will now set the right environment variables when executing the epilog script.
- LoadLeveler schedd may ignore jobs if the job queue contains invalid job keys. The schedd daemon will now collect the correct job data when scanning the job queue files.
- The LoadLeveler command, llmodify, had a limitation where the startdate and wall_clock_limit job attributes could not be modified for idle jobs. llmodify is now enhanced to be able to modify the startdate and wall_clock_limit job attributes for idle jobs.
New documentation:
- In the LoadLeveler Command and API Reference, SC23-6701-00, under Chapter 1. Commands, llmodify - Change attributes of a submitted job step:
  - New keyword wall_clock_limit for the -k option: Changes the wall clock limit of a job step. The value of the specified wall clock limit must be longer than the value of the current wall clock limit. This is a LoadLeveler administrator only option.
  - New keyword startdate for the -k option: Changes the start time of an idle-like job step. This is a LoadLeveler administrator only option.
Problems fixed in LoadLeveler 3.5.1.9 [December 16, 2010]
- Fixed the llclass command to show the correct value for the "Free Slots" field when LoadLeveler is configured to use the LL_DEFAULT scheduler.
- Fixed the llchres command to check requested node additions to make sure those nodes have no jobs running on them and are not already assigned to another reservation. If no idle nodes can be found, the llchres command will fail.
- Fixed the schedd daemon so it will not crash if the job's output file path contains the "%" character.
- Fixed LoadLeveler to correctly reserve the reservation's resources after the central manager daemon restarts so that jobs whose resources overlap with the reservations will not be allowed to start.
- Fixed the central manager to make sure pending status changes to the machines are properly locked so that jobs being scheduled to the down machines will no longer crash the central manager daemon.
- Fixed the llsummary -s or -e command to report all jobs that match the filter requirement. In the TWS LoadLeveler Command and API Reference and the llsummary.l manual page, the descriptions of the -s and -e options now state that the accounting data report contains information about every job that contains at least one step that falls within the specified range.
- Fixed the job launch program so that it does not need to verify the group name so jobs will be executed using the submitting GID number.
- Fixed the negotiator crash by correcting the argument used to format the message which describes why the step cannot be scheduled.
Problems fixed in LoadLeveler 3.5.1.8 [October 8, 2010]
- Fixed llsummary command to display the correct job id for jobs which have been moved from one schedd to another using the llmovespool command.
- Fixed the startd daemon to ignore the completion job command state if the job step was already terminated to prevent jobs from being stuck in the job queue.
- Fixed jobs so that they can run on partitions after the exclude_bg keyword has been removed from the partition's default class configuration.
- Fixed LoadLeveler to do retries on the getpwnam() API so the correct passwd and group information will be retrieved if there are network issues instead of returning a "NOT FOUND" error.
- Fixed a central manager deadlock and core dump by removing the completed step from the user and group class queues before the dependent steps get requeued.
- Fixed a LoadLeveler crash by calling thread-safe dirname() and filename() APIs during multithreaded execution.
- Fixed LoadLeveler to accept jobs with environment variables up to 100KB.
- Fixed the job step's completion code to return the wait3 UNIX system call status when the job is cancelled.
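
One of the fixes above makes LoadLeveler retry the getpwnam() lookup when a transient network problem causes a "NOT FOUND" result. The general retry pattern can be sketched in Python as follows. This is an illustration only, not LoadLeveler's implementation: the function names, the number of attempts, and the delay are all assumptions. Python's standard `pwd.getpwnam` raises `KeyError` when an entry is not found, which is the condition retried here.

```python
import time


def lookup_with_retry(lookup, name, attempts=3, delay=0.0):
    """Call lookup(name), retrying on KeyError up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return lookup(name)
        except KeyError as err:
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay)  # give the name service time to recover
    raise last_error


def make_flaky_lookup(failures, result):
    """Stand-in lookup that fails `failures` times, then returns `result`."""
    state = {"left": failures}

    def lookup(name):
        if state["left"] > 0:
            state["left"] -= 1
            raise KeyError(name)
        return result

    return lookup
```

On a UNIX system the real lookup could be passed in directly, for example `lookup_with_retry(pwd.getpwnam, "root", delay=1.0)` after `import pwd`. A user that truly does not exist still ends in `KeyError` once the attempts are exhausted.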
Problems fixed in LoadLeveler 3.5.1.7 [July 20, 2010]
- Resources will be held correctly if two reservations in the cluster were reserving the same resources with the second reservation's start time corresponding to the first reservation's end time.
- Fixed a llsummary command crash that occurred when the history file was being modified at the same time the command was trying to read it.
- Fixed the llsummary command to handle small data fragments in the history file so job steps will now be displayed correctly.
- Fixed a llacctmrg command crash that occurred if the global history file was greater than 2 GB.
- Fixed llsummary and llacctmrg commands to be able to access history files greater than 2GB.
- Fixed a central manager crash by locking the job step so different threads cannot operate on it concurrently.
- Fixed the llqres command so that it will now work in a mixed 32 bit and 64 bit cluster environment without seeing the 2512-301 error message.
- Fixed the user prolog environment variables to be passed to the user epilog.
- Fixed LoadLeveler to prevent a duplicate job ID error by trying other remote inbound schedds for remote job submission if the network connection to the inbound schedd is not stable.
- New mechanisms are implemented to reaccess file handlers during file system failures so that LoadLeveler can recover to a working state. A new timer enables the schedd to come up automatically once file access is available again, and the schedd is set to drain state if its file handlers cannot be recovered.
Problems fixed in LoadLeveler 3.5.1.6 [May 20, 2010]
- Fixed the evaluation of the consumable cpus calculation for jobs which dynamically turn smt on and off so jobs will be scheduled properly on Power5 or Power6 systems.
- Fixed the negotiator core dump when the "START" expression was not configured for a machine.
- Changed the design so that dependent steps get a new qdate when they are put onto the idle queue due to enforcement of the maxqueued and maxidle limits.
Problems fixed in LoadLeveler 3.5.1.5 [March 22, 2010]
- Fixed LoadLeveler to be able to honor the task order in the task_geometry keyword when assigning cpus to task ids.
- Fixed llstatus to display the correct configuration expressions for all expression keywords.
- Fixed the dispatch cycle of routed jobs so when the central manager failover takes place, the preempted jobs will now be able to run.
- Fixed LoadLeveler to set the environment variable from the prolog output if each line contains at most 65534 characters. All lines containing more than 65534 characters will be ignored.
- Fixed LoadLeveler jobs to start correctly in the login shell and know when to run under a login shell so the pmd will not hang during execution.
- Fixed the reservation debug message field so the central manager will not core dump.
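
The line-length rule for prolog output mentioned above (lines of at most 65534 characters are used, longer lines are ignored) can be sketched in Python. The limit comes from the fix description; everything else is an assumption made for illustration, in particular the function name and the simple `NAME=value` line format, which may differ from the format LoadLeveler actually expects.

```python
MAX_LINE_CHARS = 65534  # limit stated in the fix description


def env_from_prolog_output(text, max_chars=MAX_LINE_CHARS):
    """Collect NAME=value pairs, skipping lines longer than max_chars."""
    env = {}
    for line in text.splitlines():
        if len(line) > max_chars:
            continue  # overlong lines are ignored rather than truncated
        name, sep, value = line.partition("=")
        if sep and name:  # keep only lines that look like NAME=value
            env[name] = value
    return env
```

A line of exactly 65534 characters is still accepted, while a line of 65535 characters is dropped without affecting the lines around it.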
Problems fixed in LoadLeveler 3.5.1.4 [January 18, 2010]
- Fix a LoadLeveler crash when started in drain mode.
- Fix a LoadL_negotiator daemon core dump by initializing an internal variable that was being used.
- Fix LoadLeveler jobs hanging in preempted pending state by correcting the machine state for the jobs being preempted.
- Fix a schedd daemon memory leak when processing reservations by removing the reservation element object after use.
- Fix a llsummary command segmentation fault by skipping over data that is not valid when processing the history file.
- Fix user ID names to allow a length of up to 256 so that jobs submitted using those IDs can now run.
- Fix llqres -l to output the correct days of the month under the Recurrence section if the month has fewer than 31 days.
- Fix LoadLeveler to send emails to the right administrator accounts when LoadLeveler detects errors.
- Fix LoadLeveler to execute the rescan function so jobs can now be scheduled once the running jobs are completed when using the default scheduler.
- Fix job submission so that submitted jobs are rejected when the user ID is not valid.
- Fix LoadLeveler to not send notification emails if the api process has already reported the errors.
- Fix LoadLeveler jobs to run with the correct gid on AIX platform.
- Fix LoadLeveler multi-step jobs to run with the correct umask value.
- Fix the negotiator daemon to ensure that resource counts are now being updated correctly when a step is canceled during the window of time after it has been scheduled but before the job start order has been dispatched.
Problems fixed in LoadLeveler 3.5.1.3 [November 2, 2009]
- Fix the job command file parsing error 2512-059 when the first non-blank line is neither a comment line nor a LoadLeveler keyword or if the first character of the first non-blank line is not a '#' sign.
- Fix the resource count for coscheduled job steps so that if a step is canceled after it has been scheduled and while it is waiting for preemption to take place, the resource counts will now be updated correctly for future dispatching cycles.
- Fix LoadLeveler performance by reducing the overhead of handling llq query requests so that the impact to the overall scheduling progress is also reduced.
- Fix documentation on why using different flags for llq will generate different outputs for the same job.
Problems fixed in LoadLeveler 3.5.1.2 [August 19, 2009]
- Fix the LoadL_schedd SIGSEGV termination while many jobs are submitted by correcting reference counting on the data area while threads are still referencing it.
- Fix LoadLeveler to use unsigned int64 variables instead of integer for file size calculations whenever transmitting files, including transmitting history files that are greater than 2 GB to the llacctmrg command.
- Fix the llqres output to show the correct month value under the "Recurrence" section.
- Fix the LoadL_startd increased memory size consumption by modifying LoadLeveler to dynamically load the libraries only once.
- Fix the schedd memory leak when performing an llctl reconfig while having parallel, user space jobs on the queue in running state by correcting the memory leaks in the adapter objects.
- Fix the job step staying in the complete state for a long period of time by changing the central manager job termination/cleanup processing.
- Fix LoadLeveler to have better performance when scheduling jobs, especially in a cluster that has a huge number of nodes with similar resources on each node.
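
The file size fix above (unsigned int64 instead of integer) can be illustrated with a short Python sketch. It assumes the original integer was a 32-bit signed type, which these notes do not state explicitly; the helper names are hypothetical. `ctypes` integer types perform no overflow checking, so they reproduce the wraparound a fixed-width C variable would show.

```python
import ctypes


def as_int32(size):
    """Value a 32-bit signed integer would hold after storing `size`."""
    return ctypes.c_int32(size).value


def as_uint64(size):
    """Value a 64-bit unsigned integer would hold after storing `size`."""
    return ctypes.c_uint64(size).value


THREE_GIB = 3 * 1024**3  # size in bytes of a 3 GiB history file
```

Here `as_int32(THREE_GIB)` wraps to a negative number, which is useless as a byte count, while `as_uint64(THREE_GIB)` returns the size unchanged. Any size up to 2**31 - 1 bytes survives the 32-bit conversion, which is consistent with the problem appearing only for files larger than 2 GB.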
Problems fixed in LoadLeveler 3.5.1.1 [May 18, 2009]
Notice: This is a mandatory service update to TWS LoadLeveler 3.5.1.0.
- Data staging options DSTG_NODE=MASTER and DSTG_NODE=ALL can now be used.
- Fix the accounting output of the llsummary command so that it does not contain duplicate entries for the same step after LoadLeveler restarts on a multistep job.
- Fix the child starter process to ensure it is started as root so that the process can set up the environment and credentials to run the job.
- Fix the negotiator handling of step dependencies so jobs that are supposed to run will run and those that should not will not.
- Linux: On Linux platforms with multiple CPUs, it is possible for the seteuid function to malfunction. When the LoadLeveler startd daemon encounters this failure, its effective user id may be set incorrectly, in which case it is possible for jobs to become stuck in ST state. A workaround to the glibc issue is provided in this service update.
Problems fixed in LoadLeveler 3.5.1.16 [February 23, 2012]
- Locking is added to the LoadLeveler schedd daemon to serialize threads receiving multi-cluster jobs with threads processing llq -x requests, which prevents the daemon from core dumping.
Problems fixed in LoadLeveler 3.5.1.11 [April 7, 2011]
- LoadLeveler is changed to protect the schedd from core dumping if the same cluster stanza is configured as local for more than one cluster in a scale-across multi-cluster environment.
Problems fixed in LoadLeveler 3.5.1.3 [November 2, 2009]
- Fix LoadL_schedd memory leak when running the llstatus -X command in a multi-cluster environment.
- Fix LoadLeveler so jobs can be submitted to the remote cluster in a mixed 3.5.X and 3.4.3.X multi-cluster environment.
Problems fixed in LoadLeveler 3.5.1.1 [May 18, 2009]
- Fix an llstatus -X core dump that occurred when there are adapters or MCMs on the nodes.
Problems fixed in LoadLeveler 3.5.1.17 [May 3, 2012]
- In a Blue Gene environment, if LoadLeveler is doing preemption and the preempting job affects exactly one running job with the same size, LoadLeveler will re-use the existing initialized partition, eliminating the need to boot a new partition of the same size.
Problems fixed in LoadLeveler 3.5.1.14 [October 13, 2011]
- The base partition state as well as the nodecard state will be checked before dispatching the job. If the nodecards that the job requires are available, the job will run even if not all nodecards are in a good state on the base partition.
Problems fixed in LoadLeveler 3.5.1.13 [August 4, 2011]
- When running jobs are preempted by new incoming jobs, the top-dog job will not lose its status as top-dog.
Problems fixed in LoadLeveler 3.5.1.11 [April 7, 2011]
- In a Blue Gene environment, the partition state will be checked before dispatching the job so that the job will not be scheduled onto a down partition.
- LoadLeveler can change the duration of an active Blue Gene partition created by a job command file on the BG/P system.
- In a Blue Gene environment a job was being scheduled to midplanes which had linkcards in ERROR state causing a failure when booting the partition. This caused jobs to be placed in HOLD. Now, jobs will not be scheduled to midplanes that have a linkcard error.
Problems fixed in LoadLeveler 3.5.1.9 [December 16, 2010]
- Fixed the llq -b command in a Blue Gene environment to not display invalid values as the partition state.
Problems fixed in LoadLeveler 3.5.1.6 [May 20, 2010]
- The duration of an active Blue Gene partition can now be modified on the Blue Gene/P system.
Problems fixed in LoadLeveler 3.5.1.5 [March 22, 2010]
- Fixed LoadLeveler Blue Gene jobs to start on free nodes by skipping over invalid partitions in the Blue Gene database during partition load and continuing to load valid partitions.
Problems fixed in LoadLeveler 3.5.1.1 [May 18, 2009]
- Added scheduling enhancements to make it easier to find resources to run jobs on large Blue Gene systems.