Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 4.2.x applies for all supported platforms.
Problems fixed in IBM Spectrum Scale 4.2.3.23 [July 30, 2020]
- Item: IJ25055
- Problem description: Offline fsck requires a certain amount of pagepool memory to run with a single inode scan pass. If that amount is not available, it displays a warning message before starting the scan, indicating the number of inode scan passes it will take with the currently available pagepool memory and the amount of pagepool memory it would need to complete a single inode scan pass. For example, this is the message displayed by fsck when there is insufficient pagepool memory for a single inode scan pass:
  ----------------
  Available pagepool memory will require 3 inode scan passes by mmfsck.
  To scan inodes in a single pass, total pagepool memory of 11767119872 bytes is needed.
  The currently available total memory for use by mmfsck is 8604614656 bytes.
  Continue fsck with multiple inode scan passes? n
  ----------------
  The problem is that in some cases the message shows an incorrect value for the pagepool memory needed. A side effect is that in some cases fsck does not show the above message at all and instead shows the following incorrect message:
  ----------------
  There is not enough free memory available for use by mmfsck in ... Continue fsck with multiple inode scan passes? n
  ----------------
- Work around: None specific. Either continue running fsck with multiple inode scan passes, or increase the pagepool incrementally until fsck can run with a single inode scan pass (see the example below).
- Problem trigger: This issue is most likely to trigger when offline fsck is run on a large file system on nodes with a relatively small pagepool.
- Symptom: Incorrect value in message
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: Suggested: has little or no impact on customer operation
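- Example: A sketch of the work around; the 12G value is illustrative, sized from the sample message above, and must fit within the node's real memory:
  mmchconfig pagepool=12G -i   # -i applies the change immediately and persists it; repeat with larger values until fsck reports a single inode scan pass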
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24968
- Problem description: Commands like mmcrsnapshot/mmdelsnapshot/mmunlinkfileset need to quiesce the file system/fileset to get a consistent view. GPFS waits for all in-flight vfs operations to finish before it can mark the file system as quiesced, and any new vfs operation started after quiescing is blocked until the file system resumes. In this case the file system is marked as quiesced even though there is still an in-flight vfs operation: a thread holding a GPFS file lock is trying to do a getxattr for the security namespace, and the getxattr is blocked because the file system is quiesced. If the command (mmdelsnapshot in this case) needs to acquire that file lock to finish its job before the file system resumes, GPFS deadlocks.
- Work around: Disable SELinux (see the example below)
- Problem trigger: 1) SELinux is enabled (permissive or enforcing mode) 2) a command like mmcrsnapshot/mmdelsnapshot/mmunlinkfileset needs to quiesce the file system/fileset to get a consistent view
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Filesets Snapshots
- Customer Impact: Critical
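- Example: A sketch of the work around; fully disabling SELinux requires a reboot and has security implications, so confirm it is acceptable for your site:
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # disable SELinux at boot
  reboot                                                         # the change takes effect after reboot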
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25052
- Problem description: mmsdrrestore fails with an "nqStatus not found" error.
- Work around: Copy /usr/lpp/mmfs/bin/mmremote from 4.2.3.19 or earlier, then run mmsdrrestore. After the restore completes, restore the original mmremote file (see the example below).
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
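- Example: A sketch of the work around; the source path of the 4.2.3.19 mmremote copy and the node name are hypothetical:
  cp -p /usr/lpp/mmfs/bin/mmremote /tmp/mmremote.current       # save the current file
  cp -p /path/to/4.2.3.19/mmremote /usr/lpp/mmfs/bin/mmremote  # install the older copy
  mmsdrrestore -p primaryNode                                  # run the restore
  cp -p /tmp/mmremote.current /usr/lpp/mmfs/bin/mmremote       # put the original back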
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22895
- Problem description: The mmdelnode command always assumes GPFS is down on a node that is not pingable, not responding, or refusing connections. This is a problem because mmdelnode might remove a node whose state it cannot verify, causing a LOGSHUTDOWN.
- Work around: Manually ensure GPFS is down before running the mmdelnode command (see the example below).
- Problem trigger: Running mmdelnode while a node is unreachable; nodes may shut down with a LOGSHUTDOWN reason.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
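- Example: A sketch of the work around; the node name is illustrative:
  mmgetstate -N node5   # check the daemon state on the node to be deleted
  mmshutdown -N node5   # make sure GPFS is down on it
  mmdelnode -N node5    # then remove it from the cluster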
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24831
- Problem description: Signal 11 occurs while mmdiag --threads is running, for example: [E] Signal 11 at location 0x55F6FF53A47D in process 7840
- Work around: None
- Problem trigger: Running mmdiag --threads
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25467
- Problem description: GPFS daemon assert: exp(this->mutexMagic == MUTEX_MAGIC_VALID) in dSynch.C. This could occur during file system unmount.
- Work around: None
- Problem trigger: Unmounting a file system
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25470
- Problem description: GPFS daemon failed to start with "Cannot allocate memory" error when prefetchThreads is set to less than 3.
- Work around: Set prefetchThreads to 3 or higher (see the example below).
- Problem trigger: Setting prefetchThreads to less than 3.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
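- Example: A sketch of the work around; 72 is shown only as a commonly used default value:
  mmchconfig prefetchThreads=72   # any value of 3 or higher avoids the startup failure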
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25510
- Problem description: GPFS daemon assert: !"Log file migrate check failed: need" in sgmdata.C. This could happen during mmrestripefs/mmdeldisk/mmrpldisk command.
- Work around: Rerun the command
- Problem trigger: File system panic on a client node while running mmrestripefs/mmdeldisk/mmrpldisk command.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25556
- Problem description: An input file containing carriage returns causes mmfileid to fail with an arithmetic syntax error.
- Work around: Manually ensure that all input files are free of carriage-return characters (see the example below).
- Problem trigger: Carriage return characters in an input file.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested: has little or no impact on customer operation
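- Example: A sketch of pre-cleaning an input file; the device and file names are hypothetical:
  tr -d '\r' < files.list > files.clean   # strip carriage-return characters
  mmfileid gpfs0 -F files.clean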
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25614
- Problem description: The mmunlinkfileset command hangs and long waiter "waiting to quiesce" appears. A thread is hung and waiting inside the gpfs_s_delete_inode kernel extension routine
- Work around: None
- Problem trigger: Run the mmunlinkfileset command (or commands which need file system quiesce like mmcrsnapshot and mmdelsnapshot) and delete files
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: Filesets, Snapshots
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25590
- Problem description: File system could panic with an error code 2 during the unmount process. This could happen if mmdelsnapshot command is running at the time of the unmount.
- Work around: Avoid unmounting the file system while the mmdelsnapshot command is still running
- Problem trigger: Unmounting the file system while the mmdelsnapshot command is still running
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25616
- Problem description: In the unlikely scenario that the GPFS configuration file (mmfs.cfg) becomes corrupted, the mmfsd daemon may be affected.
- Work Around: None
- Problem trigger: A corrupted mmfs.cfg file.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25618
- Problem description: A tenant cannot be deleted while other clients are registered to it, even though the tenant contains no keys. Clients on the same cluster should already have deregistered, so the registered client is likely from another cluster.
- Work Around: Deregister all registered key clients from the key server before deleting the tenant (see the example below).
- Problem trigger: Deleting the tenant when there are other clients registered to it.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: Admin Commands Encryption
- Customer Impact: Suggested
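- Example: A sketch of the work around; client, tenant, and server names are hypothetical and the exact mmkeyserv syntax may vary by release:
  mmkeyserv client show                                    # list registered key clients
  mmkeyserv client deregister c1Client --tenant devTenant  # deregister each remaining client
  mmkeyserv tenant delete devTenant --server keysrv1       # then delete the tenant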
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25659
- Problem description: When a fileset is in deleteRequired state, a blank character is missing between the "latest:" string and the snapshot name if the name is too long, leading to parsing issues on the snapshot name in the output of the mmlsfileset command.
- Work Around: None
- Problem trigger: A fileset is in deleteRequired state and a global snapshot contains this fileset with a long snapshot name.
- Symptom: Error output
- Platforms affected: All
- Functional Area affected: mmlsfileset command
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25622
- Problem description: AFM does not allow tuning the AFM tunables per node in the cluster; all of them apply only at the whole-cluster level. A few of them, like afmHardMemThreshold and afmMaxParallelRecoveries, need to be tuned on each gateway node.
- Work Around: None
- Problem trigger: Trying to set afmHardMemThreshold and afmMaxParallelRecoveries using the mmchconfig command with the -N option.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux and AIX OS nodes.
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25690
- Problem description: When a GPFS command is blocked by another command, an informational message is displayed to remind the user that the blocked command will resume after the conflicting running command completes. This does not happen for some long-running commands, like mmlssnapshot and mmdelsnapshot.
- Work Around: None
- Problem trigger: Any two conflicting GPFS commands, like mmlssnapshot and mmdelsnapshot.
- Symptom: Informational message needs improvement
- Platforms affected: All
- Functional Area affected: Informational message for conflicting GPFS commands.
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25644
- Problem description: Fsck reports a bad disk address in a to-be-deleted inode even though such inodes will be cleaned up during normal GPFS operations.
- Work Around: None needed; it is not harmful to allow fsck to repair the corruption.
- Problem trigger: Corrupt disk address in a to-be-deleted inode.
- Symptom: Error output/message
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25695
- Problem description: Running an administrative command within a few seconds after running a read-only offline fsck can lead to an assert.
- Work Around: None
- Problem trigger: Administrative command run right after offline fsck.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25827
- Problem description: While deleting files, Spectrum Scale also internally marks them as free in its inode allocation map so the inodes can be reused. If this process is interrupted (for example, by nodes going down), in some cases the files may not show as in use in the inode allocation map even though they have not been deleted. If offline fsck is then run, it can show a false positive report about a directory entry pointing to a deleted inode. For example: file entry inode 553063 "file_name_13_for_tortDir_dirBlockSplits.tid4338.idx1.xxxxxxxxxxxxxx"
- Work Around: None required, as the files were intended to be deleted by the user; there is no harm if the fsck repair deletes them.
- Problem trigger: Deleting of files getting interrupted due to node assert/crashes.
- Symptom: False positive fsck report
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25859
- Problem description: Some log records cannot be committed successfully, but the VtrackMap may already have been updated. Due to the resign, the update to the in-memory metadata may not have happened. In such a scenario, if another thread doing a VtrackMap flush operation successfully writes the metadata block, the metadata versions recorded by the VtrackMap entry and the metadata block will be the same, although the log will contain the latest version of the record.
- Work Around: None
- Problem trigger: A RG resign during a track write operation
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25862
- Problem description: In a rare case, mmap(2)/munmap(2) system call may block file system quiesce and cause quiesce timeout
- Work Around: None
- Problem trigger: Commands like mmdelfileset, mmcrsnapshot, mmdelsnapshot will start file system quiesce. At the same time an application calls mmap(2)/munmap(2)
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Filesets/Snapshots
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25870
- Problem description: AFM builds up temporary files in the /var directory for recovery procedures on an AFM fileset. These files are not deleted until the next recovery on the same fileset.
- Work Around: The user may have to remove the /var/mmfs/afm/<fileset>/recovery files manually, after making sure that recovery has completed flushing the queue (see the example below).
- Problem trigger: Failure of the AFM normal queue, and recovery getting triggered on the next operation on the fileset.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High
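- Example: A sketch of the manual cleanup; the device and fileset names are hypothetical, the directory layout under /var/mmfs/afm is release-dependent, so inspect it first and only remove files after recovery has finished flushing the queue:
  mmafmctl gpfs0 getstate -j fileset1     # confirm the fileset is Active with nothing queued
  ls -R /var/mmfs/afm                     # locate the leftover recovery files
  rm /var/mmfs/afm/<fileset-dir>/recovery*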
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ25977
- Problem description: Cygwin version 3.1.5 released on June 1, 2020, has changed its implementation of symlinks. Cygwin symlinks are now Windows reparse points instead of the older-style system file with header. Due to this change, GPFS on Windows fails to interpret the new Cygwin symlinks. This results in errors during the GPFS daemon startup, specifically in its attempt to load the authorized public key.
- Work Around: Revert to older level of Cygwin (version 3.1.4 or earlier).
- Problem trigger: Upgrade to Cygwin version 3.1.5 (or later).
- Symptom: Abend/Crash.
- Platforms affected: Windows.
- Functional Area affected: Windows.
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ26291
- Problem description: The EA overflow block is a metadata block that should be read using a contiguous buffer, but due to a code error it is treated as a data block, so a scatter buffer is used, which causes a log assert failure.
- Work Around: If the scatter buffer size is bigger than or equal to the EA overflow block size, the log assert can be avoided. Before GPFS version 5.0.0 the default scatterBufferSize value is 32KB, which is small and more likely to hit this log assert. Changing the scatterBufferSize value to 256KB can be a work around (see the example below).
- Problem trigger: If a file's EA entries cannot be placed in the inode, GPFS places them in a per-file EA overflow block. The size of the EA overflow block depends on how many EA entries the file has. If the scatterBufferSize value is smaller than the EA overflow block size, GPFS may hit the log assert when the user accesses the file.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
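- Example: A sketch of the work around; whether the value takes effect without a daemon restart is an assumption to verify:
  mmchconfig scatterBufferSize=262144   # 256 KiB, per the work around above
  mmshutdown -a && mmstartup -a         # restart GPFS so the new value is picked up (assumed to be required)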
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ26290
- Problem description: When a data management application receives a DMAPI postrename event, it fails to get the file handle for the renamed file with a "no such file" error. This is because Spectrum Scale is delivering a DMAPI postrename event before the Linux kernel updates its directory lookup cache for the file being renamed.
- Work Around: Let your data management application take a short sleep and retry getting the file handle of the renamed file.
- Problem trigger: DMAPI enabled Spectrum Scale file system and a DMAPI postrename event.
- Symptom: The data management application gets an ENOENT error if it tries to get the file handle for the renamed file immediately after receiving a DMAPI postrename event.
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ26231
- Problem description: GPFS Access Control Lists (ACL) can only store limited types of Access Control Entries (ACE), specifically plain Access-Allowed-ACE, Access-Denied-ACE, System-Audit-ACE and System-Alarm-ACE. GPFS does not support storing of any of the Object-specific-ACEs corresponding to ACL_REVISION_DS. An attempt to set an ACL (containing the unsupported ACE types, such as the Object-specific-ACEs), can result in a kernel bugcheck.
- Work Around: None
- Problem trigger: The ACL being set on a GPFS file or directory must contain an unsupported ACE type, such as an Object-specific-ACE.
- Symptom: Abend/Crash.
- Platforms affected: Windows
- Functional Area affected: Windows/ACLs.
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ26289
- Problem description: The inode operations ->set_acl and ->get_acl are not supported in GPFS, so on kernel v3.14 and later some commands such as nfs4_setfacl may fail.
- Work Around: None
- Problem trigger: The missing functions in the GPFS kernel portability layer on some newer kernel levels.
- Symptom: Error output/message
- Platforms affected: All Linux with kernel v3.14+
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ22895 IJ24831 IJ24968 IJ25052 IJ25055 IJ25467 IJ25470 IJ25510 IJ25556 IJ25614 IJ25590 IJ25616 IJ25618 IJ25622 IJ25644 IJ25659 IJ25690 IJ25695 IJ25827 IJ25859 IJ25862 IJ25870 IJ25977 IJ26231 IJ26289 IJ26290 IJ26291
Problems fixed in IBM Spectrum Scale 4.2.3.22 [May 14, 2020]
- Item: IJ22261
- Problem description: Due to the way mmfsck internally traverses reserved files and snapshots, it is not able to report and fix duplicate disk addresses present among inode 0 files of the active file system and its snapshots. As a result, even though mmfsck -y runs successfully and reports the file system as clean, the duplicate address corruptions are not fixed, so the next mmfsck run will report new corruptions such as mismatched replicas in inode 0. There can also be fsstruct errors reported in the logs after mmfsck -y.
- Work around: Delete all the snapshots in the file system and then run the mmfsck repair (see the example below)
- Problem trigger: Duplicate disk addresses present among inode 0 files of the active file system and its snapshots.
- Symptom: Operation failure due to file system corruption. Also, on a file system having snapshots, the fsck output shows the following signs after a successful mmfsck -y run: 1) mismatched replicas in inode 0, e.g. "Error in inode 0 snap 0: Inode block 289710225 has mismatched replicas"; 2) even though no duplicates are reported, fsck shows "Checking for the first reference to a duplicate fragment."; 3) even though no duplicates are reported, a non-zero duplicates count appears at the end of the fsck output, e.g. "896 duplicates".
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK is not able to repair the corruption
- Customer Impact: Critical
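- Example: A sketch of the work around; the device and snapshot names are hypothetical:
  mmlssnapshot gpfs0          # list all snapshots in the file system
  mmdelsnapshot gpfs0 snap1   # delete each snapshot in turn
  mmumount gpfs0 -a           # offline fsck requires the file system to be unmounted everywhere
  mmfsck gpfs0 -y             # then run the repair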
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24453
- Problem description: While migrating a file dm_set_dmattr failed with rc 9 (EBADF)
- Work around: None
- Problem trigger: Migrating a file
- Symptom: Migration fails
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24451
- Problem description: HDR100 IB adapter is shown as '?x ?DR INFINIBAND' in mmfs.log
- Work around: None
- Problem trigger: Configure HDR100 IB adapter in verbsRdmaPorts
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23757
- Problem description: The GPFS kernel module exports an ioctl interface used by the mmfsd daemon and some of the mm* commands. The provided improvements result in a more robust functionality of the kernel module.
- Work around: None
- Problem trigger:
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23785
- Problem description: The GPFS mmfsd daemon services multiple types of requests received over multiple interfaces. The hardening of the mmfsd daemon results in a more robust functionality.
- Work around: None
- Problem trigger:
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24581
- Problem description: When a file is being compressed or recompressed, a small write may update the cached data but be dropped after the compression process is done.
- Work around: Stop doing file compression while small sequential I/O is in progress.
- Problem trigger: Doing small sequential I/O while file is being compressed.
- Symptom: Silent data loss.
- Platforms affected: All OS environments
- Functional Area affected: File compression or small sequential writes
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24602
- Problem description: The copy-on-write to a snapshot could be missed when doing fast direct I/O writes, causing the inode or file data to not be copied to the snapshot before the file is modified.
- Work around: Stop using snapshots, or apply this fix.
- Problem trigger: Doing small direct I/O when there is a snapshot created for the file system.
- Symptom: FSErrSnapInodeModified structure error.
- Platforms affected: All OS environments
- Functional Area affected: Direct IO with snapshot
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24427
- Problem description: During the process of restriping the file system, some nodes get unexpected long waiters that are waiting for a PIT_Stop or restart message.
- Work around: None
- Problem trigger: File system restripe when there are big files in the file system.
- Symptom: Unexpected long waiter
- Platforms affected: All
- Functional Area affected: File system restripe or other Spectrum Scale commands using PIT framework.
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24603
- Problem description: On a RHEL 7.7 node with supported GPFS versions 4.2.3.18 or higher and 5.0.4.0 or higher, when the kernel is upgraded to version 3.10.0-1062.18.1 or higher, the node may encounter a kernel crash when accessing a deleted directory
- Work around: None
- Problem trigger: Accessing a directory which has been deleted
- Symptom: Abend/Crash
- Platforms affected: All RHEL 7.7 OS environments with kernel version equal or higher than 3.10.0-1062.18.1
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ15397
- Problem description: Using the mmchcluster command to enable CCR may fail. While mmchcluster is working to enable CCR, any other mm command can remove the authorized_ccr_keys file, which is needed in the final step of enabling CCR. This problem occurs more often when the first quorum node in the list is on a GPFS-supported systemd system.
- Work around: Run mmchcluster on a quorum node that does not support GPFS systemd, or temporarily disable system health monitoring with: chmod 000 /usr/lpp/mmfs/bin/mmsysmon* (see the example below)
- Problem trigger: While the mmchcluster command is working to enable CCR, any other mm command can remove the authorized_ccr_keys file which is needed in the final step of enabling CCR.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments with systemd version >= 219
- Functional Area affected: CCR Admin Commands
- Customer Impact: High Importance to customers that want to enable CCR.
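- Example: A sketch of the second work around; 555 is an assumed original file mode, so record the real permissions before changing them:
  chmod 000 /usr/lpp/mmfs/bin/mmsysmon*   # temporarily disable system health monitoring
  mmchcluster --ccr-enable                # enable CCR
  chmod 555 /usr/lpp/mmfs/bin/mmsysmon*   # restore health monitoring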
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24610
- Problem description: On Linux nodes with kernel v4.7 or later, when getting POSIX ACL data through the getxattr syscall with a buffer size equal to the POSIX xattr string value length, the POSIX ACL data size is changed to the input buffer size by setxattr. The kernel may crash because the input buffer of getxattr gets overwritten.
- Work around: None
- Problem trigger: Kernel v4.7 or later
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux nodes with kernel v4.7 or later
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24661
- Problem description: Directory files are considered metadata and their disk blocks are allocated from the system pool, while quota counts disk usage in data subblock units. In file systems where the metadata pool and data pool subblock sizes differ, there can be a discrepancy (due to approximation) between what is actually allocated (in metadata subblocks) and what is tracked by quota.
- Work around: None
- Problem trigger: Different metadata and data subblock sizes.
- Symptom: Inconsistent mmcheckquota results in file systems with different metadata and data subblock sizes.
- Platforms affected: All OS environments
- Functional Area affected: Quotas
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ24662
- Problem description: Using mmdsh to do a remote copy of a file fails.
- Work around: Use scp to do a remote copy.
- Problem trigger: Using an undocumented option of mmdsh to do a remote copy.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: N/A
- Customer Impact: Suggested: has little or no impact on customer operation
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ15397 IJ22261 IJ23757 IJ23785 IJ24427 IJ24451 IJ24453 IJ24581 IJ24602 IJ24603 IJ24610 IJ24661 IJ24662
Problems fixed in Spectrum Scale 4.2.3.21 [April 2, 2020]
- Item: IJ23054
- Problem description: When quorum nodes are removed from the cluster by using 'mmdelnode' and added back later for use as quorum nodes, the first attempt of 'mmchnode' (to declare those nodes as quorum nodes) might fail. This is caused by a CCR bug: old cached outgoing connections are not closed and deleted when those nodes are removed from the cluster (during 'mmdelnode').
- Work around: Attempt the same 'mmchnode' command again, and it should succeed.
- Problem trigger: Executing 'mmchnode' to declare non-quorum nodes as quorum nodes shortly after those nodes have been removed from the cluster, when the cluster has just one quorum node at the time the new quorum node is being declared.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: -CCR -Admin Commands (mmchnode)
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23057
- Problem description: A node/thread/terminal running the mmafmctl <Device> getstate command can get deadlocked with the FS manager while trying to create/delete filesets or link/unlink them. With dependent filesets linked under AFM filesets, the possibility of deadlock increases.
- Work around: None
- Problem trigger: Running the "mmafmctl <Device> getstate" command while the FS manager is creating/linking/unlinking/deleting filesets on the same file system.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23061
- Problem description: File compression problems leading to FSErrBadCompressBlock structure errors.
- Work around: Stop compression
- Problem trigger: Starting an mmap write on a file being compressed.
- Symptom: FSErrBadCompressBlock structure error
- Platforms affected: All
- Functional Area affected: File compression, mmap
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23147
- Problem description: AFM relies on the .afm/.afmtrash directory at the remote site to handle conflicting operations that arise at the cache/production site. When conflicts arise, AFM moves the affected entity to .afm/.afmtrash and continues with the operation in question. For a non-Scale file system at the remote, this .afmtrash directory is not present and such conflicting operations get stuck forever.
- Work around: None
- Problem trigger: Having an SW fileset targeting a non-GPFS home where the .afm/.afmtrash directory is not available, and on such an SW fileset creating a sequence of operations that causes the cache to see a conflict with home and take evasive action by moving the entire entity to the .afmtrash directory.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS and AIX environments
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23109
- Problem description: In rare situations, the quota shares relinquished by quota clients in the first phase of the mmcheckquota command may not be flushed to disk, and if a new quota manager is appointed it may fetch stale in-doubt values from disk.
- Work around: None, but avoiding a new quota manager instance (unmount and mount, or mmchmgr) can decrease the window of opportunity during which the stale in-doubt information remains on disk.
- Problem trigger: mmcheckquota followed by a new quota manager in a busy system.
- Symptom: After mmcheckquota, the in-doubt information provided by quota commands differs from the in-doubt information presented by a newly appointed quota manager. This is one of the possible causes of in-doubt values not decreasing after a long time of quota inactivity.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23117
- Problem description: AFM filters out some messages being queued to avoid unnecessary playback of those messages to the remote site. While filtering such operations, if the fileset moves to the Unmounted state because of a change at home, the re-queueing of pending operations causes the daemon on a gateway to assert.
- Work around: None
- Problem trigger: While files on which lookups are queued are being removed, the cache fileset encounters the Unmounted state due to some error at the home.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23151
- Problem description: When AFM runs recovery/resync to catch up on missed updates from the cache/production to the home/remote site, it populates recovery/resync operations and flushes only them. Any live operations are held from playing to the remote until recovery/resync completes. This should not be the case for files that are evicted and awaiting recovery/resync completion.
- Work around: Wait until recovery/resync completes so that the file data can be read back from the remote site.
- Problem trigger: Reading an evicted file while the recovery/resync procedures are in progress
- Symptom: IO error
- Platforms affected: ALL Linux OS and AIX environments
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23152
- Problem description: Kernel crash with UID remapping enabled due to the NULL pointer dereference in the tracing.
- Work around: None
- Problem trigger: UID remapping with remote cluster mounts
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23153
- Problem description: For a file system with multiple storage pools defined, the 'df' command may temporarily show 0% free space after mounting the file system (this can happen on both the file system manager and client nodes). In most cases the problem will disappear within a sync period, but if a client node does not do block allocation/deallocation, the problem can persist indefinitely.
- Work around: There are several workarounds (see the example below): 1) run the 'mmdf' command, although this command may be time consuming; 2) on the problematic client node, do some block allocation or deallocation and then run 'sync'.
- Problem trigger: Running the df command during a file system mount
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
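- Example: A sketch of work around 2, run on the affected client node; the mount point and file name are hypothetical:
  dd if=/dev/zero of=/gpfs0/df_fix.tmp bs=1M count=1   # force a block allocation
  sync                                                 # flush, so df reports real free-space numbers
  rm /gpfs0/df_fix.tmp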
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23119
- Problem description: The tspcacheutil program is used to look at the state of each file/directory/entity at the cache/production site with respect to its home/DR site. This program cannot handle 64-bit inode numbers and is seen to pick up random 32-bit inode numbers and print their stats.
- Work around: None
- Problem trigger: Running tspcacheutil on a file with an inode number in the 64-bit inode range (> 4B)
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS and AIX environments
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23127
- Problem description: A disk problem could cause a newly created log file to become inconsistent, and this in turn could cause a file system panic during log recovery. All attempts to mount the file system will fail when this occurs.
- Work around: None
- Problem trigger: Disk error and node failure that require log recovery
- Symptom: Cluster/File System Outage
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23137
- Problem description: AFM prefetch does not work if the files have 64-bit inode numbers assigned to them. When checking a file for the cached bit, a 32-bit inode number is used, and the integer overflow might cause a file's cached state to be erroneously returned as true.
- Work around: None
- Problem trigger: AFM prefetch
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23410
- Problem description: tsfindinode could skip scanning additional filesets/snapshots after encountering an error while trying to open a directory for scanning. This could cause tsfindinode to not find all the files.
- Work around: Run tsfindinode multiple times and avoid changing directory tree while tsfindinode is running.
- Problem trigger: Running workload that changes directory tree while tsfindinode is running.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: Admin commands
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ23426
- Problem description: The tsusercmd command is used by several mm commands. Hardening of the tsusercmd command provides a more robust execution of the mm commands.
- Work around: None
- Problem trigger:
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: Admin commands
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ23054 IJ23057 IJ23061 IJ23109 IJ23117 IJ23119 IJ23127 IJ23147 IJ23151 IJ23152 IJ23153 IJ23410 IJ23426
Problems fixed in IBM Spectrum Scale 4.2.3.20 [February 20, 2020]
- Item: IJ21910
- Problem description: Excessive RDMA errors are being logged.
- Work around: None
- Symptom: Error message
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22030
- Problem description: If too many pdisks are unreadable (not missing), inhibiting writes to a vtrack, it is possible that ESS will write stale strip information to the metadata log. When the scrubber tries to scrub the vtrack, it will examine this stale strip data and declare data loss.
- Work around: none
- Problem trigger: Unavailability of pdisks to do a vtrack write.
- Symptom: IO error.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21802
- Problem description: Running mmsdrrestore against a quorum node in a CCR-enabled cluster will crash the GPFS daemon.
- Work around: Shut down GPFS before performing mmsdrrestore
- Problem trigger: Running mmsdrrestore against a quorum node in a CCR-enabled cluster
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: CCR Admin Commands
- Customer Impact: Critical: an issue which may cause an application to fail.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22031
- Problem description: On a node with multiple file systems mounted, DiskLeaseThread could be blocked by a file system unmount, causing a delay in renewing the disk lease and potential quorum loss.
- Work around: none
- Problem trigger: File system unmount
- Symptom: Node expel/Lost Membership
- Platforms affected: ALL
- Functional Area affected: Cluster Membership
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21803
- Problem description: Deadlock could happen if quorum loss occurs on a newly appointed stripe group manager. Threads could be stuck in 'waiting for stripe group takeover' and 'waiting for SG cleanup'.
- Work around: none
- Problem trigger: Quorum loss just as a node starts taking over the file system manager role
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21868
- Problem description: FSSTRUCT error FSErrCheckHeaderFailed could be issued while accessing a directory. This could happen on a file system with metadata replication where there is a metadata disk in down state and a node failure.
- Work around: none
- Problem trigger: Metadata disk in down state and node failure.
- Symptom: Error output/message
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21909
- Problem description: Revalidation on the fileset root path might not happen correctly if the gateway is running certain operating systems, like RHEL 7.7. This causes new data from the target path not to be fetched from home.
- Work around: none
- Problem trigger: Revalidation on the fileset root path in the AFM caching modes.
- Symptom: Unexpected behavior/results
- Platforms affected: Certain Linux OS environments, like RHEL 7.7
- Functional Area affected: AFM
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21935
- Problem description: The description of the mmafmctl resumeRequeued subcommand did not indicate that filesetName is a required argument.
- Work around: none
- Problem trigger: Running the mmafmctl command as recommended in the manpage and the command's help.
- Symptom: mmafmctl shows wrong help, not mandating the filesetName for the mmafmctl <Device> resumeRequeued subcommand.
- Platforms affected: All Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22033
- Problem description: When deleting a snapshot, the process may fail to move the data blocks of to-be-deleted snapshot files with small inode numbers. The affected inodes are in the same inode block as the fileset metadata file, and not in the first inode block of the inode 0 file.
- Work around: none
- Problem trigger: Deleting a snapshot which contains a file with small inode number
- Symptom: Data corruption
- Platforms affected: All
- Functional Area affected: Snapshot
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21944
- Problem description: On Linux nodes with kernel version 4.7 or later, when copying a source file with 'cp -p', the ACL data is lost in the destination file if the source file contains many ACL entries (for example, 20 or more).
- Work around: none
- Problem trigger: Defect in porting of GPFS to Linux kernel version 4.7.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux nodes with kernel version 4.7 or later
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21598
- Problem description: When handling a page fault, GPFS did not detach the I/O buffer segment. This later caused a kernel crash.
- Work around: none
- Problem trigger: Multiple threads doing both normal I/O and mmap I/O on the same file at the same time.
- Symptom: Kernel crash
- Platforms affected: AIX
- Functional Area affected: Mmap I/O
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22035
- Problem description: The TSM client version can contain 2 or more digits in any position of V.R.M.F, but mmbackup cannot handle such a case. As a result, mmbackup fails while parsing the TSM client version.
- Work around: none
- Problem trigger: Executing mmbackup with TSM client 8.1.10.
- Symptom: Component Level Outage
- Platforms affected: ALL Operating System environments
- Functional Area affected: mmbackup
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21752
- Problem description: When running the command "mmdiag --waiters" or "mmfsadm dump waiters", the GPFS user space daemon (mmfsd) could encounter a memory overflow triggering a signal 6. This causes the daemon to fail and restart. Note that these commands are executed periodically by the health monitoring function of Spectrum Scale.
- Work around: none
- Problem trigger: Either the mmdiag --waiters or mmfsadm dump waiters command.
- Symptom: Daemon crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: Long waiters detection and dump
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22177
- Problem description: File creation could fail unexpectedly with an EFBIG error. This could happen when multiple nodes access the same directory while one node repeatedly creates and deletes the same file in the directory.
- Work around: Perform a rename on a file in the directory after encountering the EFBIG error.
- Problem trigger: Repeatedly creating and deleting the same file in a directory.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22084
- Problem description: When deleting a global snapshot, if the snapshot refers to a deleted fileset, an assert will be triggered.
- Work around: none
- Problem trigger: This problem only happens when deleting a global snapshot, while a fileset included in it has been deleted.
- Symptom: Daemon abend
- Platforms affected: ALL Operating System environments
- Functional Area affected: Global snapshot deletion
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22186
- Problem description: Daemon crashed with assert ofP->metadata.notAFragment(subblocks). It may occur when appending data to a file after a previous write failed due to an invalid data buffer in the application.
- Work around: Make sure the user data buffer is valid before writing data into the Scale file system
- Problem trigger: An invalid user data buffer caused GPFS to fail when writing data to a file, leaving the invalid data in the buffer. A flush of the buffer incorrectly set the file's fragment to a full block, which resulted in a failure to expand the last block of the file, triggering the assert.
- Symptom: Scale daemon crashed with assert ofP->metadata.notAFragment(subblocks) in bufdesc.C
- Platforms affected: ALL Operating System environments
- Functional Area affected: Core
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22085
- Problem description: On one side an AFM fileset is being deleted, and on another side a getstate command is run to show AFM fileset states. The getstate command picks the fileset being deleted to print its stats, causing the assert.
- Work around: Do not run "mmafmctl <Device> getstate" or "mmdiag" commands while AFM filesets are being deleted.
- Problem trigger: On one side an AFM fileset is being deleted (which could take time depending on the number of inodes in the fileset and the amount of data). While this is happening, another node in the cluster queries AFM stats on the AFM filesets (mmafmctl <Device> getstate, or mmdiag).
- Symptom: Abend/Crash
- Platforms affected: ALL Linux and AIX environments.
- Functional Area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22099
- Problem description: The online replica compare function (mmrestripefs -c) could give incorrect replica mismatch errors on directories. This could happen if the subblock size for metadata is greater than 256K.
- Work around: None
- Problem trigger: Run mmrestripefs -c on file system with metadata subblock size greater than 256K.
- Symptom: Error output/message
- Functional Area affected: Admin Commands
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22189
- Problem description: An EACCES error is returned to the NFS client from the Ganesha server; it can cause IO failure for metadata access (the ls command) on a file/directory or can fail an rm operation on the directory.
- Work around: None
- Problem trigger: It is difficult to recreate, but a possible cause is a file/directory move/deletion from the parent directory which leaves a disconnected dentry in the Linux kernel.
- Symptom: IO failure
- Platforms affected: Linux Only
- Functional Area affected: NFS Ganesha
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22139
- Problem description: Make the timeout of the commMsgCheckMessages RPC consistent on all nodes and issue a warning message if it takes more than one third of the timeout to get the reply to the commMsgCheckMessages RPC.
- Work around: None
- Problem trigger: A poor network, which leads to sending the commMsgCheckMessages RPC
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22149
- Problem description: The GPFS command mmchattr stores the extended attribute name-value pair in the inode itself even for the ACL xattr, which should be stored in the GPFS internal ACL file. This behavior of ACL xattr handling may confuse users.
- Work around: None
- Problem trigger: None
- Symptom: Confusing output
- Platforms affected: Linux
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22157
- Problem description: In AFM Stopped and Queue-dropped states, when a file/directory is removed at the cache site, the inode is still seen as USERFILE and is not reclaimed.
- Work around: None
- Problem trigger: Running applications/workload when AFM fileset is in Stopped state.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux and AIX operating systems.
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ22713
- Problem description: A Linux mknod operation for a FIFO object can encounter an assert if the object is opened before the operation completely finishes.
- Work Around: The assert can be disabled with the assistance of service via "mmchconfig disableAssert"
- Problem trigger: A Linux mknod operation to create a FIFO object while another process attempts to open the same object (not actually waiting for the create to complete).
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ21598 IJ21752 IJ21802 IJ21803 IJ21868 IJ21909 IJ21910 IJ21935 IJ21944 IJ22030 IJ22031 IJ22033 IJ22035 IJ22084 IJ22085 IJ22099 IJ22139 IJ22149 IJ22157 IJ22177 IJ22186 IJ22189 IJ22713.
Problems fixed in IBM Spectrum Scale 4.2.3.19 [December 5, 2019]
- Item: IJ21108
- Problem description: If encryption is not configured properly, starting down disks could result in mismatched replicas.
- Work around: None
- Problem trigger: During the "start disk", repairing mismatched replicas failed on certain files because encryption context was not available, and the error E_ENC_CTX_NOT_READY was treated as a SEVERE error which means that the code continues to repair the replicas to the degree possible. In the final phase of repair, the missupdate flag was incorrectly cleared from the inode even though we did not synchronize the replicas, as the repair failed due to unavailable encryption context. As the missupdate flag was cleared from the inode, a subsequent "start disk" brought up all down disks, but the file still had mismatched replicas. A later "mmrestripefs -c" may then pick up the wrong replica and overwrite the good replicas.
- Symptom: Encrypted replicas mismatch after start disk.
- Platforms affected: All
- Functional Area affected: Core
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21069
- Problem description: When updating the file size for preallocation, the new file size is calculated incorrectly, which results in an unexpected file size.
- Work around: Do not try to preallocate the same block more than once.
- Problem trigger: In an FPO cluster, the problem can be triggered if one tries to pre-allocate the same block more than once and the second request has a larger file size.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: FPO
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20812
- Problem description: In case an fsck worker node returned an error because another worker node had failed, fsck could report a misleading error message.
- Work around: None
- Problem trigger: An fsck worker node fails.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Fsck
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20882
- Problem description: Deadlock occurs with multiple threads reading from the same file concurrently.
- Work around: None
- Problem trigger: Reading a single file from multiple threads.
- Symptom: Deadlock
- Platforms affected: All OS environments
- Functional Area affected: All
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20842
- Problem description: When a remote node joins the home cluster as part of a remote FS mount, the cluster manager in the home cluster sends information about all the active nodes in the cluster to the remote node as part of the join protocol. AFM selects the gateway node for a fileset by hashing the fileset Id into the number of available gateway nodes in the cluster. Since remote nodes receive only active-node information during the join protocol, a remote node incorrectly selects the gateway node for the fileset if any of the gateway nodes are down in the home cluster when the remote node joins. This causes cluster-wide deadlocks as the application node keeps retrying the request to the incorrect gateway node, blocking the file system quiesce.
- Work around: Make sure that all the AFM gateway nodes are up and running in the home cluster.
- Problem trigger: Remote cluster mount, AFM enabled filesets with down gateway nodes in the home cluster.
- Symptom: Deadlock
- Platforms affected: All
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20839
- Problem description: A hang can happen when accessing a memory mapped file with the mmapRangeLock configuration variable disabled.
- Work around: Do not disable mmapRangeLock
- Problem trigger: Problem can be triggered with mmapRangeLock disabled when accessing memory mapped files.
- Symptom: hang on byte range lock
- Platforms affected: Linux
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20859
- Problem description: During an offline fsck multi-pass directory scan, if the patch queue feature is disabled and the --skip-inode-check option is used, fsck tries to access an out-of-range entry in dotdotArray and hits an assert.
- Work around: None
- Problem trigger: Multi-pass offline fsck --skip-inode-check with patch queue feature disabled.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20862
- Problem description: Daemon crashes due to an invalid config setting where enableStatUIDremap is enabled without enabling the enableUIDremap config option.
- Work around: Enable both the enableUIDremap and enableStatUIDremap options (see the example below).
- Problem trigger: UID remapping with invalid config options.
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
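- Example: A sketch of the work around; the option names come from this entry, but the yes/no value syntax is an assumption to verify:
  mmchconfig enableUIDremap=yes,enableStatUIDremap=yes   # enable both options together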
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20840
- Problem description: Due to a bug, fsck continues to process a deleted inode and marks it as an orphan, which causes an assert.
- Work around: Patch the problematic inode using tsdbfs so that the inode is no longer corrupt and retry fsck.
- Problem trigger: A deleted inode is corrupt.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20309
- Problem description: mmfsd asserts with expression "updates.size() < MAX_INDBLK_UPDATE" while creating snapshots. This happens during the indirect block updates when the number of allocated inodes is high enough that the indirection level of inode0 becomes INDIRECT (2 or higher).
- Work around: Disable logAssert using the filename and line number. Example mmchconfig disableAssert=llio.C:16103
- Problem trigger: Multiple snapshots creation
- Symptom: Abend/Crash
- Platforms affected: All OS environments
- Functional Area affected: Snapshots
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20306
- Problem description: When users configure their cluster with different admin node names and daemon node names, and there is no passwordless remote shell configuration for the daemon node names, snapshot restore does not work.
- Work around: None
- Problem trigger: Snapshot restore in an environment with no passwordless remote shell configuration for the daemon node names.
- Symptom: Snapshot restore failure
- Platforms affected: All AIX and Linux Operating System environments
- Functional Area affected: Snapshot restore
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20180
- Problem description: When running fsck in repair mode (-y), the file system manager node asserts with "logAssertFailed: newSize > 0 && newSize <= dlP->dirBlockSize" when trying to fix a directory that has a corrupted directory block; this also fails the fsck run.
- Work around: Disable the fsck patch queue feature with the command "mmdsh -N all mmfsadm test fsck usePatchQueue 0", then rerun fsck -y.
- Problem trigger: Fsck is run in repair mode to fix a file system having bad directory block corruption.
- Symptom: Node assert and fsck failure
- Platforms affected: All
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20150
- Problem description: QOS may deadlock on the file system manager node, particularly if many (hundreds of) nodes mount the file system and the manager node is heavily CPU or network loaded.
- Work around: 1) Run "mmchqos <FS> --stat-slot-time 15000 --stat-poll-interval 60", or, if that is not sufficient, 2) disable QOS until the fix is available.
- Problem trigger: There are many (hundreds) of nodes mounting the file system and the manager node is heavily CPU or network loaded.
- Symptom: Hang or Deadlock
- Platforms affected: All
- Functional Area affected: QOS
- Customer Impact: High Importance, especially for customers using QOS with hundreds of nodes mounting the file system.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20179
- Problem description: When "fs 1" trace is enabled, if the lenth of xattr value is smaller than 16B and not an even multiple of 4 bytes, gpfs will read over the end of xattr value. If the value string happens near the end of the block buffer, gpfs may access invalid memory address and cause kernel oops.
- Work around: No work around, but it's a very rare case, and can only happen when "fs" trace is enabled. Disable trace.
- Problem trigger: Users having applications which access file's EA attribute value are potentially affected. But this is a very rare case, and can only happen when "fs" trace is enabled.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Trace
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20308
- Problem description: On a client node (which is not the current file system manager) the 'df' command may temporarily show 0% free space after a file system is mounted. In most cases the problem disappears within a sync period (generally 5 - 30 seconds). However, if the client node does not do any block allocation/deallocation, the problem may persist.
- Work around: There are several work arounds to alleviate this problem (see the example after this entry). 1) Use the 'mmchmgr' command to assign a file system manager before mounting the file system. Then, after mounting the file system on the node assigned as the file system manager, wait a short time, say 10 seconds, then mount the file system on the other nodes. If this does not work for you, for example if automatic mount is enabled, use one of the following: 2) Run the 'mmdf' command on the node where df returns 0% free space. 3) Run the 'sync' command on the node where df returns 0% free space. 4) Execute an operation that requires the file system to do a block allocation, for example create a new file and write data into it, or append data to an existing file.
- Problem trigger: This is caused by an infrequently occurring race condition between mounting a file system and the initialization of internal allocation tracking structures.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
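- Example: Using placeholder names fs1 and mgrNode, work around 1 might look like the following; work arounds 2 and 3 are simply "mmdf fs1" and "sync" on the affected node:
    mmchmgr fs1 mgrNode        # assign the file system manager first
    mmmount fs1 -N mgrNode     # mount on the manager node, then wait ~10 seconds
    mmmount fs1 -a             # mount on the remaining nodes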
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20863
- Problem description: Writing to a memory-mapped file that was compressed fails with SIGBUS when the mmapRangeLock config variable is disabled.
- Work around: Do not disable the mmapRangeLock config variable.
- Problem trigger: Writing to memory-mapped files that were compressed while the mmapRangeLock config variable is disabled.
- Symptom: Application fails with SIGBUS
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20864
- Problem description: A node crashes with an assert when an AFM fileset with active IO is unlinked.
- Work around: Stop the AFM fileset and then unlink the fileset.
- Problem trigger: Fileset unlink with active IO.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: AFM
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20893
- Problem description: After upgrading a cluster with a pre-4.1 file system which has quotas enabled, the old user-visible quota files are converted to GPFS internal files. This change is recorded in the stripe group descriptor for the file system. However, the change is not broadcast to all nodes, causing a metadata inconsistency that leads to an assert.
- Work around: Method 1) Run "mmumount -a" followed by "mmmount -a" after upgrading a pre-4.1 file system which has quotas enabled. Method 2) Execute commands that update the stripe group descriptor for the file system, for example use mmchdisk to suspend and then resume one of the disks of the file system (see the example after this entry).
- Problem trigger: After upgrading a pre-4.1 file system which has quotas enabled, user.quota, group.quota and fileset.quota are migrated to regular files. In rare cases, accessing them (through the VFS interface or internally by tools like mmrepairfs) may cause a log assert.
- Symptom: Abend
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: High
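- Example: Using placeholder names fs1 and nsd1, Method 2 could be performed as:
    mmchdisk fs1 suspend -d nsd1
    mmchdisk fs1 resume -d nsd1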
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20865
- Problem description: The GPFS daemon could assert when trying to mount a file system. This can happen after a node failure when the file system is being mounted again after a daemon restart. The file system manager node can also fail with an assert.
- Work around: None
- Problem trigger: A client node failure
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20896
- Problem description: The GPFS daemon (mmfsd) consumes a high CPU load on a quorum node when Windows 2016 is used as the operating system. This is caused by a CCR thread that listens for incoming CCR requests on cached connections from other quorum nodes using the poll system call. The logic does not consider particular flags returned by the poll system call (in detail: POLLHUP, POLLERR, POLLNVAL); see the sketch after this entry. A second GPFS daemon (mmsdrserv) might be affected by this issue. That daemon is running when GPFS has been shut down by the mmshutdown command. This issue does not occur on Linux or AIX.
- Work around: If possible, assign nodes which do not use Windows 2016 as the underlying operating system as quorum nodes, e.g. nodes in the cluster running on Linux or AIX.
- Problem trigger: GPFS startup (mmsdrserv starts automatically, mmfsd after 'mmstartup -a')
- Symptom: -Performance Impact/Degradation -Unresponsiveness
- Platforms affected: Windows 2016 (at least, earlier/later Windows version might be affected too)
- Functional Area affected: CCR and admin commands
- Customer Impact: High
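- Example: A minimal C sketch of the polling pattern involved (illustration only, not the CCR source code): when poll() reports POLLHUP, POLLERR or POLLNVAL on a cached connection, the descriptor must be discarded; otherwise poll() returns immediately on every call and the listening thread spins:
    #include <poll.h>
    #include <unistd.h>

    void service_connections(struct pollfd *fds, int nfds)
    {
        if (poll(fds, nfds, -1) <= 0)     /* wait for activity */
            return;
        for (int i = 0; i < nfds; i++) {
            if (fds[i].revents & (POLLHUP | POLLERR | POLLNVAL)) {
                close(fds[i].fd);         /* drop the dead connection */
                fds[i].fd = -1;           /* a negative fd makes poll() skip this slot */
            } else if (fds[i].revents & POLLIN) {
                /* handle an incoming CCR request ... */
            }
        }
    }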
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20870
- Problem description: The RPO thread, which takes care of creating RPO snapshots for AFM DR filesets, takes locks on all filesets in the file system before it can see which filesets require RPO snapshots. This includes any non-AFM independent/dependent filesets as well.
- Work around: None
- Problem trigger: Having multiple AFM DR Primary filesets with RPO intervals enabled.
- Symptom: Performance Impact/Degradation; Hang/Deadlock/Unresponsiveness/Long Waiters (lower probability)
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM Snapshots
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20898
- Problem description: After migrating a file from GPFS to external storage any indirect blocks used by the file are not freed.
- Work around: None
- Problem trigger: Migration of large files, requiring indirect blocks, to external storage
- Symptom: Metadata disk space is not freed after files are migrated to external storage.
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20899
- Problem description: On AIX, when trying to clear/write the primary GPT area, mmcrnsd does non-4K-aligned writes to 4K disks while trying to preserve the OS PVID, causing a failure.
- Work around: None
- Problem trigger: Creating an NSD from 4 KB sector size native disk(s) on AIX
- Symptom: Error output/message
- Platforms affected: AIX
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20872
- Problem description: An attacker can inject arbitrary commands through the Spectrum Scale GUI or CLI when using mmsmb commands
- Work around: None
- Problem trigger: Command injection
- Symptom: May not be any errors, or you may see Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ20884
- Problem description: An attacker can inject arbitrary commands through the Spectrum Scale GUI or CLI when using mmsmb commands
- Work around: None
- Problem trigger: Command injection
- Symptom: May not be any errors, or you may see Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21109
- Problem description: The -c option of mmrestripefs, issued without the --read-only option, will attempt to fix any conflicting replicas. This may cause corruption if it cannot determine which replica has the latest version, because it fixes the conflict by arbitrarily choosing one of the replicas as the source of the copy.
- Work around: Issue the mmrestripefs command with the --read-only option (see the example after this entry).
- Problem trigger: Restripe, when mismatched replicas occur
- Symptom: Operation failure due to FS corruption
- Platforms affected: All
- Functional Area affected: All Scale users when restripe -c is used
- Customer Impact: Critical: an issue which will cause an application to fail, a silent data corruption or data loss or loss of major capability
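- Example: Using the placeholder file system name fs1, conflicting replicas can be reported without being repaired:
    mmrestripefs fs1 -c --read-only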
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ21071
- Problem description: On RHEL 7 nodes (pre-Linux kernel v3.18), in the GPFS kernel NFS support environment, GPFS may try to acquire a mutex while holding an inode spin lock, which may be detected as a soft lockup by the kernel NMI watchdog.
- Work around: None
- Problem trigger: GPFS violates the spin lock holding policy in the kernel NFS support environment
- Symptom: Performance Impact/CPU stuck
- Platforms affected: All RHEL 7.x
- Functional Area affected: All users of KNFS/CNFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ20150 IJ20179 IJ20180 IJ20306 IJ20308 IJ20309 IJ20812 IJ20839 IJ20840 IJ20842 IJ20859 IJ20862 IJ20863 IJ20864 IJ20865 IJ20870 IJ20872 IJ20882 IJ20884 IJ20893 IJ20896 IJ20898 IJ20899 IJ21069 IJ21071 IJ21108 IJ21109.
Problems fixed in IBM Spectrum Scale 4.2.3.18 [October 3, 2019]
- Item: IJ19126
- Problem description: The GPFS daemon crashed and the file system was temporarily inaccessible on this node
- Work around: None
- Problem trigger: A race between a thread doing background file deletion and a thread attempting to steal a buffer.
- Symptom: Abend/crash
- Platforms affected: all
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19630
- Problem description: When the long waiter threshold is exceeded, automatic debug data collection is triggered to help diagnose the problem. When the value of debugDataControl is "heavy" and tracing is off, debug data collection will capture 20 seconds of trace data. On AIX, due to a problem in the mmtrace command, tracing was not disabled after 20 seconds.
- Work around: None
- Problem trigger: Long waiter threshold is exceeded, debugDataControl is set to "heavy", and tracing is disabled.
- Symptom: Performance Impact/Degradation
- Platforms affected: AIX
- Functional area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19639
- Problem description: The "mmafmctl
getstate" command acquires a lock on all filesets in the file system to print statistics about the AFM enabled filesets, even filesets that are not AFM enabled. - Work around: Do not run the command, "mmafmctl
getstate". - Problem trigger: Running the command, "mmafmctl
getstate" command while other fileset related commands, for example mmdelsnapshot, are active. - Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments.
- Functional area affected: AFM
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19047
- Problem description: GPFS daemon crashed and all the file systems on the node are unmounted.
- Work around: None
- Problem trigger: Trying to find data to commit to the HAWC log.
- Symptom: Abend/crash
- Platforms affected: all
- Functional area affected: gpfs core
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19049
- Problem description: mmchnsd command does not work if there are "to be emptied" disks.
- Work around: Change the disks that are in "to be emptied" state to another state.
- Problem trigger: Executing the mmchnsd command while there are disks in the "to be emptied" state in the file system.
- Symptom: Error output/message Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: Admin commands
- Customer Impact: Suggested: has little or no impact on customer operation
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19240
- Problem description: With Spectrum Scale versions 4.2.3.13 and higher or 5.0.2.2 and higher, running on RHEL 7.6 with kernel version 3.10.0-957.14.1 or higher, a node may encounter an I/O error when accessing a renamed directory. For example, on a node with RHEL 7.6 and kernel version 3.10.0-957.14.1 or higher, execute "cd dir1". On another node in the cluster, rename the dir1 directory, for example "mv dir1 dir2". On the RHEL 7.6 node, dir2 cannot be accessed.
- Work around: On the node, exit from the directory with the old name, for example "cd ..", and access the directory again, for example with "ls -ld". The directory can then be seen and accessed via its new name.
- Problem trigger: This issue affects customers running IBM Spectrum Scale V4.2.3.13 or higher and 5.0.2.2 or higher under the following scenario: access a directory on the RHEL 7.6 node, for example "cd dir1"; rename the directory on another cluster node, for example "mv dir1 dir2"; an I/O error then occurs when accessing the renamed directory on the RHEL 7.6 node.
- Symptom: I/O error
- Platforms affected: All RHEL 7.6 OS environments with kernel version 3.10.0-957.14.1 or higher
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18593
- Problem description: When the DMAPI function dm_getall_disp() is called with bufLen = INT_MAX, GPFS returns a buffer of smaller size due to integer overflow (see the sketch after this entry). This may cause memory corruption when moving data from this small buffer to a user buffer which is larger than the buffer allocated.
- Work around: None
- Problem trigger: DMAPI calls that provide a buffer length of INT_MAX
- Symptom: Node hangs or crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS filesystem
- Customer Impact: Suggested
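- Example: A minimal C illustration of this class of failure (the rounding computation shown is invented for illustration and is not the actual GPFS code):
    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int bufLen = INT_MAX;
        /* Signed overflow (formally undefined in C) typically wraps negative
         * here, so the computed allocation size is far smaller than requested. */
        int rounded = (bufLen + 7) & ~7;
        printf("requested %d, computed %d\n", bufLen, rounded);
        return 0;
    }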
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18595
- Problem description: A race between the thread handling dm_create_session and an mmchmgr command caused a new DMAPI session id not to be sent to the new GPFS configuration manager node. When dm_destroy_session was called to destroy that session id, it failed with the error EINVAL, since the new configuration manager node did not know about that session id.
- Work around: None
- Problem trigger: Running mmchmgr when dm_create_session is in progress.
- Symptom: dm_destroy_session fails with EINVAL error.
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS file systems
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18599
- Problem description: A race between a thread handling node recovery and a thread trying to generate a DMAPI event caused an assert because of the status change of the DMAPI event.
- Work around: None
- Problem trigger: Node failure and threads accessing migrated files.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS filesystem
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18600
- Problem description: When a filesystem is being quiesced (for example for create or delete of a snapshot), a type of RPC labelled "RecLockRetry" can become hung and create a deadlock between nodes. Messages similar to the following can be found in the output of the "mmdiag --waiters" command: RemoteRetryThread: on ThCond, reason 'RPC wait' for RecLockRetry; Msg handler RecLockRetry: in function "RecLockMessageHandler(RecLockRetry)", call to "kxSendFlock"
- Work around: None
- Problem trigger: Applications running on different nodes in the cluster are contending for conflicting advisory (fcntl) locks on the same file and one of these is releasing its lock at a time when the filesystem is being quiesced (for example, create or delete of a snapshot).
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18676
- Problem description: In a file system, a directory is being deleted on one node; from another node in the cluster, some operation generates a synchronous attempt to obtain a conflicting lock on the same directory. This may cause a kernel crash in the pathname lookup procedure on the first node. This is a timing issue and difficult to hit.
- Work around: None
- Problem trigger: A large workload of recursive directory deletion.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18698
- Problem description: In rare cases, unmounting a GPFS file system may cause "kernel BUG at dcache.c:966 - dentry still in use (-128)" on Linux 3.12. The race happens between shrink_dcache_for_umount() and token revoke (or gpfsSwapd).
- Work around: None
- Problem trigger: Unmounting a GPFS file system; the kernel panic is a very rare case.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments with kernel 3.12
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18704
- Problem description: If the clock jumps forward more than 160 seconds, then the internal command "tsctl nQstatus -Y" will return a status of "unresponsive". This will trigger the gpfs_unresponsive event causing CES IPs to failover to other nodes.
- Work around: None. The transient "unresponsive" state self-corrects within a short period of time (10 seconds).
- Problem trigger: System clock jumps backward or forward by more than 160 seconds.
- Symptom: Unexpected Results/Behavior.
- Platforms affected: ALL Operating System environments
- Functional Area affected: GPFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18686
- Problem description: Deadlock on SGManagementMgrDataMutex could occur during buffer steal.
- Work around: None
- Problem trigger: Buffer steal triggered due to running low on free buffers or a change in token manager assignment.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18687
- Problem description: GPFS daemon assert: Assert exp(i < nServers). This can happen when the number of manager nodes in the cluster is larger than the maxTokenServers configuration setting, which defaults to 128.
- Work around: Either reduce the number of manager nodes or increase the maxTokenServers setting (see the example after this entry).
- Problem trigger: The number of manager nodes exceeds the maxTokenServers setting.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
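- Example: Using a placeholder value of 256, the limit can be raised with the following command (the change may require a daemon restart to take effect):
    mmchconfig maxTokenServers=256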
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18707
- Problem description: The mmdf command hangs with a long waiter waiting for free space recovery, causing other commands that conflict with mmdf to be blocked.
- Work around: None
- Problem trigger: Run mmdf command on FPO file system while there are I/O workloads in progress.
- Symptom: mmdf hang.
- Platforms affected: ALL Operating System environments except AIX and Windows
- Functional Area affected: FPO
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18710
- Problem description: tslspool34 is called by the GPFSPool zimon sensor, which may cause cluster-wide token contention on the root directory of a file system
- Work around: None
- Problem trigger: tslspool34 is called by the GPFSPool zimon sensor, which runs once every 5 minutes. As of 5.0.3, the zimon sensor runs on all nodes, which causes significant token contention. Since release 5.0.3.1 the zimon sensor runs only on a restricted set of nodes
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: perfmon (Zimon)
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18690
- Problem description: An FSSTRUCT error could be issued during file creation if the file system runs out of disk space for metadata.
- Work around: None
- Problem trigger: Running out of metadata disk space
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18712
- Problem description: When IO is run against AFM enabled filesets, the IO requests from the application nodes are queued to the gateway node by going through all the nodes in the cluster (including remote cluster nodes that mount the local file system) to find the single gateway node for the fileset. On clusters with a huge number of remote cluster mounted nodes, this causes considerable application performance degradation.
- Work around: None
- Problem trigger: 1) Have a large number (thousands) of remote cluster nodes mounting the file system from the owning cluster. 2) Run IO to the same AFM replication fileset from multiple client nodes in the cluster. 3) Compare the improvements by performing the same test with and without the patch.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (AFM gateway nodes). All Linux and AIX environments (Application nodes running IO to the AFM fileset).
- Functional area affected: AFM - NFS and GPFS backend filesets, with afmHashVersions 2 and 5 and the afmFastHashVersion tunable turned on.
- Customer Impact: High Importance.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18694
- Problem description: When mmfsck is run on a file system having corruption in the inode allocation map the file system manager node of the file system being scanned can assert with - logAssertFailed: !"SeverityNone for FSTATUS_UNFIXED"
- Work around: Disable the fsck patch queue feature on all nodes using this command, "mmdsh -N all mmfsadm test fsck usePatchQueue 0", then rerun the mmfsck command.
- Problem trigger: This issue affects customers running mmfsck on IBM Spectrum Scale V4.2.3 or higher where a reserved inode is incorrectly marked free in the inode allocation map.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18714
- Problem description: When user identifier (UID) remapping is enabled IO performance is reduced due to the incorrect caching of the supplementary group identifiers (gids).
- Work around: None
- Problem trigger: Remote cluster mount with UID remapping enabled
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Operating System environments
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18701
- Problem description: AFM waits only 8 seconds for the mount of a remote file system to complete. This can be insufficient in environments with high latency networks.
- Work around: AFM supports a tunable, mountWaitTimeout, that defines the amount of time AFM waits for a remote mount to complete. This tunable can be changed with the "mmfsadm afm mountWaitTimeout" command; however, the setting cannot be seen via the mmlsconfig command. A new global tunable, afmMountWaitTimeout, has been added which can be set with the mmchconfig command and seen with the mmlsconfig command (see the example after this entry).
- Problem trigger: An AFM mount of a remote file system via NFS over a high-latency network which delays the mount for more than 8 seconds.
- Symptom: Network Performance.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM caching and AFM DR
- Customer Impact: Suggested.
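- Example: Using a placeholder value of 60 seconds, the new global tunable can be set and verified with:
    mmchconfig afmMountWaitTimeout=60
    mmlsconfig afmMountWaitTimeout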
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18705
- Problem description: Unexpected GPFS daemon assert during file, directory, or symlink create operation.
- Work around: None
- Problem trigger: File system configuration changes such as enable/disable encryption
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18743
- Problem description: mmdeldisk or mmdf commands hang and wait on free space recovery. This happens when a node doesn't relinquish all the block allocation regions it owned during the process of unmounting a file system.
- Work around: Restart Spectrum Scale service on file system manager node.
- Problem trigger: Unmount the file system from a node.
- Symptom: File system or file operations hang.
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18711
- Problem description: Deadlock on an AFM cluster while changing the gateway node attribute using the mmchnode --gateway/--nogateway command.
- Work around: None
- Problem trigger: Gateway node change using the mmchnode command.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18713
- Problem description: Advisory locks are recorded in the Linux kernel on the local node via file_lock structures, and GPFS maintains an additional structure to accomplish locking across nodes. There are times when a blocked lock waiter is reset by GPFS during the daemon cleanup process; the inode object is not freed and is left in the slab cache. Later, GPFS may access the stale inode structure data, which causes a kernel crash.
- Work around: None
- Problem trigger: A large fcntl locking workload and daemon cleanup process.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18747
- Problem description: GPFS asserts during mmcheckquota command when it encounters invalid fileset ids in the quota file.
- Work around: None
- Problem trigger: Invalid fileset ids, likely originating from deleted files, were erroneously inserted into the quota file causing an assertion in the mmcheckquota command.
- Symptom: GPFS terminates during mmcheckquota command due to assertion.
- Platforms affected: ALL
- Functional Area affected: Quota
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18726
- Problem description: There is a minor performance degradation in the queuing of AFM IO requests from the application node to the gateway node due to an inefficient algorithm for identifying the correct AFM gateway node.
- Work around: There is no serious impact without the fix, only slower AFM IO performance.
- Problem trigger: Performing IO to an AFM fileset from an application node.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux and AIX OS environments (AIX as application nodes only)
- Functional Area affected: AFM caching and AFM DR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18728
- Problem description: The following assert occurs after changing cipherList to AUTHONLY without restarting daemon: logAssertFailed: secSendCoalBuf != __null && secSendCoalBufLen > 0
- Work around: None
- Problem trigger: cipherList is changed from a supported algorithm to AUTHONLY without restarting daemon and reconnect is attempted.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18518
- Problem description: Security hardening for the 'ts' commands in /usr/lpp/mmfs/bin/ .
- Work around: Remove the setuid from the files in the /usr/lpp/mmfs/bin directory.
- Problem trigger: Executing commands with certain undocumented input.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: admin commands
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18737
- Problem description: The mmdiag --afm command, executed to fill in the data for AFM fileset status, caused an assert because the command attempted to get information for unlinked filesets.
- Work around: None
- Problem trigger: Deleting the fileset.
- Symptom: Log assert is generated.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18739
- Problem description: Running "/usr/lpp/mmfs/bin/mmfsadm vfsstats" hits a diffPerCpuCounters seg fault due to NULL pointer dereference.
- Work around: None
- Problem trigger: Running "/usr/lpp/mmfs/bin/mmfsadm vfsstats".
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18757
- Problem description: On a file system without replication, it is possible for the file system to panic with error 218 without additional information to help identify the disk that caused the error.
- Work around: None
- Problem trigger: Disk IO error
- Symptom: Cluster/File System Outage
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18742
- Problem description: The existing auto-recovery code does not handle descriptor-only disks correctly and treats them as disks holding user data or metadata.
- Work around: Disable auto-recovery.
- Problem trigger: If auto-recovery is enabled and descOnly disks are configured in the cluster.
- Symptom: If auto-recovery is enabled and descOnly disks are configured in the cluster, when a node fails, auto-recovery will treat descOnly disks as data disks and might cause data replication downgrade.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: FPO
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18744
- Problem description: The NFS monitor checks the health state of a running NFS instance periodically. Sometimes the NFS service does not react to some "alive" check commands, which is interpreted as a potential "hung" state. Based on the configuration in the mmsysmonitor.conf file, either a failover or just a warning is then triggered.
- Work around: The behavior on a detected potential "hung" state can be customized with the flag 'failoverunresponsivenfs' in the mmsysmonitor.conf file, section [nfs] (see the sketch after this entry). The meaning of the flag value is: "true" = set an ERROR event (nfs_not_active) if NFS does not respond to NULL requests and has no measurable NFS operation activity; "false" = set a DEGRADED event (nfs_unresponsive) if NFS does not respond to NULL requests and has no measurable NFS operation activity.
- Problem trigger: In some cases, high I/O load leads to a situation where NFS v3 and/or v4 NULL requests fail and a following internal statistics check reports no activity with respect to the number of internal NFS operations. These checks are done within a time span of several seconds to a minute. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover would not be advised in this case. The monitor interprets the unresponsiveness as a potential "hung" state and triggers either a failover or a warning, depending on the configuration settings.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: High
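- Example: To prefer a warning over a failover, the [nfs] section of mmsysmonitor.conf would contain the following (key and values as given in the description above):
    [nfs]
    failoverunresponsivenfs = false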
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18745
- Problem description: The following error message appears in mmfs.log: "Unexpected data in message. Header dump: XXXXXXXX XXXX", and the daemon may crash because LOGSHUTDOWN is called
- Work around: None
- Problem trigger: A bad network and a reconnect is attempted.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18761
- Problem description: Accessing the .snapshots snaplink directory generates an I/O error while creating or deleting snapshots for the same file system or fileset.
- Work around: Stop the process accessing the .snapshots directory after getting I/O error, then retry the access to it again.
- Problem trigger: This problem could be triggered by snapshot create and deletion operations.
- Symptom: I/O error
- Platforms affected: All Linux OS environments with kernel versions between 3.10.0-957.21.2 and 4.x.
- Functional Area affected: Snapshots
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18746
- Problem description: Restriping compressed files can hit an assert on the "wa" lock mode. This problem can only happen during restriping of compressed files while those files are being truncated.
- Work around: Rerun the restripe command
- Problem trigger: Truncating compressed files while restripe is in progress.
- Symptom: Abend on restripe process
- Platforms affected: All
- Functional Area affected: File compression and restripe
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19132
- Problem description: Online fsck may report false positive lost blocks when started during file create/append workloads, and repairing such false positive lost blocks may result in data loss.
- Work around: Use offline fsck to repair lost blocks or to undo any false positive repairs done by online fsck.
- Problem trigger: Online fsck is run during file create/append workloads.
- Symptom: Operation failure due to FS corruption
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19629
- Problem description: Linux kernel crash
- Work around: None
- Problem trigger: The issue was with advisory locking via the fcntl() call.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19727
- Problem Description: Waiters: ccMsgGroupLeave: I/O timeout on read: I/O error on read:
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ19781
- Problem Description: Change LOGASSERT back to DEVASSERT: "openlog.C:91"
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ18518 IJ18593 IJ18595 IJ18599 IJ18600 IJ18612 IJ18676 IJ18686 IJ18687 IJ18690 IJ18694 IJ18698 IJ18701 IJ18704 IJ18705 IJ18707 IJ18710 IJ18711 IJ18712 IJ18713 IJ18714 IJ18726 IJ18728 IJ18737 IJ18739 IJ18742 IJ18743 IJ18744 IJ18745 IJ18746 IJ18747 IJ18757 IJ18761 IJ19047 IJ19049 IJ19126 IJ19132 IJ19240 IJ19629 IJ19630 IJ19639 IJ19727 IJ19781.
Problems fixed in IBM Spectrum Scale 4.2.3.17 [August 15, 2019]
- Problem description: Kernel may crash when retrieving file attributes
- Work Around: None
- Problem trigger: On a file system cluster, a file is locked by an application on one node; from another node in the cluster, some operation makes a synchronous attempt to obtain a conflicting lock on the same file. This is a timing issue and difficult to hit.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ17981
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: There will be a long waiter like the following: Waiting 8349.1305 sec since 00:03:05, monitored, thread 133060 AcquireBRTHandlerThread: on ThCond 0x3FFE74012E78 (MsgRecordCondvar), reason 'RPC wait' for tmMsgBRRevoke on node 192.168.117.82
- Work Around: None
- Problem trigger: A race condition between handling an inbound connection and a node joining.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ18007
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Advisory locks are recorded in the Linux kernel on the local node via file_lock structures, and GPFS maintains an additional structure to accomplish locking across nodes. There are times when the inode object has been freed at the same time a blocked lock waiter is resumed by GPFS; GPFS then tries to free the file_lock along with the GPFS structure and accesses the obsolete inode structure data, which causes a kernel crash.
- Work Around: None
- Problem trigger: A large fcntl locking workload and lock contention.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ18019
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlsfileset command with the "-i -d" options could run into an infinite loop when there is not enough free memory and indirect block descriptors in the system. In addition, a similar loop issue could happen during mmrestripefs, snapshot deletion and ACL garbage collection.
- Work Around: Increase maxFilesToCache to allow more indirect block descriptors in the cache (see the example after this entry). Also make sure there is enough free physical memory in the system.
- Problem trigger: Running mmlsfileset -i -d, snapshot delete or mmrestripefs commands, or enabling ACLs, when there is not enough free physical memory in the system and maxFilesToCache has a default or low value.
- Symptom: The mmlsfileset, snapshot delete and mmrestripefs commands hang, and other mm* commands cannot proceed either. The background ACL garbage collection thread runs in a loop if ACLs are enabled.
- Platforms affected: All
- Functional Area affected: mmlsfileset, mmrestripefs, snapshot delete commands and ACL garbage collection process.
- Customer Impact: Critical IJ17943
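- Example: Using a placeholder value of 200000 (maxFilesToCache changes typically require a GPFS restart on the affected nodes to take effect):
    mmchconfig maxFilesToCache=200000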
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File system unmounted when an application overwrites data blocks
- Work Around: None
- Problem trigger: Overwriting a data block followed by a disk going down in the file system.
- Symptom: File system unmounted
- Platforms affected: All
- Functional Area affected: gpfs core
- Customer Impact: High Importance IJ17945
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: getxattr for the 'security' namespace is not properly blocked during quiesce, which may cause the assert "SGNotQuiesced"
- Work Around: None
- Problem trigger: When a file system is quiesced (for example when running mmcrsnapshot/mmdelsnapshot), all vfs operations should be blocked. If there are applications that access a file's 'security' namespace extended attributes (for example the 'getcap' command), that getxattr vfs operation is not properly blocked and may cause the assert "SGNotQuiesced"
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: High Importance IJ17947
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak when the gateway node joins the cluster. Reply data is not freed after obtaining the lead gateway node. Lead gateway functionality is no longer used.
- Work Around: None
- Problem trigger: Gateway node joining the cluster.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ18021
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak when the gateway node is not yet ready to handle the requests when the node designation is changed
- Work Around: None
- Problem trigger: Gateway node joining the cluster.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ18022
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: After a reboot of a node, the systemhealth NFS monitoring was started, but not the SMB component and its monitoring. AD authentication was configured for NFS, which depends on a running SMB component. This constellation yielded a "winbind-down" event but gave no hint about the root cause
- Work Around: mmshutdown followed by mmstartup might help, since the entire stack (including SMB/NFS and their monitors) is restarted. The log level could be increased during the startup and check phase (mmces log level 3) to get more details in the mmfs.log file. For production, this log level should be lowered (to 0 or 1).
- Problem trigger: The circumstances which lead to the detected mismatch were not repeatable. This seems to be a rare race situation and was not reported before.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High Importance IJ18026
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A primary fileset might run out of inode space if a large number of files are created/deleted.
- Work Around: None
- Problem trigger: Inode space might be exhausted.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM DR
- Customer Impact: IJ17948
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM prefetch of small files has a performance issue because the file is flushed to disk without closing the open instance. This prevents the file from being shrunk to fit into subblocks, and a full block of data is transferred to the NSD server.
- Work Around: None
- Problem trigger: AFM prefetch
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ18028
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A stripe group / file system manager panic occurs while another node (non-SGmgr) is accessing files in a snapshot. These accesses can be part of the snapshot deletion itself, or another maintenance command (such as mmdeldisk or mmrestripefs), or even ordinary user accesses from the kernel. The diagnostic error reported in the log on the stripe group (SG) manager node looks like this, though the line number may vary: 2019-05-06_23:23:22.122-0300: [X] File System fs1 unmounted by the system with return code 2 reason code 0, at line 4646 in /afs/apd.pok.ibm.com/u/gpfsbld/buildh/ttn423ptf13/src/avs/fs/mmfs/ts/fs/llio.C The "unmount in llio.C" message is usually followed by a message mentioning "Reason: SGPanic", but this does not always occur, and a SGPanic can be caused by other unrelated problems. The error is triggered by a snapshot listed as DeleteRequired by mmlssnapshot. The snapshot access that causes the error, however, will be to an earlier snapshot (with smaller snapId); though it may be difficult to determine which access or which node caused the panic. Further, at least one snapshot must be a fileset snapshot (file systems with only global snapshots, are not affected). The specific enabling factors, however, are complicated and quite rare for most customers, so this is not a common problem.
- Work Around: The work-around is to remove DeleteRequired snapshots with an mmdelsnapshot command with an explicit -N argument listing only the SG manager node (see the example after this entry).
- Problem trigger: See the problem description above.
- Symptom: Cluster/File System Outage
- Platforms affected: ALL
- Functional Area affected: Snapshots
- Customer Impact: Suggested: has little or no impact on customer operation IJ17950
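- Example: Using placeholder names fs1 and snap1, first identify the SG manager node with mmlsmgr, then delete the snapshot only on that node:
    mmlsmgr fs1
    mmdelsnapshot fs1 snap1 -N <SG manager node>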
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Because of a race between the handling of memory mapping and reading of the same file, a normal read from the last block of the mapped file returned wrong data.
- Work Around: None
- Problem trigger: Multiple processes reading last block of a file and memory mapping same file.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High IJ17956
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When running I/O with NFS, unexpected failovers occurred without an obvious reason. NFS is reported as 'not active' even though it is still working
- Work Around: No work around available. There is a manual way to temporarily modify the event declaration for the observed "nfs_not_active" event by modifying the event action in the event configuration file (ask L2 for support).
- Problem trigger: In the reported cases, some high I/O load led to the situation that NFS v3 and/or v4 (whatever is configured) NULL requests failed, and that a following internal statistics check reported no activity regarding the number of internal NFS operations. The monitor interpreted this as a "hung" state and triggered a failover. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover is not advised in this case. However, at the time of monitoring there was no further indication available.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: High IJ17951
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Customers cannot create an SMB export under specific conditions.
- Work Around: Choose GPFS file system names such that no file system name is a substring of any other
- Symptom: The cluster is limited to a special naming setup for its GPFS file systems
- Platforms affected: ALL Linux OS environments
- Functional Area affected: SMB
- Customer Impact: Suggested IJ18006
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM is unable to prefetch the data if the file metadata has changed. For example, if the user changes the metadata (e.g., via chmod) on an uncached file, prefetch skips reading the file.
- Work Around: Read the file manually without the prefetch
- Problem trigger: AFM prefetch
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: AFM
- Customer Impact: High IJ17954
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A filesystem containing a dot in its name was declared as to be ignored by declaring a file /var/mmfs/etc/ignoreAnyMount.<filesystem>. However, the systemhealth monitor treated it as a missing filesystem.
- Work Around: No work around available. Filesystems could be named with an underscore instead of a dot, if a separator is wanted
- Problem trigger: A filename such as /var/mmfs/etc/ignoreAnyMount.filesystemWith.dot is split internally at the dots, so that it results in three items (which is not wanted): /var/mmfs/etc/ignoreAnyMount, filesystemWith, dot
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: little impact on customer operation IJ17955
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Kernel crash when home is not reachable and the AFM fileset is going to the Unmounted state. There is a race between one thread freeing vfsP and another thread dereferencing vfsP while replaying an operation to home from the AFM queue.
- Work Around: None
- Problem trigger: Home disconnect and one operation in the queue.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: IJ18013
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When debugging is enabled for mmbackup, tsqosnice is called to query QOS and then tsqosnice may terminate with a stack smashing error.
- Work Around: Do not use mmbackup debugging or remove the call to tsqosnice from the mmbackup script.
- Problem trigger: See problem description.
- Symptom: Stack smashing error message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: QOS, mmbackup
- Customer Impact: Suggested IJ18010
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A file was mapped with shmat and an attempt was made to read past EOF. This caused a page fault. The GPFS page fault handler generated a kernel panic when it found that the buffer it was trying to transfer to the user buffer was not valid, crashing the node with the following error: kernel panic, assert !lcl._wasMapped
- Work Around: None
- Problem trigger: Reading last block of a file mapped with shmat
- Symptom: Kernel panic
- Platforms affected: AIX
- Functional Area affected: ALL
- Customer Impact: High IJ18001
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Node crashed due to checking the wrong lock status of a mapped file
- Work Around: None
- Problem trigger: Reading mapped file.
- Symptom: Node crash in CXISYSTEM.C
- Platforms affected: All
- Functional Area affected: ALL
- Customer Impact: High IJ18002
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The gateway node gets a remote assert while reading data from home, as the read does not block the filesystem quiesce
- Work Around: None
- Problem trigger: Read operation of the uncached files on the AFM caching filesets.
- Symptom: Abend/Crash
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical IJ18209
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ17943 IJ17945 IJ17947 IJ17948 IJ17950 IJ17951 IJ17954 IJ17955 IJ17956 IJ17981 IJ18001 IJ18002 IJ18006 IJ18007 IJ18010 IJ18013 IJ18019 IJ18021 IJ18022 IJ18026 IJ18028 IJ18209.
Problems fixed in IBM Spectrum Scale 4.2.3.16 [June 27, 2019]
- Problem description: RPCs sent via RDMA are pending forever in the 'sending' state. Long waiters with verbs RDMA like the following: Waiting 2273.0813 sec since 11:05:04, monitored, thread 113229 BackgroundSyncThread: for RDMA send completion fast on node 192.168.1.1
- Work around: None
- Problem trigger: Reply lost on RDMA network
- Symptom: Hang
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High IJ16484
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: "mmuserauth service create" command failed due to TCP port 445 being blocked. However, error message indicated incorrect credentials which was not the correct reason for failure.
- Work around: None
- Problem trigger: The issue is seen at the time of configuring Authentication, in those setups where TCP Port 445 is blocked. The command internally tries to connect to the DC specified via the Port. Due to blocked port, it fails to connect with a timeout. However, the error message shown currently indicates of incorrect credentials which is not the case.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Authentication
- Customer Impact: Suggested: has little or no impact on customer operation IJ16507
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An FSErrInodeCorrupted FSSTRUCT error could be written to the system log as a result of a stale buffer for a directory block.
- Work around: None
- Problem trigger: A change in the token manager list as a result of either a node failure or a change in the number of manager nodes.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ16531
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When mmrestripefs fails due to an I/O error caused by node failures, there is a chance that the assert is triggered.
- Work around: None
- Problem trigger: I/O errors caused by node failures while running mmrestripefs.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ16533
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An FSErrCheckHeaderFailed error could be incorrectly issued and logged in the system log.
- Work around: None
- Problem trigger: A user application moves files out of a directory before deleting the directory.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ16478
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert exp(totalLen <= extensionLen) in line 16424 of file /project/sprelttn423/build/rttn423s008a/src/avs/fs/mmfs/ts/nsd/nsdServer.C
- Work around: None
- Problem trigger: This issue affects customers running IBM Spectrum Scale 4.2.3 and later if the following conditions are true: 1) a mixed-endianness cluster, or mixed-endianness remote clusters; 2) RDMA enabled (and an NSD client may send NSD requests to an NSD server which has a different endianness); 3) the NSD client or NSD server is IBM Spectrum Scale 4.2.3. It is a rare assert which may happen when the client sends the first NSD request to an NSD server which has a different endianness.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: Suggested IJ16481
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfsadm dump command could run into an infinite loop when dumping the token objects.
- Work around: Avoid running the mmfsadm dump command.
- Problem trigger: Running the mmfsadm dump command while there are workloads running in the cluster.
- Symptom: mmfsadm dump command hang.
- Platforms affected: ALL Operating System environments except Windows
- Functional Area affected: mmfsadm dump command
- Customer Impact: Suggested IJ16479
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If the file system was formatted with narrow disk addresses (version 2.2 or older) and the GPFS version is 4.2.3 or 5.0.x, a GPFS daemon assert can happen randomly.
- Work around: None
- Problem trigger: Application I/O to a narrow-disk-address file system using GPFS version 4.2.3 or 5.0.x.
- Symptom: Crash, such as assert subblocksPerFileBlock==(1<<(tinodeP->getFblockSize()))
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ16568
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmrepquota -q and -t option command usage is ambiguous. Options -q and -t should not be used in combination with Device:Fileset because they are file system attributes.
- Work around: None
- Problem trigger: The current mmrepquota command usage allows invoking -q option as follows: mmrepquota -q Device:fileset
- Symptom: mmrepquota -q Device:fileset reports file system default quota information rather than per-fileset quota information.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: Suggested: has little or no impact on customer operation IJ16483
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The arping command is used by the NFS failover mechanism but was not found on the system. It was installed, but the log files show a "No such file or directory" message, which indicates that the arping command was not found at the expected path.
- Work around: It may help to create a symbolic link from the arping command to "/usr/bin/arping", which is the default path used when the distribution cannot be properly detected (see the example after this entry). Note that using links is generally not advised, since they can be a security issue.
- Problem trigger: The circumstances that lead to the issue are not fully understood. Most likely the OS detection using the /etc/redhat-release file did not work, so the wrong distribution was assumed, which led to a wrong expected path for the arping command, and the command was then not found. This older CentOS version does not yet provide the /etc/os-release file supplied by newer distributions, which the detection code meanwhile also uses.
- Symptom: Error output/message
- Platforms affected: All CentOS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: has little or no impact on customer operation IJ16487
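- Example: An illustrative application of the symbolic-link workaround; the /usr/sbin/arping source path is an assumption, so verify the actual location with the first command before linking:
  command -v arping
  ln -s /usr/sbin/arping /usr/bin/arping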
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An FSErrInodeCorrupted FSSTRUCT error could be issued incorrectly during lookup when both a directory and its parent directory are being deleted.
- Work around: None
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ16485
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: During file system manager takeover, the new manager broadcasts to all mounting nodes to invalidate their cached low-level file metadata. If a low-level file is being opened on a mounting node at the same time, the two operations can race, causing logAssertFailed "ibdP->llfileP == this" or logAssertFailed "inode.indirectionLevel >= 1".
- Work around: One customer reported hitting this problem while running mmdelsnapshot. For the mmdelsnapshot scenario, deleting the oldest snapshot first greatly reduces the risk.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ16486
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS admin commands may cause high CPU usage. This is because remote GPFS command calls invoke the find command to clean up temporary files, which is expensive on systems with a large number of subdirectories and files under /var/mmfs/tmp.
- Work around: Manually clean up /var/mmfs/tmp to reduce the number of subdirectories and files, and kill running find processes that were invoked from /usr/lpp/mmfs/mmremote processes (see the example after this entry).
- Problem trigger: Nodes with a large number of subdirectories and files under /var/mmfs/tmp are most likely to be affected.
- Symptom: Performance Impact/Degradation Hang
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ16426
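- Example: A hedged sketch of the manual cleanup; the seven-day retention and the pkill pattern are assumptions rather than values from this entry, so review what is in flight before deleting or killing anything:
  find /var/mmfs/tmp -mindepth 1 -mtime +7 -delete
  pkill -f 'find /var/mmfs/tmp'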
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM does not keep the directory mtime in sync while reading directory contents from home. This may be a problem for some users during migration.
- Work around: None
- Problem trigger: AFM migration/prefetch or cache readdir/lookup
- Symptom: Unexpected results
- Platforms affected: All Linux OS
- Functional Area affected: AFM
- Customer Impact: Critical IJ16488
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: In the mmcrfileset command, if the requested maxInodes value is reduced because it exceeds the file system limit, there is no appropriate attention message.
- Work around: None
- Problem trigger: The mmcrfileset command requests a maxInodes value that exceeds the file system inode capacity.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ16489
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Some GPFS commands don't work correctly if the cluster name contains special characters.
- Work around: Change the name of the cluster so that it does not contain any special characters (see the example after this entry).
- Problem trigger: A cluster name with a special character, such as the ampersand "&", causes commands like "mmauth show ." to fail.
- Symptom: GPFS admin command errors. Error output/message Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: admin commands
- Customer Impact: low IJ16620
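- Example: An illustrative rename using mmchcluster; the new name is a placeholder, and changing the cluster name typically requires GPFS to be stopped on all nodes first:
  mmchcluster -C newClusterName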
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmlslicense --capacity fails to report the correct disk size
- Work around: Manually get the disk size with the blockdev command (see the example after this entry).
- Problem trigger: Underlying device names are not found on all NSD servers
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Admin Commands
- Customer Impact: Suggested: has little or no impact on customer operation IJ16614
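- Example: An illustrative manual size query on an NSD server; the device name is a placeholder:
  blockdev --getsize64 /dev/sdX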
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfs.log file may contain an entry like this: "[E] sdrServ: Communication error on socket /var/mmfs/mmsysmon/mmsysmonitor.socket, [err 79] Can not access a needed shared library" The "Can not access a needed shared library" message is wrong in this context, because the actual problem was a network connection issue.
- Work around: N/A. The reported error code "79" is used internally and means "connection refused".
- Problem trigger: No recreate procedure is available for the reported issue. The underlying issue was that GPFS internal error codes were not mapped to Linux system error codes, which produced the wrong message text when the corresponding system message text was printed for such a code.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: System Health
- Customer Impact: has little or no impact on customer operation IJ16706
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When creating DMAPI session there is a small window where memory is getting corrupted causing GPFS daemon crash with sig 11.
- Work around: None
- Problem trigger: Creating lots of DMAPI sessions with heavy workload
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16719
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A race condition may cause mmperfmon sensor updates to fail with the following message: "fput failed: Invalid version on put (err 807)". Other commands can fail with the same message as well.
- Work around: Rerun the failed command.
- Problem trigger: The problem is hit more often when using the spectrumscale command to install.
- Symptom: Error output/message "fput failed: Invalid version on put (err 807)" Upgrade/Install failure
- Platforms affected: ALL Operating System environments but more often on Linux nodes in CCR environment.
- Functional Area affected: Admin Commands
- Customer Impact: Suggested: has little or no impact on customer operation IJ16722
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Trying to delete an immutable file through SMB fails after the retention period expires. The problem is that Samba, as the SMB server, denies deletion when the READONLY flag is set.
- Work around: None
- Problem trigger: A Windows SMB client is trying to delete an immutable file after the retention period expires.
- Symptom: Error output/message
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ16702
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon crash when an application writes data into the file system
- Work around: None
- Problem trigger: A memory failure for newBuffer on a busy system.
- Symptom: Crash
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ16612
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A node in the home cluster hit the following assertion when a remote node joins the cluster: 2019-04-16_14:55:37.346+0200: [X] logAssertFailed: (nodesPP[nidx] == NULL || nodesPP[nidx] == niP)
- Work around: None
- Problem trigger: remote node joins and leaves the cluster
- Symptom: Crash/Abend
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High IJ16613
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The session info length is not checked when creating a DMAPI session; it is supposed to be less than or equal to 256 bytes. Per the DMAPI standard, the call needs to return errno E2BIG. Instead, GPFS truncates the length to 256 bytes and proceeds with the session creation.
- Work around: None
- Problem trigger: Creating a DMAPI session with a very long session info string
- Symptom: None
- Platforms affected: ALL
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16704
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: dm_getall_disp does not fail with err 22 when called with a bad sessionId
- Work around: None
- Problem trigger: When dm_getall_disp is called with a bad sessionId
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16705
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: There are many unacknowledged replies in the tscomm section of an internal dump.
- Work around: No
- Problem trigger: Message sequence number overflow
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: All Scale Users
- Customer Impact: High IJ17006
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A node in the cluster could hit the following assertion: logAssertFailed: tmCommVersion != 0
- Work around: None
- Problem trigger: Node failure happens when there are some token requests queued
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All Scale Users
- Customer Impact: High IJ17051
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An RPC message was reported as lost, like below: Message ID 15103094 was lost by node ip_address node_name wasLost 1
- Work around: No
- Problem trigger: An unhealthy network leads to reconnects
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: All Scale Users
- Customer Impact: High IJ17052
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A file in an AFM fileset is not fully cached when read using memory-mapped I/O, causing a data mismatch. This happens due to incorrect context checking.
- Work around: Read the file without memory-mapped I/O.
- Problem trigger: When an uncached file is read using memory-mapped I/O on an AFM fileset.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: AFM
- Customer Impact: HiPER IJ17053
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Intermittent hardware errors could occasionally prevent a pdisk from being properly identified, resulting in not all I/O paths being found or in pdisks being declared "missing". In all cases, the mmlspdisk command would show the pdisk with a WWN beginning with "md5."
- Work around: Stop and restart GPFS on the I/O node.
- Problem trigger: Transient hardware errors affecting SCSI Inquiry commands.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ppc64-linux x86_64-linux ppc64le-linux AIX
- Functional Area affected: ESS/GNR
- Customer Impact: High Importance IJ16988
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On an RHEL 7.6 node, with supported GPFS versions 4.2.3.13 or higher or 5.0.2.2 or higher, when the kernel is upgraded to version 3.10.0-957.19.1 or 3.10.0-957.21.2 (after applying RHBA-2019:1337) or higher, the node may encounter a kernel crash while running I/O operations.
- Work around: Disable SELinux (see the example after this entry)
- Problem trigger: An inconsistency between the GPFS kernel portability layer and the kernel level
- Symptom: Abend/Crash
- Platforms affected: RHEL7.6 with kernel 3.10.0-957.19.1 or higher
- Functional Area affected: All
- Customer Impact: High Importance IJ16794
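- Example: An illustrative way to disable SELinux; 'setenforce 0' only switches to permissive mode at runtime, while the config edit makes the change persistent after a reboot:
  setenforce 0
  sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config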
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When UID remapping is enabled, daemon asserts or kernel crashes occur on nodes in the client cluster. This happens when the remapping scripts do not remap any credentials or when enableStatUIDremap is not enabled.
- Work around: 1. For the daemon assert, correct the remap scripts so that they remap the credentials. 2. For the kernel crash, enable the enableStatUIDremap config option.
- Problem trigger: UID remapping with an incorrect mmname2uid script, and file metadata modification when enableStatUIDremap is not enabled.
- Symptom: Abend/crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High Importance IJ16985
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Trying to clear the READONLY attribute of an immutable file through SMB incorrectly succeeded within the retention period.
- Work around: None
- Problem trigger: A Windows SMB client is trying to clear the READONLY attribute on an immutable file that has not expired.
- Symptom: Error output/message
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ16426
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ16426 IJ16478 IJ16479 IJ16481 IJ16483 IJ16484 IJ16485 IJ16486 IJ16487 IJ16488 IJ16489 IJ16507 IJ16531 IJ16533 IJ16568 IJ16612 IJ16613 IJ16614 IJ16620 IJ16702 IJ16704 IJ16705 IJ16706 IJ16719 IJ16722 IJ16794 IJ16985 IJ16988 IJ17006 IJ17051 IJ17052 IJ17053.
Problems fixed in IBM Spectrum Scale 4.2.3.15 [May 9, 2019]
- Problem description: The GPFS daemon on the cluster manager node may assert when an excessive timer drift occurs between the cluster manager node and other cluster nodes.
- Work Around: None
- Problem trigger: On a file system cluster, a large difference in the rate of clock ticks between a cluster node and the cluster manager node is detected by the cluster manager.
- Symptom: Daemon Assert
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15276
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The ps command hung because of a mmap locking-order issue between OS and GPFS mmap operations.
- Work Around: None
- Problem trigger: Writing to a file when the write destinations are in a memory-mapped file
- Symptom: Node hang
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15464
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS stays in arbitrating state after node reboot
- Work Around: Restart GPFS on the nodes where GPFS is stuck in the arbitrating state (see the example after this entry)
- Problem trigger: Rebooting a large number of nodes may result in GPFS staying in the arbitrating state on some of the nodes, due to the nodes having trouble connecting to the cluster manager.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ15272
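- Example: An illustrative restart of GPFS on an affected node; the node name is a placeholder:
  mmshutdown -N nodeName
  mmstartup -N nodeName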
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Daemon crashes while restriping the file system.
- Work Around: None
- Problem trigger: Some internal locks were not released in an error condition, which resulted in a failed internal check and a daemon crash.
- Symptom: Daemon crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Medium IJ15347
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Occasionally a snapshot problem makes it impossible to delete a snapshot due to persistent errors, such as "FSSTRUCT ...116" in the system log. The delete snapshot command may fail with exit code 5 and the failure noted in the GPFS log file is error 214, such as "[E] Command: err 214: mmdelsnapshot ..."
- Work Around: None
- Problem trigger: N/A
- Symptom: Operation failure due to FS corruption
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: Suggested: has little or no impact on customer operation IJ15273
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Unexpected file system unmount due to SGPanic with return code 301.
- Work Around: None
- Problem trigger: Node failure such as file system panic, expel, gpfs daemon assert.
- Symptom: Cluster/File System Outage
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ15315
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File sizes might mismatch between cache and home, or between primary and secondary, if the remote cluster mount panics at the gateway node while using the NSD protocol.
- Work Around: None
- Problem trigger: AFM recovery
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High IJ15358
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Since Linux v4.16, the kernel prints the warning message "_ib_modify_qp: rq_psn overflow, masking to 24 bits" when RDMA is enabled by GPFS
- Work Around: None
- Problem trigger: On kernel v4.16 or later where RDMA is enabled
- Symptom: warning message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: Suggested IJ15274
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The close message is used at the gateway node to reset the dirty bit on the inode for memory-mapped files. The dirty bit is reset after acquiring an exclusive lock on the inode. If the application does not allow the exclusive lock to be acquired, it can block the quiesce, causing a cluster-wide deadlock.
- Work Around: None
- Problem trigger: A region of the memory-mapped file is unmapped while locks are still held.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM
- Customer Impact: Critical IJ15275
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Users cannot run any GPFS command on the file system due to a panic on the file system manager. This can occur when the number of failure groups with down/unrecovered disks exceeds the replication requirement.
- Work Around: None
- Problem trigger: An additional disk failure after the file system was already unmounted because too many disks had become unavailable.
- Symptom: Cluster/File System Outage
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: High IJ15277
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM recovery is very slow when there are a large number of dirty directories. AFM obtains the remote parent inode by reading the dirtyDirs file, which is opened every time the remote parent inode is required, causing performance overhead.
- Work Around: None
- Problem trigger: AFM recovery
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High Importance IJ15436
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Currently the path and configuration file names for the hadoop connector are hard-coded in the System Health monitor. Newer versions of the hadoop connector may have different default path settings, which would not work with the current setup.
- Work Around: Adjust the path settings of newer hadoop connector installations to match the older (current) releases. Symbolic links may be needed to make programs and files accessible at the expected paths.
- Problem trigger: Installation of a newer hadoop connector version might not work properly when its default path settings differ from the older version.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux More specific: CES Nodes running hadoop connector monitoring
- Functional Area affected: System Health
- Customer Impact: Affects upgrades of the hadoop connector module and upgrades to 5.0.x IJ15362
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Fix a problem in which an encrypted file could not be opened. IJ15278
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A single heap on AIX may cause performance or thread blockage problems.
- Work Around: Set the GPFS configuration parameter envVar, or use the -E option of the mmstartup command (see the example after this entry).
- Problem trigger: A single heap on AIX may cause performance or thread blockage problems.
- Symptom: Performance Impact/Degradation
- Platforms affected: AIX/Power only
- Functional Area affected: All
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ15105
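- Example: A hedged sketch of the envVar workaround; the MALLOCOPTIONS multiheap setting and the exact name/value syntax are assumptions to be verified against the mmchconfig and AIX malloc documentation for your level:
  mmchconfig envVar="MALLOCOPTIONS multiheap:16"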
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon assert going off: exp(fromNode != regP->owner) after file system panic
- Work Around: None
- Problem trigger: File system panic on a node with the file system mounted.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15813
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: While performing an unlink operation on the fileset, the unlink waits for in-flight requests to finish, but these requests are stuck and are not killed as they should have been.
- Work Around: None
- Problem trigger: AFM and AFM-ADR fileset unlink
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Suggested IJ15844
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM does not support punch hole, and this is documented. In some cases a punch-hole operation on AFM filesets is allowed through the DMAPI API, and the operation is not replicated to home or secondary.
- Work Around: None
- Problem trigger: Punch hole using DMAPI API
- Symptom: Unexpected results.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High Importance IJ15818
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A node is expelled from the cluster because a message is reported lost after a reconnect, like the message below: Message ID 2449 was lost by node IP_ADDR NODE_NAME wasLost 1
- Work Around: None
- Problem trigger: An unhealthy network leads to reconnects
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15845
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The AFM cache might not read migrated files properly if home has migrated files and AFM is not enabled at home. The file will not be fully cached.
- Work Around: None
- Problem trigger: Reading migrated files from home when AFM is not enabled at home.
- Symptom: Unexpected results
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical IJ15866
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Daemon assert going off: !(sanGetHyperAllocBit() && hasAFragment() && ofP->isRootFS()), resulting in a daemon abend.
- Work Around: In GPFS 5.0 and above, disable the assert: mmchconfig disableAssert='metadata.C;mnode.C;sanSetFileSizes.C' In older GPFS, disable the dynassert: mmchconfig dynassert='sanergy 0'
- Problem trigger: Users having applications which append to the same GPFS file from multiple nodes are potentially affected.
- Symptom: Abend/Crash
- Platforms affected: Windows only
- Functional Area affected: All
- Customer Impact: High Importance IJ15432
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM recovery might prevent the file system quiesce from happening, causing timeouts in management commands such as snapshot creation.
- Work Around: None
- Problem trigger: Unresponsive home or secondary
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ15769
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The Ganesha server does not allow reading a file if GPFS is mounted read-only on the Ganesha server.
- Work Around: None
- Problem trigger: GPFS is mounted as read only on the Ganesha server and data access from NFS client.
- Symptom: Data access failure from NFS client
- Platforms affected: Linux Only
- Functional Area affected: Ganesha
- Customer Impact: High Importance IJ15770
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: CLOUDGATEWAY shows no health status other than CHECKING in mmhealth, and therefore in the GUI as well.
- Work Around: None
- Problem trigger: Customers running 4.2.3 might be affected if they have a cloud gateway configured.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: GUI, System Health
- Customer Impact: High Importance IJ15855
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: NFS/Ganesha hung, but the System Health monitor did not trigger a failover of the CES IPs.
- Work Around: Stop (kill) the ganesha process and restart it using "mmces service start nfs" if it is not restarted automatically after a few seconds.
- Problem trigger: The reason for the Ganesha hang is unclear. The System Health monitor showed that NULL requests to the NFS server failed, but other checks still showed that Ganesha was partially running, so no failover was triggered. The expectation is that an IP failover should occur in this case.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: CES Nodes running NFS service
- Functional Area affected: System Health
- Customer Impact: High Importance IJ15856
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If AFM sees operations stuck on one of the fileset queues (while replaying to the remote site), a bail-out mechanism kills the stuck operations and restarts them. The deadlock happens with the NSD protocol backend alone, when AFM tries to kill the stuck operations and restart them.
- Work Around: None
- Problem trigger: When AFM replication is happening from the cache/primary site to the remote/home site over the NSD protocol (for more than one fileset), and there are latencies due to a bad network or slowness in the remote cluster, AFM tries to kill the stuck operations on the cache/primary site, with the possibility of hitting the mentioned deadlock.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters. A cluster-wide deadlock in which everything appears stuck. Only for AFM replication over the NSD protocol.
- Platforms affected: ALL Linux OS environments (AFM Gateway nodes only).
- Functional Area affected: AFM, specifically AFM replication over the NSD protocol.
- Customer Impact: High Importance IJ15753
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An error message appears in mmfs.log, like below: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000007 00000000 00 00 0000 00000000 00000000 0000 0000 Then the daemon was shut down due to LOGSHUTDOWN being called.
- Work Around: None
- Problem trigger: An unhealthy network leads to packet corruption
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15848
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Some vendors' drives have a serial number that is padded with blanks at the beginning. This causes an "awk" error when replacing a drive with the mmchcarrier command. Functionally the operation succeeds, but an ugly awk message is displayed. You will see a message like: mmchcarrier : [I] Preparing a new pdisk for use may take many minutes. Attempting to update firmware if necessary. Failure will not prevent drive replacement. awk: fatal: cannot open file `8DG5P96Z' for reading (No such file or directory) Command: err 0: mmchfirmware --type drive --serial-number 8DG5P96Z --new-pdisk
- Work Around: None
- Problem trigger: This problem is caused by a drive serial number that is prepended with blanks, and it will only be seen when using the mmchcarrier command.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments This has only been observed on a Lenovo DSS configuration.
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested IJ15762
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A segmentation fault happens in the mmfsd daemon when running the mmdiag command with the "--all" or "--threads" option on an AIX system.
- Work Around: Avoid running the mmdiag command with the "--all" or "--threads" option.
- Problem trigger: The mmdiag command with the "--all" or "--threads" option.
- Symptom: Daemon crash
- Platforms affected: AIX
- Functional Area affected: mmdiag command with "--all" and "--threads" options.
- Customer Impact: Suggested IJ15867
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A file may be evicted from the cache and read again after closing a memory-mapped file that was opened for writing, because remote attributes are not updated correctly.
- Work Around: None
- Problem trigger: When a file is written using memory-mapped I/O in AFM caching modes
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical IJ15767
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS daemon will hit signal 11 or the log assert "offset < ddbP->mappedLen" when a user application, log recovery, or the tsdbfs or mmfsck command accesses a corrupted directory (a directory whose file size is smaller than 32 bytes, the size of the directory block header structure).
- Work Around: None
- Problem trigger: This kind of corrupted directory could be caused by a previous code bug.
- Symptom: Abend/Crash
- Platforms affected: All
- Customer Impact: High IJ15847
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmfileid command failed with return code 56: Operation already in progress
- Work Around: Run the same command again.
- Problem trigger: The uninitialized field fiduCheckRunning in the SGMgrData constructor causes runTSFileIDCmd to think that the command is already running.
- Symptom: mmfileid command fails with return code 56.
- Platforms affected: All
- Customer Impact: Suggested IJ15851
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS can have very long waiters during mmrestripefs or mmdeldisk if there are down disks in the file system.
- Work Around: None
- Problem trigger: The file system data is replicated (-m and/or -r larger than 1) and applications write data into the file system while a disk is down.
- Symptom: Long waiters
- Platforms affected: All
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ15863
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: dm_getall_disp does not fail with err 22 when called with a bad sessionId
- Work Around: None
- Problem trigger: When dm_getall_disp is called with a bad sessionId
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ15105
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ15105 IJ15272 IJ15273 IJ15274 IJ15275 IJ15276 IJ15277 IJ15278 IJ15315 IJ15347 IJ15358 IJ15362 IJ15432 IJ15436 IJ15464 IJ15753 IJ15762 IJ15767 IJ15769 IJ15770 IJ15813 IJ15818 IJ15844 IJ15845 IJ15847 IJ15848 IJ15851 IJ15855 IJ15856 IJ15863 IJ15866 IJ15867.
Problems fixed in IBM Spectrum Scale 4.2.3.14 [March 21, 2019]
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert going off: doneRepDAP != NULL
- Work Around: None
- Problem trigger: One possible scenario is creating many files in a directory that would cause the GPFS log to be full.
- Symptom: Abend/Crash
- Platforms affected: all
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ13559
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Signal 11 or a log assert is hit during mmshutdown
- Work Around: None
- Problem trigger: mmfsd ended abnormally
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: High Importance IJ14148
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A file in the CCR committed directory gets truncated to zero length after a VM restart following a hard power-off of the VM (without notification), in conjunction with xfs as the local file system and GPFS active when the hard power-off happens.
- Work Around: Whenever possible, avoid hard powering off a VM. Instead use an ACPI shutdown, or wait at least 60 seconds before powering the VM off hard.
- Problem trigger: The GPFS mmfsd uses the PF_MEMALLOC flag to avoid previously seen deadlocks during page memory management. Unfortunately, xfs contains a sanity check on this flag while writing a dirty buffer to disk. When the flag is set, xfs refuses to write the dirty buffer, which leads to the empty file in the CCR committed directory, even though the fsync system call that CCR issues (for file updates) returned successfully.
- Symptom: Cluster outage in case not enough good copies of the affected file available.
- Platforms affected: Seen only on Linux based systems using xfs for the local file system
- Functional Area affected: CCR
- Customer Impact: Critical IJ14149
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Spurious false alerts in mmsysmon may cause CES failover
- Work around: n/a
- Problem trigger: RHEL <7.4 or SLES 12/15 + timing window
- Symptom: Unexpected Results/Behavior
- Platforms affected: RHEL <7.4 or SLES
- Functional Area affected: System Health
- Customer Impact: Suggested IJ14150
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: In very rare cases, after a file system manager fails with SGPanic, it is possible for SGExceptionLocalPanicThread to be stuck on "waiting for new SG mgr". This in turn prevents the file system from being mounted on the node.
- Work around: Mount the file system on other nodes first, then issue the mmchmgr command to force file system manager assignment to another node.
- Problem trigger: A newly appointed file system manager fails with SGPanic, with a cluster manager failure at around the same time as the file system manager failure.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14165
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The page-fault handler blocked due to mailbox resource contention, causing a deadlock
- Work around: Increase the number of worker1Threads (see the example after this entry)
- Problem trigger: heavy mmap workload
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ13556
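- Example: An illustrative sketch; the value 256 is a placeholder, and a worker1Threads change may require a GPFS daemon restart to take effect:
  mmchconfig worker1Threads=256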
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Hit the following log assert: Assert exp(t < now + TimeStamp(lsP->leaseDuration/2)
- Work around: None
- Problem trigger: An unhealthy network leads to reconnects
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14193
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Daemon crash while uncached files are being read on the AFM filesets
- Work around: None
- Problem trigger: Reading the uncached files on the AFM filesets
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: AFM
- Customer Impact: High Importance IJ14195
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Long waiter on Msg handler ccMsgExpelYourself, like below: Waiting 111838.8162 sec since 15:38:51 (-1 day), monitored, thread 19387 Msg handler ccMsgExpelYourself: on ThCond 0x3FFDCC118230 (MsgRecordCondvar), reason 'RPC wait' for commMsgDisconnec
- Work around: None
- Problem trigger: An unhealthy network leads to reconnects
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14199
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An fsstruct error is logged or an assert is hit in the mmfsd daemon when reading compressed file data from a snapshot.
- Work around: Avoid the read from the snapshot, or decompress the file in the active file system before reading from the snapshot (see the example after this entry).
- Problem trigger: The file has been compressed and its data is then read from the snapshot.
- Symptom: fsstruct error or daemon assert
- Platforms affected: All
- Functional Area affected: File compression
- Customer Impact: High Importance IJ13557
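- Example: An illustrative decompression of a single file in the active file system; the path is a placeholder:
  mmchattr --compression no /gpfs/fs0/path/to/file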
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If a command has a non-zero exit code, the output goes to stdout instead of stderr
- Work around: No
- Problem trigger: Wrong command use.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: System Health
- Customer Impact: Suggested IJ14208
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A fileset is in dirty state, IW failback is started on that fileset and fails to execute, and the fileset is stuck in the FailbackInProg state forever.
- Work around: None
- Problem trigger: The fileset is in dirty state and IW failback is started on it.
- Symptom: The cache state is stuck in FailbackInProg forever.
- Platforms affected: Linux only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ14206
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Application performance degrades on reaching the soft memory limit, due to the throttling of incoming changes while new flush threads are started.
- Work around: Increase afmHardMemThreshold (see the example after this entry)
- Problem trigger: Memory usage at the gateway node crosses the soft memory limit (40% of afmHardMemThreshold)
- Symptom: Performance Impact/Degradation
- Platforms affected: All
- Functional Area affected: AFM
- Customer Impact: High Importance IJ14216
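- Example: An illustrative sketch; the 10G value is a placeholder rather than a recommendation:
  mmchconfig afmHardMemThreshold=10G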
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A race condition where attributes are being built in an async lookup for a deleted file, which logs the error E_NOATTR.
- Work around: None
- Problem trigger: When a file is being deleted and a lookup operation is queued for the file.
- Symptom: The error E_NOATTR is logged for the file.
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ14217
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon assert with expression: Assert exp(holdCount > 0) in quota enabled file system.
- Work around: None
- Problem trigger: Assertion caused by a race between openFile asyncSteal thread (asyncStealWorkerBody) and file system close (endUseInternal @SFSCloseFSFinish).
- Symptom: In a very rare timing case, the file system manager encounters the assertion "exp(holdCount > 0)" and restarts. Many situations could lead to this assertion, but this case is specific to the quota file structure being stolen by cleanup.
- Platforms affected: All
- Functional Area affected: GPFS
- Customer Impact: High Importance IJ14243
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert going off: logAssertFailed: mutexEventWordP == NULL
- Work around: None
- Problem trigger: Token transfer happens as nodes leave or join a cluster. There is a tiny window in which the threads working on the task can exit in the wrong order, leading to the assert.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14218
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On a GPFS node with LROC enabled, it is possible for the GPFS daemon to be killed due to an assert or signal 11 during buffer steal.
- Work around: The problem can be avoided by disabling LROC (see the example after this entry).
- Problem trigger: This problem could occur when buffer steal is triggered on an LROC-enabled GPFS node.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: LROC
- Customer Impact: High Importance IJ14219
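- Example: A hedged sketch of disabling LROC usage through configuration; whether these three settings are sufficient on a given level should be verified against the mmchconfig documentation:
  mmchconfig lrocData=no,lrocDirectories=no,lrocInodes=no -i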
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For the mmapplypolicy command with the option --choice-algorithm fast, when a policy EXTERNAL rule contains an ESCAPE clause, the ESCAPE clause is seemingly ignored.
- Work around: Omit choice-algorithm option
- Problem trigger: mmapplypolicy and choice-algorithm fast and EXTERNAL ESCAPE and -N and -g (which became default with 5.x.y)
- Symptom: ESCAPE '%' is ignored, file lists with the default \n escapes.
- Platforms affected: All
- Functional Area affected: Policy/ILM
- Customer Impact: Customers using mmapplypolicy, --choice-algorithm, ESCAPE IJ13562
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS log assert "(offset < ddbP->mappedLen)" while accessing corrupted data in inode directory.
- Work around: Run offline fsck
- Problem trigger: GPFS defect D.1054097 may corrupt data in an inode directory. Accessing this kind of directory triggers the log assert, which abends the mmfsd daemon or crashes the kernel.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ13560
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A Windows SMB client is trying to overwrite an immutable file that is within the retention period. Windows Explorer tries to access the file for deletion, which is denied. Then Windows Explorer tries to clear the READONLY attribute; this request returns SUCCESS, but the flag is not cleared. After this, the deletion is attempted again, then the attempt to clear the flag, resulting in Windows Explorer being stuck in a loop.
- Work around: None
- Problem trigger: A Windows SMB client is trying to overwrite an immutable file that is within the retention period.
- Symptom: Stuck
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ13566
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Unable to restore a migrated file from a snapshot. This is because the file was deleted from the active file system but the data of the deleted file was not copied to the snapshot.
- Work around: None
- Problem trigger: DMAPI is enabled and a file system snapshot is in use; deleting a migrated file can then trigger this problem.
- Symptom: Unable to restore a migrated file from a snapshot after the file is deleted in the active file system.
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: High Importance IJ13561
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon asserted when handling DMAPI Destroy event under certain conditions
- Work around: Do not set dmapi destroy event disposition
- Problem trigger: An assert happens when generating a DMAPI destroy event during file deletion if the file has attributes and is not in cache.
- Symptom: GPFS daemon asserts
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical IJ13568
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The queue is in dropped state and the error was set to E_INPROGRESS; resetting this error while exiting from the resync operation to make the fileset active hits the assert.
- Work around: None
- Problem trigger: Resync is dropped and the error is set to E_INPROGRESS.
- Symptom: An assert is hit
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ14220
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Resync/failover or "changeSecondary --inband" commands run more slowly if the network latency is high between cache/primary and home/secondary.
- Work around: None
- Problem trigger: Resync with high network latency between cache/primary and home/secondary.
- Symptom: Performance impact
- Platforms affected: Linux Only
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ14221
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock at the gateway node when files are created and deleted within a short span of time, due to message filtering.
- Work around: None
- Problem trigger: File creation and deletion within a short span of time.
- Symptom: Deadlock
- Platforms affected: ALL Operating System environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ14225
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: During AFM resync or the AFM DR "changeSecondary --inband" command, a write fails intermittently with error 2, causing the queue to be dropped. This causes the resync to start again and can result in a loop.
- Work around: None
- Problem trigger: Fileset resync
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical IJ14230
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: While performing a conversion, list files are not available to the user on failure, where the user would otherwise see the list output files generated by the policy.
- Work around: None
- Problem trigger: When conversion to AFM DR filesets, or disabling of AFM filesets, is in progress; no list files are shown on command failure.
- Symptom: The reason is logged on command failure, but without list files.
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ14234
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory corruption in heap memory
- Work around: None
- Problem trigger: An invalid parameter passed to an 'mm' command
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14235
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: With QOS enabled, deadlock might occur during mount, particularly if the manager node is heavily loaded.
- Work around: Do not enable QOS, avoid unnecessary mounts, and avoid overloading the manager node(s).
- Problem trigger: See problem description
- Symptom: Deadlock. Tracebacks on the manager node will show two threads in QosIoMon::startManager(unsigned int)
- Platforms affected: All
- Functional Area affected: GPFS core functionality with QOS enabled.
- Customer Impact: Low IJ13563
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The daemon crashed on an assert that the extended attribute should have been cleared before the file was destroyed.
- Work around: Avoid mmrestripefs/mmdeldisk commands while deleting the user files.
- Problem trigger: Deleting user files while mmrestripefs/mmdeldisk commands are running.
- Symptom: Daemon crash
- Platforms affected: All
- Functional Area affected: File deletions and file system restripe.
- Customer Impact: High Importance IJ14236
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When the AFM cache is accessed using NFS, changes from home are not pulled into the cache in time.
- Work around: None
- Problem trigger: Accessing the AFM cache using NFS.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ13565
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmrepquota -t -Y sometimes returns wrong blockGrace or fileGrace values
- Work around: None
- Problem trigger: An error in printing the command output: the blockGrace and fileGrace values are of UInt32 type but are printed with the %lld format in the mmrepquota -t -Y output, occasionally causing wrong values to be printed.
- Symptom: Sometimes the values returned by "mmrepquota -t -Y" are wrong.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: Suggested IJ14237
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM fileset cannot be converted from independent-writer mode to single-writer mode without unlinking the fileset.
- Work around: Unlink the fileset and then convert the fileset mode
- Problem trigger: Mode conversion
- Symptom: Component Level Outage
- Platforms affected: All
- Functional Area affected: AFM
- Customer Impact: High Importance IJ13564
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak in the kernel when nonempty directories are deleted during the replication. AFM renames the locally deleted but nonempty directories at the home or secondary during the replication. These nonempty directories are moved to the .afmtrash directory under the fileset root.
- Work around: None
- Problem trigger: Nonempty directory deletion.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ14238
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If a command has ? as the last parameter, it creates different output depending on the content of the folder where the command is executed.
- Work around: None
- Problem trigger: Files starting with a number in the same folder where the command is executed.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: SMB
- Customer Impact: Suggested IJ14239
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmkeyserv may fail if the current working directory is on an NFS-mounted file system.
- Work around: Make sure the current working directory is not on an NFS-mounted file system.
- Problem trigger: mmkeyserv may fail if the current working directory is on an NFS-mounted file system.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Admin Commands and Encryption
- Customer Impact: Suggested IJ14240
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock between GPFS daemon WritebehindWorkerThread and ganesha nfs daemon.
- Work around: Restart GPFS daemon on affected node.
- Problem trigger: A file is removed before all writes to it are flushed to disk, so the file is no longer in use through the kernel VFS interface. While GPFS is flushing data to disk, the Ganesha NFS daemon tries to access the file.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environment
- Functional Area affected: NFS
- Customer Impact: High Importance IJ13567
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: With verbsRDMA enabled, UID and GID are not set correctly on newly created files. UID and GID are always set to 0, causing permission-denied errors for users at the home or secondary sites.
- Work around: Disable verbsRDMA
- Problem trigger: verbsRDMA enabled with AFM replication
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environment
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ13708
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon crashed.
- Work around: None
- Problem trigger: The allocation of a new data block for a write failed, so the write to the block failed, but the buffer had been marked for delayed zero-out before the getNewDataBlock call in modifyBuffer, and the buffer was supposed to be cleaned up in the flush. In this defect, the application continued writing other blocks successfully, so the buffer was not cleaned up in a later flush. The application then tried to truncate the file and found that the block had dirty data but no real block allocated.
- Symptom: GPFS daemon hit assert when truncate file
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14430
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: SMB CLI Help shows wrong default value for smb option "posix locking"
- Work around: None
- Problem trigger: None
- Symptom: Documentation Problem
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: Suggested IJ14271
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For the mmapplypolicy command with the option --choice-algorithm fast, when a policy MIGRATE rule contains a THRESHOLD clause that ought to provoke PREMIGRATEs, only MIGRATEs are seen.
- Work around: Omit choice-algorithm option
- Problem trigger: mmapplypolicy and choice-algorithm fast and MIGRATE THRESHOLD(x,y,z) and -N and -g (which became default with 5.x.y)
- Symptom: No PREMIGRATEs, only MIGRATEs, even when THRESHOLD(0,100,0) which ought to MIGRATE no files and PREMIGRATE all selected files.
- Platforms affected: All
- Functional Area affected: Policy/ILM
- Customer Impact: Customers using mmapplypolicy, --choice-algorithm, (PRE)MIGRATE THRESHOLD(x,y,z) with HSM or HSM-like ILM policies. IJ14043
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Inappropriate ENOLCK returns from fcntl(F_SETLK) calls.
- Work around: The condition here is where the fcntl token becomes fragmented (due to revoke requests) and not reclaimed on those revokes. Since there is also reclaim logic on lock/token acquires, repeating these failed fcntl calls will eventually reclaim the space (by approximately 200 ranges at a time).
- Problem trigger: A large fcntl range is locked on a node and then released; then locking of multiple sub-ranges on a second node causes the original token range to become fragmented in excess of the maxFcntlRangesPerFile setting plus the number of ranges that can be reclaimed on a single call (typically 200).
- Symptom: Error
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ14044
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If a customer defines alias names for GPFS nodes, mmsmb commands may not work from non-CES nodes.
- Work around: Remove alias names from the network configuration.
- Problem trigger: Ambiguous name resolution of CES nodes.
- Symptom: If an mmsmb command is executed from a non-CES node, an error is returned saying that no healthy ctdb is available. The GUI is affected if it runs on a non-CES node.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: SMB
- Customer Impact: High Importance IJ14426
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Performance degradation of small file creates when multiple threads or processes create different files concurrently on one or more nodes.
- Work around: Set the configuration parameter "maxActiveIallocSegs" to 8 or larger (for example, mmchconfig maxActiveIallocSegs=8).
- Problem trigger: Concurrent create workload.
- Symptom: Performance Impact/Degradation.
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ14051
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If the country code is not set in the settings of "mmcallhome info list", the daily/weekly scheduled uploads will not be actually sent, but will be listed in "mmcallhome status list" as sent.
- Work around: Properly set up the country code in the call home settings, e.g.: "mmcallhome info change --country-code US"
- Problem trigger: Running scheduled Call Home data collection (also on demand by "mmcallhome run GatherSend --task daily") when no country code is set up in the call home settings.
- Symptom: On the IBM ECuRep server no uploads arrive, while on the cluster "mmcallhome status list" shows that everything worked properly.
- Customer Impact: Suggested IJ14280
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Multiple nodes may deadlock during the recovery of a node failure.
- Work around: None
- Problem trigger: A failure of one node in the cluster (this will automatically initiate recovery processing where the problem can occur).
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: Core
- Customer Impact: High Importance IJ14464
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ13556 IJ13557 IJ13559 IJ13560 IJ13561 IJ13562 IJ13563 IJ13564 IJ13565 IJ13566 IJ13567 IJ13568 IJ13708 IJ14031 IJ14044 IJ14043 IJ14051 IJ14148 IJ14149 IJ14150 IJ14165 IJ14193 IJ14195 IJ14199 IJ14206 IJ14208 IJ14216 IJ14217 IJ14218 IJ14219 IJ14220 IJ14221 IJ14225 IJ14230 IJ14234 IJ14235 IJ14236 IJ14237 IJ14238 IJ14239 IJ14240 IJ14243 IJ14271 IJ14280 IJ14426 IJ14430 IJ14464.
Problems fixed in IBM Spectrum Scale 4.2.3.13 [January 24, 2019]
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert going off: vinfoP != NULL
- Work Around: None
- Problem trigger: A mixture of workload was run when the problem happened.
- Symptom: Abend/Crash
- Platforms affected: all
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ12050
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock
- Work Around: None
- Problem trigger: Manager nodes going down while some of the manager nodes are low in memory.
- Symptom: Deadlock
- Platforms affected: all
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ12197
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A failure to allocate a buffer, caused by a small pagepool, causes a file system mount to fail. This occurs when mounting a file system while a heavy workload is consuming the pagepool.
- Work Around: Increase the pagepool size.
- Problem trigger: Mount a FS while heavy workload is running.
- Symptom: Error output/message.
- Platforms affected: ALL Operating System environments
- Functional Area affected: Filesets
- Customer Impact: High Importance IJ12199
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert or segmentation fault
- Work Around: None
- Problem trigger: Manager nodes going down while some of the manager nodes are low in memory in a cluster hosting multiple file systems.
- Symptom: Abend/Crash
- Platforms affected: all
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ12204
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmafmctl command uses the short hostname for the gateway. This might cause problems if nodes from different domains are added to the cluster, because the short hostname evaluated by the daemon might not match the actual hostname.
- Work Around: Add short hostname to the /etc/hosts file
- Problem trigger: AFM is used in a cluster with inconsistent or different domain names.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux
- Functional Area affected: AFM
- Customer Impact: High Importance IJ12094
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: EAGAIN and EWOULDBLOCK errors from the fflush() call would cause an unexpected file system unmount.
- Work Around: Retry the operation after remounting the file system. The user may need to reboot the node in order to remount the file system.
- Problem trigger: Unknown.
- Symptom: Cluster/File System Outage
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ11996
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The CLI command "mmces events list -Y" does not work. Without the -Y option it works.
- Work Around: Use the human-readable version (without the -Y option)
- Problem trigger: always
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ12206
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For bandwidth checks, mmnetverify does not honor the cluster configuration value for tscCmdPortRange, which may result in an incorrect issue being reported if firewall settings do not allow TCP connections to ephemeral ports.
- Work Around: None
- Problem trigger: This issue affects clusters with both conditions true: 1) The tscCmdPortRange configuration value is used to specify a port range outside the system's ephemeral port range. 2) Network firewall settings do not allow connections to the system's ephemeral port range.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments
- Functional Area affected: System Health
- Customer Impact: Suggested IJ11998
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Spectrum Scale mmfsd daemon crashes with "logAssertFailed: sgP->isPanicked()" when running mmlsfileset -i -d command.
- Work Around: Enable the trigger "MTWPlowThroughInode0Holes" based on the guidance from IBM support.
- Problem trigger: While the mmlsfileset -i -d command is traversing the indirect block tree of the inode 0 file, inode block expansion and copy operations discard the whole indirect block tree of the inode 0 file, causing the problem described above.
- Symptom: Daemon crash
- Platforms affected: All
- Functional Area affected: mmlsfileset command with -i -d options.
- Customer Impact: High Importance IJ12000
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Command "mmlsfirmware --type drive" is failing when used for the first time when deploying DSS-G systems. It completes successfully when used again on the same system. The error messages are as follows: dss23.cluster: mmcomp: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. mmlsfirmware: Command failed. Examine previous error messages to determine cause.
- Work Around: A work-around would be to run "mmlscompspec" immediately after deploying the system.
- Platforms affected: ESS/GSS configurations.
- Functional Area affected: ESS
- Customer Impact: Suggested IJ11952
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Async recovery can be stuck processing deferred inode deletions for a long time on a busy system. This can queue up GPFS administrative commands behind the async recovery.
- Work Around: A node failure during async recovery can queue up another run of deferred deletions. To overcome this, you can move the sgmgr node and then ensure that there is no node failure while the file system is being recovered. However, this will not prevent this situation from happening again due to further node expels.
- Problem trigger: A file system recovery is triggered due to node expel or first mount and the async portion of the recovery dealing with deferred inode deletions can take a long time to complete on a busy system.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High Importance IJ12001
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: In case of a slow or unstable network connection between a node and the cluster manager, or if the cluster manager is overloaded, the transmission of health events can time out. Such a timeout will cause the error message "sdrServ: Communication error on socket 481 (10.14.32.5) @handleRpcReq/AuthenticateIncoming, [err 146] Internal server error message." to show up in the mmfs log of the cluster manager. The log message is harmless because the node will retry sending the health event to the cluster manager.
- Work Around: None
- Problem trigger: Slow or unstable network connectivity or overloaded cluster manager.
- Symptom: Error message "sdrServ: Communication error on socket 481 (10.14.32.5) @handleRpcReq/AuthenticateIncoming, [err 146] Internal server error message."
- Platforms affected: Linux and AIX
- Functional Area affected: Health Monitoring
- Customer Impact: Low Importance IJ12002
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The file system remains internally mounted after running command to get the quota grace period (mmrepquota -t).
- Work Around: Shutdown the daemon.
- Problem trigger: "mmrepquota -t" command.
- Symptom: File system use count leak. The file system remains internally mounted after running mmrepquota -t command to get the quota grace period.
- Platforms affected: ALL Operating System environments
- Functional Area affected: Quotas
- Customer Impact: High Importance IJ12211
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If the ibmobjectizer service is started on a non-singleton node, it should be stopped automatically, but this does not happen. This may interfere with the objectizer functionality.
- Work Around: Manually stop the service on the non-singleton node.
- Problem trigger: This issue affects customers running IBM Spectrum Scale V4.2.3.x. A manual start of the ibmobjectizer service on a non-singleton node, or a failover of the object singleton node role, could trigger this. This also affects other object services which are meant to be started only once.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Object
- Customer Impact: High Importance IJ12214
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS daemon could die after a user application does heavy I/O to the file system with a mix of valid and invalid data buffers.
- Work Around: Check application to avoid invalid data buffer
- Problem trigger: When an application appends data to a file using an invalid data buffer (for example, one that is not big enough or is entirely invalid), in some cases the kernel fails to transfer the data from the user-space buffer into the GPFS pagepool, and a bad buffer descriptor is left behind. If the pagepool is almost used up before buffer flushing detects and discards this bad buffer, other I/O activity on the file system can steal this buffer, and the steal hits this problem.
- Symptom: GPFS daemon crash.
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ12051
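To illustrate the work-around above, a minimal C sketch (invented names, not from the APAR) of validating the buffer before appending:
    /* The application-side guard: never write more bytes than the
     * buffer actually holds; an undersized buffer is exactly the
     * trigger described above. */
    #include <stddef.h>
    #include <unistd.h>

    static ssize_t append_checked(int fd, const void *buf, size_t buflen, size_t count)
    {
        if (buf == NULL || count > buflen)
            return -1;              /* refuse an invalid/undersized buffer */
        return write(fd, buf, count);
    }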
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When a node is added to a Spectrum Scale cluster and that node has a different domain name than the existing nodes in the cluster, some Spectrum Scale commands might display incorrect fully qualified domain names. IJ12049
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlsfileset command with the "-i" and "-d" options hangs, and then blocks or fails other conflicting commands.
- Work Around: Shrink the holes in the low-level inode file by moving files in filesets to the root fileset (or elsewhere) and then deleting the emptied filesets.
- Problem trigger: Big holes in the low-level inode file cause this problem if there are not enough indirect buffer descriptors.
- Symptom: The mmlsfileset command with the "-i" and "-d" options hangs.
- Platforms affected: All
- Functional Area affected: mmlsfileset command or mmrestripefs command.
- Customer Impact: High importance IJ12494
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlsfileset command with the -J or -F option can interfere with the group protocol and delay or prevent a node from leaving/joining the cluster. This issue shows up as nodes staying in arbitrating state.
- Work Around: Don't run mmlsfileset with -J or -F option when there is a node failure.
- Problem trigger: Running mmlsfileset with -J or -F option while node failure is occurring.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: Cluster Membership
- Customer Impact: High importance IJ12492
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AIX operating-system crash with the panic message indicating: "Failure at line 5950 in file vnodeops.C (!(0 != (nfsP->vcmRights & ((0x0001 | 0x0002) | (0x0004 | 0x0008)))) || nfsP->vinfo.oiP)"
- Work Around: The case is specific to nfs4 exports on AIX, but there is no known work-around.
- Problem trigger: The condition under which AIX made the invalid call is not known, but the issue is limited to NFS version 4 exports on AIX.
- Symptom: Abend/Crash
- Platforms affected: AIX/Power only
- Functional Area affected: NFS
- Customer Impact: Critical IJ12491
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM revalidation is sometimes slower in the caching modes.
- Work Around: None
- Problem trigger: AFM caching modes are used and readdir is performed on them after the refresh interval expires.
- Symptom: Performance Impact/Degradation
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: High Importance IJ12490
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Hard links for symlinks are not getting queued; instead the symlink is queued as a write operation, which throws an Invalid argument error during the operation.
- Work Around: None
- Problem trigger: While performing changeSecondary/resync operation.
- Symptom: The operation hits an Invalid argument error (err 22).
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ12582
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An inode of a parent directory (obtained from a Ganesha NFS file handle) is deleted and reused as a regular file; this causes a kernel panic on the Ganesha server while the kernel routine reconnect_path() is trying to make the dentry fully connected.
- Work Around: None
- Problem trigger: It may happen when an inode which used to be a directory is reused as a regular file and the GPFS dentry lookup operation for this new file takes the fast path. Note: the dentry lookup fast path is an optimized code path taken when the node holds strong enough inode and byte-range tokens and the buffer of the related block is already in the GPFS shared hash table.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: NFS
- Customer Impact: High Importance IJ12493
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Daemon crashes when a user is appending data to a file and syncing the file
- Work Around: None
- Problem trigger: The problem was a race between data appending and file sync; the race led the file sync to use a stale file size when reducing the size of the last data block.
- Symptom: Daemon crashed
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical 1073648
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: BytesToWrite remains non-zero for the MDS node in mmpmon stats for AFM in a mapping configuration, even though all bytes are written to home correctly and the data is consistent.
- Work Around: None
- Problem trigger: Writing chunks from multiple gateways when mappings are in effect.
- Symptom: Deadlock
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ12496
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If the file system has 4K native sector size disks, a user application runs on the NSD server, and the application does direct I/O with a non-4K-aligned file offset, the GPFS kernel extension will leak memory (on Linux, the leak happens in the kmalloc-1024 slab).
- Work Around: None
- Problem trigger: The file system has 4K native sector size disks, a user application runs on the NSD server, and the application does direct I/O with a non-4K-aligned file offset
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environment and AIX/Power
- Functional Area affected: All
- Customer Impact: High Importance IJ12552
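For illustration only (not from the APAR; names are invented), a minimal C sketch of keeping direct I/O 4K-aligned so the misaligned code path is never taken:
    /* With O_DIRECT on 4K-native-sector disks, keep the file offset,
     * transfer length and buffer address 4K-aligned. */
    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define DIO_ALIGN 4096

    static int write_aligned(const char *path, off_t offset, size_t len)
    {
        void *buf;
        if (offset % DIO_ALIGN || len % DIO_ALIGN)
            return -1;                            /* enforce 4K alignment */
        if (posix_memalign(&buf, DIO_ALIGN, len))
            return -1;                            /* 4K-aligned buffer */
        memset(buf, 0, len);
        int fd = open(path, O_WRONLY | O_DIRECT);
        if (fd < 0) { free(buf); return -1; }
        ssize_t n = pwrite(fd, buf, len, offset); /* aligned offset+length */
        close(fd);
        free(buf);
        return n == (ssize_t)len ? 0 : -1;
    }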
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When the token manager list changes (because of a token manager node failure or a configuration change), client nodes need to report their current token status to the new token managers. For a specific regular file 'f', if node 'A' holds its SX byte-range token (because an application does direct I/O read/write on this node), and 'f' is also in another node's state cache, GPFS may abend with "logAssertFailed: !(brStP->stFlags & 0x20)".
- Work Around: None
- Problem trigger: Token manager list changes because of token manager node failure or configuration change
- Symptom: Abend
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ12779
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Data in big files is corrupted after a disk is deleted (mmdeldisk)
- Work Around: None
- Problem trigger: The problem was that some data blocks of big files were skipped during repair in cases where other participant nodes failed while repairing the same big file.
- Symptom: Some data blocks are corrupted after disk deletion
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical IJ11626
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The system runs at almost 100% CPU load during I/O; all CPU is consumed by mmfsd threads, which are the threads issuing async calls to the LROC device, on nodes running Linux kernel level >=3.10.
- Work Around: Disable LROC
- Problem trigger: Using LROC-devices on nodes running linux kernel level >=3.10, without this fix.
- Symptom: High CPU-usage on the nodes having LROC-devices.
- Platforms affected: x86_64-linux only
- Functional Area affected: LROC
- Customer Impact: High Importance 1071297
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: 'mmfsadm dump deferreddeletions' does not show individual counts for toBeDeleted and beingCreated inodes.
- Work Around: None
- Problem trigger: mmfsadm dump deferreddeletions
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ12784
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: NFS v3 I/O with file locks could hang on failover due to a deadlock situation in rpc.statd. rpc.statd calls mmstatdcallout for each command, which calls sm_notify.ganesha. This triggers new smnotify commands from Ganesha, which are blocked because the single thread is still in the current sm_notify.ganesha call (deadlock). After some timeout, the deadlock resolves.
- Work Around: None
- Problem trigger: On a system where NFS protocol is configured and NFS v3 i/o is using file locks, a CES ip failover runs into this issue.
- Symptom: Stuck IO
- Platforms affected: ALL Linux OS environments
- Functional Area affected: NFS
- Customer Impact: High Importance IJ12698
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmfsd heap memory leak when run "tsctl showNodes" command. For each "tsctl showNodes" command, GPFS will leak nNodes * 512 Bytes. CES nodes run "tsctl showNodes" command every 15s, so this leak is more likely noticed on CES nodes.
- Work Around: None
- Problem trigger: run "tsctl showNodes" command
- Symptom: Performance Impact/Degradation
- Platforms affected: All
- Functional Area affected: CES
- Customer Impact: High Importance IJ12714
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: False detection of GNR recovery group failure (gnr_rg_failed) because of slow mmlsrecoverygroup.
- Work Around: None
- Problem trigger: High system load, resource contention, or slow or unstable network connectivity can cause mmlsrecoverygroup to take longer.
- Symptom: mmhealth shows "gnr_rg_failed" even though the recovery group is still OK and I/O works fine.
- Platforms affected: Linux
- Functional Area affected: Health Monitoring
- Customer Impact: Low Importance IJ12715
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmcheckquota verbose output is not printed when client nodes failed during mmcheckquota run.
- Work Around: None
- Problem trigger: Node failures while online mmcheckquota command is running.
- Symptom: mmcheckquota command returns with error and doesn't report the calculated quota discrepancy (verbose output).
- Platforms affected: All except Windows
- Functional Area affected: File system core - quotas
- Customer Impact: suggested IJ11999
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The system runs at almost 100% CPU load during I/O; all CPU is consumed by mmfsd threads, which are the threads issuing async calls to the LROC device, on nodes running Linux kernel level >=3.10.
- Work Around: Disable LROC
- Problem trigger: Using LROC-devices on nodes running linux kernel level >=3.10, without this fix.
- Symptom: High CPU-usage on the nodes having LROC-devices.
- Platforms affected: x86_64-linux only
- Functional Area affected: LROC
- Customer Impact: High Importance IJ12780
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ11626 IJ11952 IJ11996 IJ11998 IJ11999 IJ12000 IJ12001 IJ12002 IJ12049 IJ12050 IJ12051 IJ12094 IJ12197 IJ12199 IJ12204 IJ12206 IJ12211 IJ12214 IJ12490 IJ12491 IJ12492 IJ12493 IJ12494 IJ12496 IJ12552 IJ12582 IJ12698 IJ12714 IJ12715 IJ12780 IJ12784.
Problems fixed in IBM Spectrum Scale 4.2.3.12 [November 15, 2018]
- Problem description: An unexpected inode is copied into a newly created snapshot while a file is memory-mapped for read/write.
- Work around: Do not create a snapshot while a file is memory-mapped for read/write.
- Problem trigger: Create a snapshot while a file is memory-mapped for read/write.
- Symptom: Daemon assert.
- Platforms affected: All Operating System environments
- Functional Area affected: Snapshots or file mmap
- Customer Impact: Critical IJ10741
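For illustration only (invented names, not from the APAR), a minimal C sketch of what "memory-mapped for read/write" means in this entry; per the work-around, snapshots should not be created while such a mapping is being written:
    /* A file mapped MAP_SHARED with PROT_READ|PROT_WRITE and modified
     * in place. The file must already be at least len bytes long. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static int touch_mapped(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping survives the close */
        if (p == MAP_FAILED)
            return -1;
        memcpy(p, "update", 6);         /* dirty a mapped page */
        msync(p, len, MS_SYNC);         /* flush the change to the file */
        return munmap(p, len);
    }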
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The policy engine aborts the process because it encounters a valid directory as a to-be-processed target, but compression is not applicable to directories.
- Work around: Refine the compression policy rule to exclude the directories from the to-be-processed targets.
- Problem trigger: Files are being compressed through compression policy rules and some directories are selected as valid targets based on policy rule.
- Symptom: The compression policy rule process is interrupted.
- Platforms affected: ALL Operating System environments
- Functional area affected: compression/policy
- Customer Impact: Suggested IJ10752
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPC message is reported lost after reconnect and node is expelled, like below: Message ID 1012408 was lost by node IP_ADDRESS NODE_NAME wasLost 1
- Work around: No
- Problem trigger: Network is not good which leads to reconnect happening
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional area affected: All
- Customer Impact: High Importance IJ10753
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File system cleanup can't finish because of a file system panic. This prevents the file system from being remounted and could also prevent the node from joining the cluster after quorum loss.
- Work around: Restart GPFS daemon on the node.
- Problem trigger: File system panic occurs during certain phase of mmrestripefs or mmchpolicy command.
- Symptom: Cluster/File System Outage Node expel/Lost Membership
- Platforms affected: All
- Functional area affected: Admin Commands
- Customer Impact: High Importance IJ10775
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Hit the following assert after reconnect: logAssertFailed: secSendCoalBuf != __null && secSendCoalBufLen > 0
- Work around: No
- Problem trigger: Network is not good which leads to reconnect happening
- Symptom: Abend/Crash
- Platforms affected: All
- Functional area affected: All
- Customer Impact: High Importance IJ10796
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPC message is reported lost after reconnect and node is expelled, like below:
- Work around: No
- Problem trigger: Network is not good which leads to reconnect happening
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional area affected: All
- Customer Impact: High Importance IJ10808
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The --force option of the mmchnode command doesn't work in rare cases. "Rare" means a very small time window in which a CCR update has started but not finished; during the following mmchnode command the --force option is not passed down to the CCR correctly.
- Work around: No
- Problem trigger: The CCR must have seen an update which was started but not finished (e.g. caused by a network partition on the quorum nodes), and the mmchnode --force command must be run while this update is pending.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional area affected: CCR
- Customer Impact: Critical IJ10810
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Disk lease overdue after reconnect and node is expelled.
- Work around: No
- Problem trigger: Network is not good which leads to reconnect happening
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional area affected: All
- Customer Impact: High Importance IJ10811
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: There would be some threads waiting for the exclusive use of the connection for a long time even though no thread is sending on the connection, for example: Waiting 7192.9293 sec since 07:53:38, monitored, thread 2155 Msg handler ccMsgPing: on ThCond 0x7FE0A80012D0 (InuseCondvar), reason 'waiting for exclusive use of connection for sending msg'
- Work around: No
- Problem trigger: Lots of threads are waiting to send on one connection; if a reconnect happens at that time, it can cause this long waiter
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional area affected: All
- Customer Impact: High Importance IJ10812
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Not all the error messages are getting printed from mmlsfirmware. In particular, a warning should be issued if not all the targeted nodes could be reached. Also, an informational message should be put out if the targeted node doesn't have any components that apply to mmlsfirmware.
- Work around: None
- Platforms affected: ESS/GSS configurations.
- Functional Area affected: ESS
- Customer Impact: Suggested IJ10813
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Output of an SMB CLI command has extra line breaks when it contains user names starting with the letter N
- Work around: None
- Problem trigger: Issue an SMB CLI command like mmsmb exportacl add using a user name starting with the letter N
- Symptom: Output of the SMB CLI command has extra line breaks when it contains user names starting with the letter N
- Platforms affected: All
- Functional area affected: SMB
- Customer Impact: Suggested - has little or no impact on customer operation IJ10814
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM special operations like EA/ACL setting and snapshot creation/deletion might fail due to incorrect checks for the NFS open file at the home or secondary site. This causes operations which require the special control file to get requeued.
- Work around: None
- Problem trigger: AFM special operations like EA setting, peer snapshot creation etc.
- Symptom: Unexpected Results/Behavior.
- Platforms affected: Linux only
- Functional Area affected: AFM
- Customer Impact: High Importance IJ10971
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Node is expelled from the cluster because message is reported lost after reconnect, like the message below: Message ID 2449 was lost by node IP_ADDR NODE_NAME wasLost 1
- Work around: None
- Problem trigger: Messages are pending for more than 30 seconds waiting for replies and network is not good which leads to reconnect happening
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ10972
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Daemon crashes when a user runs mmrestripefs with the -b option to rebalance the file system.
- Work around: Create a new data pool by adding a new disk into the file system.
- Problem trigger: The problem was that some files have a dataPoolIndex that points to a deleted data pool, which led to the daemon crash in mmrestripefs
- Symptom: Daemon crashed
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Critical IJ10556
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPC messages may be received twice after reconnect and then hit a sanity check, such as the assert below: logAssertFailed: err == E_OK, at dirop.C 6389
- Work around: No
- Problem trigger: Network is not good which leads to reconnect happening
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ10973
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The disk addresses of blocks of a compressed file are corrupted into non-compressed disk addresses when extending a small file or setting extended attributes on it, so the compressed file can no longer be read or decompressed.
- Work around: Run offline fsck to fix the corrupted disk address for compressed files.
- Problem trigger: The mmfsd daemon or the system crashes when a small compressed file is being extended to a large file or having big EAs set on it.
- Symptom: The compressed files cannot be read or decompressed
- Platforms affected: All
- Functional Area affected: File compression
- Customer Impact: Critical IJ10974
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The progress indicator of restripe for user files doesn't match the real processed data.
- Work around: Set pitWorkerThreadsPerNode to 1, but this will slow down the progress of the restripe operation.
- Problem trigger: There are many big files in the file system and a restripe operation is run against it.
- Symptom: Restripe progress could jump to 100% completion from a very small indicator (e.g. 5%).
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Medium IJ10975
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This fix corrects the default value of forceLogWriteOnFdatasync reported by mmlsconfig/mmchconfig to yes. IJ10528
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A few GPFS commands like mmaddnode, mmdelnode or change quorum semantic (mmchnode) may cause the status of systemd mmsdrserv.service to report as failed.
- Work around: Reset or ignore the failed mmsdrserv.service status.
- Problem trigger: mmaddnode, mmdelnode, mmchnode --quorum/--noquorum while GPFS is running.
- Symptom: Error output/message
- Platforms affected: Linux systems with systemd version 219 or later.
- Functional Area affected: Admin Commands - systemd
- Customer Impact: Suggested IJ10527
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: With the NSD protocol, an incorrect remote cluster mount is panicked when the fileset is not responding. AFM kills the stuck requests on the remote mount by panicking the remote file system. If there are multiple remote file systems, the remote file system panicked may not be the correct one for the fileset.
- Work around: None
- Problem trigger: Usage of multiple remote filesystems and the network issues between cache and home.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux only
- Functional Area affected: AFM
- Customer Impact: High Importance IJ10555
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: There is a deadlock where a mutex is not released by the ping thread, which is in a while loop, while at the same time another thread is waiting to acquire this mutex to set the state.
- Work around: None
- Problem trigger: When a homelist is being unregistered and meanwhile another handler is trying to register the same homelist.
- Symptom: Deadlock
- Platforms affected: Linux only
- Functional Area affected: AFM
- Customer Impact: Suggested IJ10976
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Lookup and readdir performance issues with AFM ADR after converting a regular independent fileset to an AFM ADR fileset, because asynchronous lookups are sent to the gateway node in the application path.
- Work around: None
- Problem trigger: AFM ADR inband conversion
- Symptom: Performance Impact/Degradation
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: High Importance IJ10977
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The primary node does not automatically take over the RG after restart
- Work around: None
- Problem trigger: Primary node goes down and user initiates an explicit takeover to primary node.
- Symptom: The recovery group is relinquished and not served by either primary or secondary even though both nodes are up.
- Platforms affected: Linux Only
- Functional Area affected: ESS/GNR
- Customer Impact: High Importance IJ10529
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GUI activity log files are being partially removed by call home scheduled data collection.
- Work around: Run "mmcallhome schedule delete --task DAILY" and "mmcallhome schedule delete --task WEEKLY"; re-add the schedules after the issue is fixed.
- Problem trigger: running daily or weekly call home schedules
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: GUI + Callhome
- Customer Impact: Suggested IJ10982
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The reported health state of a component (e.g. FILESYSTEM) which has multiple entities (individual filesystems) is not reported correctly if some of them are HEALTHY and another is in TIPS state. The expectation is that the overall state for the component is TIPS in this case.
- Work around: None
- Problem trigger: Have multiple filesystems in HEALTHY state and one or more filesystems in TIPS state. The TIPS state could be reached because the mountpoint of the filesystem is different from its declared mountpoint (check with mmlsfs).
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: System Health
- Customer Impact: has little or no impact on customer operation IJ10984
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: After a "mmchnode --ces-disable" of a CES node using the SMB protocol, there are still SMB/CTDB specific files on the system. This may yield to unexpected side effects if those nodes are moved to a different cluster.
- Work around: Manuel cleanup of "tdb" files in /var/lib/samba
- Problem trigger: Run "mmchnode --ces-disable" on a CES node which has the SMB protocol installed. The expectation is that all protocol specific configuration files are removed, but that is not the case. There are remaining "tdb" files which were not deleted.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: CES
- Customer Impact: High Importance: an issue which might cause a degradation of the system in some manner IJ10985
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: TCT may fail to retrieve data from the cloud, so that the file appears empty or truncated in the local Spectrum Scale file system. This has only been seen during testing when the GPFS daemon process is deliberately interrupted during a TCT recall data from cloud operation.
- Work around: None
- Problem trigger: Mmfsd process is killed, interrupted, or possibly, stressed.
- Symptom: A file that was stored in the TCT Cloud, appears to be empty or truncated.
- Platforms affected: All system with TCT installed and in use.
- Functional Area affected: TCT
- Customer Impact: Critical for customers using TCT. IJ10986
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM with verbs RDMA does not work due to the way AFM changes the thread credentials during the replication.
- Work around: None
- Problem trigger: Always happens when RDMA+AFM is enabled with NSD backend.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux Only
- Functional Area affected: AFM
- Customer Impact: High Importance IJ10530
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File system snapshot create or delete commands don't return for a long time when DMAPI operations are busy, causing a file system outage because the file system is quiesced during the snapshot create or delete operation.
- Work around: Restart Spectrum Scale on DMAPI session node, or wait for the completion of the in-progress DMAPI operations.
- Problem trigger: After DMAPI is enabled and being busy with access operations, do snapshot create or delete operations.
- Symptom: File system outages that no access to it is allowed.
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ10821
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Under very low maxFilesToCache (100) and low maxStatCache (4k) settings, certain race windows were exposed which resulted in kernel panics/daemon crashes on nodes with LROC devices.
- Work around: none
- Problem trigger: On nodes with LROC devices, under extremely low stat-cache settings, certain race windows are exposed which cause daemon/kernel crashes
- Symptom: Abend/Crash
- Platforms affected: x86_64-linux only, those that support LROC
- Functional Area affected: LROC
- Customer Impact: Critical (could cause data corruption) IJ10962
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File timestamps are not of nanosecond granularity if the timestamp is set from an NFS client.
- Work around: none
- Problem trigger: Change metadata of the file.
- Symptom: File timestamps are not in nano second granularity.
- Platforms affected: Linux Only
- Functional Area affected: NFS
- Customer Impact: High Importance IJ10531
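For illustration only (invented names and example timestamp, not from the APAR), a minimal C sketch of setting a nanosecond-granularity mtime, as an NFS client ultimately does through a SETATTR; with the fix the nanosecond part is preserved:
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>

    static int set_ns_mtime(const char *path)
    {
        struct timespec times[2];
        times[0].tv_sec = 0;
        times[0].tv_nsec = UTIME_OMIT;   /* leave atime unchanged */
        times[1].tv_sec = 1542240000;    /* mtime: seconds part */
        times[1].tv_nsec = 123456789;    /* mtime: nanosecond part */
        return utimensat(AT_FDCWD, path, times, 0);
    }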
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On an LROC-enabled GPFS node, a missing invalidation could lead to a stale buffer being used during a later read.
- Work around: One can avoid the problem by disabling LROC.
- Problem trigger: On a GPFS client with LROC enabled, an application uses mmap to perform reads/writes or uses a mix of buffered and direct I/O to perform reads/writes.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux
- Functional Area affected: LROC
- Customer Impact: Critical IJ10573
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlsquota command fails on AIX for users that belong to more than 128 groups.
- Work around: None
- Problem trigger: Starting in AIX 7.1, the maximum number of groups that a user can be part of increased to 2048; previously it was 128. The mmlsquota command code needs to be updated to handle users that are members of more than 128 groups.
- Symptom: On AIX, when a user who is a member of more than 128 groups runs the mmlsquota command, the command fails with E_INVAL.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: Suggested IJ10979
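For illustration only (not from the APAR), a minimal C sketch of checking how many supplementary groups the calling user has, which on AIX 7.1 and later can exceed the old 128-group limit:
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int n = getgroups(0, NULL);      /* size 0: return the count only */
        if (n < 0) {
            perror("getgroups");
            return 1;
        }
        printf("supplementary groups: %d%s\n", n,
               n > 128 ? " (exceeds the old 128-group limit)" : "");
        return 0;
    }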
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The online replica compare function (mmrestripefs/mmrestripefile with the -c option) could report a false replica mismatch on the last data block of a file. This is more likely to happen on files in a snapshot.
- Work around: Use offline fsck with -c option to perform replica compare.
- Problem trigger: Run online replica compare function (mmrestripefs/mmrestripefile -c option) on GPFS 4.1.0.0 - 4.2.3.7
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance IJ10981
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A single-thread self-deadlock happened when fine-grained QoS statistics is enabled.
- Work around: Disable the fine-grained QoS statistics and then restart GPFS on the problem node.
- Problem trigger: Fine-grained QoS statistics is being used and QoSed program is running.
- Symptom: Stuck I/O.
- Platforms affected: ALL Operating System environments
- Functional area affected: QoS
- Customer Impact: Critical IJ11180
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS daemon died after an application used an invalid data buffer to append data to a file in GPFS.
- Work around: Check application to avoid invalid data buffer
- Problem trigger: When an application appends data to a file using an invalid data buffer (for example, one that is not big enough or is entirely invalid), in some cases the kernel fails to transfer the data from the user-space buffer into the GPFS pagepool; as a result a bad buffer descriptor is left behind, leading to the assert during a later data flush.
- Symptom: GPFS daemon crash.
- Platforms affected: ALL
- Functional area affected: All
- Customer Impact: High Importance; GPFS will crash and the file system will be unmounted. IJ11181
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmlsfirmware missing drive information.
- Work around: None
- Platforms affected: ESS/GSS configurations.
- Functional area affected: ESS
- Customer Impact: Suggested IJ10985
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ10474 IJ10527 IJ10528 IJ10529 IJ10530 IJ10531 IJ10555 IJ10556 IJ10573 IJ10741 IJ10752 IJ10753 IJ10775 IJ10796 IJ10808 IJ10810 IJ10811 IJ10812 IJ10813 IJ10814 IJ10821 IJ10962 IJ10971 IJ10972 IJ10973 IJ10974 IJ10975 IJ10976 IJ10977 IJ10979 IJ10981 IJ10982 IJ10984 IJ10985 IJ10986 IJ11180 IJ11181.
Problems fixed in IBM Spectrum Scale 4.2.3.11 [September 27, 2018]
- Fix logAssertFailed: !maybeAllBs zip-vfs.C IJ09206
- Access to a file (or directory) may be rejected (permission denied) on AIX even when the associated NFSv4 ACL permits the access type in an explicit entry. Work around: Add the necessary permission to one of the special entries (owner, group, or everyone). Problem trigger: On an AIX node, an NFSv4 ACL grants file access when neither owner, group, or everyone entries grant a similar access. Symptom: Unexpected Results/Behavior Platforms affected: AIX/Power only. Functional Area affected: All Scale Users. Customer Impact: Suggested IJ09209
- GPFS daemon might crash on a quorum node caused by a Signal 6 (abort) when changing the CCR quorum configuration via 'mmchconfig tiebreakerDisks' and/or 'mmchnode --nonquorum'. Work around: None. Problem trigger: GPFS daemon crash might occur (on a quorum node) when changing the CCR quorum configuration via 'mmchconfig tiebreakerDisks' and 'mmchnode --nonquorum -N ...'. Symptom: GPFS daemon crash on a quorum node. Platforms affected: ALL Operating System environments. Functional Area affected: CCR. Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner. IJ09197
- After metadata disk outages, validation errors may be seen, leading to filesystem panic. Messages similar to the following may be an indication of this: Inconsistency in file system metadata. [X] File System gpfs1 unmounted by the system with return code 212, reason code 214. File system: gpfs1 Reason: SGPanic. Work around: Avoid using the mmchdisk command to change metadata disk status while a workload is running (unmounting the filesystem everywhere first may reduce the likelihood of encountering this condition). Problem trigger: Adding ACLs (Access Control Lists) while the mmchdisk command is changing metadata disk state. Symptom: Operation failure due to FS corruption. Platforms affected: ALL Operating System environments. Functional Area affected: All Scale Users. Customer Impact: Critical. IJ09215
- Stripe group was panicked while a node is in the middle of stripe group takeover. The thread that handles the stripe group panic is supposed to select a new manager but is blocked by the thread handling the original stripe group takeover. Work around: None. Problem trigger: It is a rare condition where takeover RPCs were handled in a different order. Symptom: Long waiters and a filesystem with no filesystem manager. Platforms affected: ALL Operating System environments. Functional Area affected: All scale users. Customer Impact: This is a very rare condition and has little impact on customer operation. IJ09221
- The fix improved error message for debugging purpose only when a deleted node was detected still up. When a node was detected belonging to two clusters, the fix asserts the node from the cluster manager instead of shutting down the local node that detected the moved node. IJ08525
- The command mmshutdown might cause a Kernel assert in gpfsCleanup(). IJ08445
- While reading a clone file a node can crash when getting pool index from an openfile object which is not in cache. Work around: None. Problem trigger: Reading clone files. Symptom: Node crash. Platforms affected: All operating System environments. Functional Area affected: Clones. Customer Impact: Suggested: has little or no impact on customer operation. IJ09394
- File system manager node hit signal 11 when QOS is enabled on a file system which is mounted on hundreds of nodes. Work Around: Do not enable QOS on a file system if it has been mounted on hundreds of nodes, or limit the QOS on the specified number of nodes. Problem trigger: Enable QOS on a file system which is mounted on hundreds of nodes. Symptom: File system outage. Platforms affected: All Operating System environments. Functional Area affected: QOS. Customer Impact: High Importance. IJ08716
- When running more than one 'mmdf -P' instance on the same file system concurrently, like the commands below, the df output is inaccurate - it may falsely report 0% free space: mmdf gpfs1 -P system &, mmdf gpfs1 -P testpool &. Work around: Avoid running more than one 'mmdf -P' instance on the same file system concurrently. Problem trigger: When running more than one 'mmdf -P' instance on the same file system concurrently. Symptom: Error output. Platforms affected: ALL Operating System environments. Functional Area affected: All Scale Users. Customer Impact: Suggested. IJ08645
- Fix a bug that could cause the file size to be incorrectly updated to a smaller than expected value. This could happen if a node failure occurs while a hole is being punched at the end of the file. IJ08524
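For illustration only (invented names, not from the APAR), a minimal C sketch of the triggering operation, punching a hole at the end of a file:
    /* FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE,
     * so the file size should never shrink; the fixed bug could leave a
     * smaller size if a node failed at this point. */
    #define _GNU_SOURCE
    #include <fcntl.h>

    static int punch_tail(int fd, off_t tail_start, off_t tail_len)
    {
        /* Deallocate tail_len bytes at the end of the file while
         * keeping the reported file size unchanged. */
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         tail_start, tail_len);
    }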
- Fix an issue in AFM environment where leading spaces in file names causes recovery to fail. IJ08684
- Sometimes the NFS Ganesha service is stopped after the file systems are unmounted. Work Around: None. Problem trigger: When an mmshutdown procedure is started, the "unmount all filesystems" procedure is started as a background process and triggers the shutdown of a list of CES modules like "shutdown NFS", "shutdown SMB", "shutdown network", etc. If this is not finished within a timeout (60 sec in the reported case), the mmshutdown procedure continues anyway with unmounting the filesystems. In the reported case the unmount of the sharedroot filesystem finished before the queued "shutdown NFS" command was executed, and therefore the still-running NFS instance terminated immediately. Symptom: Abend/Crash. Platforms affected: Linux Only. Functional Area affected: CES. Customer Impact: has little or no impact on customer operation. IJ09225
- The ping thread is in a loop to set the STOPPED state, and the loop count was unexpectedly huge due to invalid ranges of the use count. Work Around: None. Problem trigger: When a Pcache handler is being deleted. Symptom: Deadlock. Platforms affected: Linux Only. Functional Area affected: AFM. Customer Impact: Suggested. IJ09229
- Fix a rare-case AIX kernel crash that can happen if a GPFS utility (like "tsctl nqStatus") calls a kx API in a small time window when the kernel extension is loaded but has not been initialized yet. IJ08523
- The file decompression failed when using lz4 compression algorithm and hit FSErrBadCompressBlock Structure Error when assertOnStructureError is enabled. Work Around: None. Problem trigger: The file is compressed with lz4 algorithm but only hit this problem when some data of the file is compressed with ratio bigger than 1.0. Symptom: Crash/IO error. Platforms affected: All Operating System environments. Functional Area affected: File compression. Customer Impact: High Importance. IJ09236
- When the network is busy or very slow, there can be a message in the mmfs.log like the one below: [E] Timed out in 300 seconds waiting for a commMsgCheckMessages reply from node node_ip node_name. Sending expel message. This results in the node being expelled from the cluster. Work around: None. Problem trigger: Heavy network load, such as lots of read and write requests to the NSD server. Symptom: Node expel/Lost Membership. Platforms affected: ALL Operating System environments. Functional Area affected: Cluster Membership. Customer Impact: High Importance. IJ08518
- Fix logAssertFailed: nConns == nReplyConns. IJ09203
- Changed code in the CCR to speed-up the cluster manager election. IJ08529
- Fix LOGASSERT(compactP != NULL) fails or signal 11 at CacheObj::releaseLastHold. IJ08519
- The mmdf command hung and never returned. Work Around: None. Problem trigger: Not clear. It might happen if only one or a few nodes have the file system mounted when running the mmdf command. Symptom: mmdf command hang. Platforms affected: All Operating System environments. Functional Area affected: Admin command. Customer Impact: High Importance. IJ09392
- Daemon assert going off: Assert exp(lockmode == LkObj::nl) in fsop.C resulting in daemon restart. Work Around: None. Problem trigger: Write failure during the replication, ex. network issue. Symptom: Abend/Crash. Platforms affected: Linux Only. Functional Area affected: AFM. Customer Impact: High Importance. IJ09310
- The input sanitization for an internal command was insufficient. Work around: None. Problem trigger: Running mmlsfileset with the -F parameter on a file that does not have the right format. Symptom: Error output/message. Platforms affected: ALL Operating System environments. Functional Area affected: Admin Commands. Customer Impact: Critical. IJ08232
- Fix fastpathDisableWait deadlock. IJ07845
- Fix fsck: Assert in line 539 of file ts/pfsck/cache.C. IJ08445
- Fix ADR: hit Signal 11 FsckDirCache::getInode. IJ08445
- Assert going off: !addrdirty or synchedstale or alldirty. Work Around: None. Problem trigger: Certain customer workload can run into the problem in a specific code path when the part of the allocated disk space beyond the end of the file is not zeroed out. It's rare and timing related. Symptom: Abend/Crash. Platforms affected: all. Functional Area affected: All Scale Users. Customer Impact: High Importance. IJ09204
- The input sanitization for an internal command was insufficient. Benefits of the solution, in customer terms: Improve input sanitization to prevent unexpected errors in case of invalid parameters. Work around: None. Problem trigger: Supplying mmlspool with an excessive number of pool names. Symptom: Error output/message. Platforms affected: ALL Operating System environments. Functional Area affected: Admin Commands. Customer Impact: Suggested. IJ08214
- Long waiters/deadlock with message like "Thread waiting for SG cleanup". Work Around: None. Problem trigger: Remote site or NFS server at remote site became unresponsive. Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters. Platforms affected: Linux Only. Functional Area affected: AFM, AFM ADR. Customer Impact: High Importance. IJ09313
- Fixed an issue in AFM environment where AFM does not set the cached bit on the small sized files even though file is fully cached. IJ08521
- Add a new option --read-only for mmrestripefile -c. Work Around: Users have to use mmrestripefs -c --read-only otherwise. Problem trigger: Users who want to check if a file has any replica mismatch without repairing it but do not want to use mmrestripefs -c --read-only are potentially affected. Symptom: Unexpected Results/Behavior. Platforms affected: ALL Operating System environments. Functional Area affected: All Scale Users. Customer Impact: Suggested. IJ09317
- The hardlink relation is not consistent even when performing prefetch, and files show different inodes at the cache. Work Around: Prefetch needs to be the first operation performed on the fileset, before doing any ls/stat kind of operations; only then can the hardlink relation be maintained. Problem trigger: The hardlink relation is broken at the cache in AFM. Symptom: Fetched hardlink files have different inode numbers at the cache. Platforms affected: Linux Only. Functional Area affected: AFM. Customer Impact: Suggested. IJ09318
- Fix a crash on protocol node while cleaning up the deferred locks at time Ganesha kill. IJ09545
- Values like "01" or "02", etc. are accepted as arguments for the "mmces log level" command, but yield to a "No such file or directory" error message finally. Work Around: provide the correct one-digit log level numbers. Problem trigger: any number with leading zeros These values were checked for an integer range between 0 and 3, which was passed. 01, 001 etc. is valid as a numeric '1'. However those values were used in some code branches as strings, where it makes a difference if '1' is used or '01'. So the failure was triggered because of that. Symptom: Error output/message. Platforms affected: Linux Only. Functional Area affected: CES. Customer Impact: has little or no impact on customer operation. IJ09333
- Fix a potential signal 11 problem that might occur when running mmrestripefs -r. Work Around: The problem was caused by invalid DAs, so changing the DA manually could fix the problem too. Problem trigger: Users whose files contain invalid DAs and who will run mmrestripefs -r are potentially affected. Symptom: Unexpected Results/Behavior. Platforms affected: ALL Operating System environments. Functional Area affected: All Scale Users. Customer Impact: Suggested. IJ08643
- RDMA+AFM does not work for NSD protocol due to the way AFM changes the thread credentials during the replication after opening the remote file. Work Around: None. Problem trigger: Always happens when RDMA+AFM is enabled with NSD backend. Symptom: Unexpected Results/Behavior. Platforms affected: Linux Only. Functional Area affected: AFM. Customer Impact: High Importance. IJ0864
- GPFS daemon crash resulting in loss of file system access when mmdiag --network -Y. Work Around: Don't run mmdiag --network -Y. Problem trigger: Invocation from command line of mmdiag --network -Y. Symptom: Abend/Crash. Platforms affected: ALL Operating System Environments. Functional Area affected: All Scale Users. Customer Impact: High Importance. IJ08720
- File mtime might not be in sync between cache/primary and home/secondary when memory-mapped files are modified. Work Around: None. Problem trigger: Always when memory-mapped files are written. Symptom: Unexpected Results/Behavior. Platforms affected: Linux Only. Functional Area affected: AFM, AFM ADR. Customer Impact: High Importance. IJ09347
- Read on AFM cache fileset fails with error 5 or 22 while using the GPFS backend. Work Around: None. Problem trigger: A control file inode is reused after the AFM was disabled at the home/secondary site, and the control file attribute is not reset. Symptom: Unexpected Results/Behavior. Platforms affected: Linux Only. Functional Area affected: AFM and AFM ADR. Customer Impact: High Importance. IJ09372
- GNR VCD recovery takes much longer due to a limited number of read threads. Work Around: None. Problem trigger: During RG recovery with a large amount of GNR VCD (vdisk configuration data). Symptom: Performance Impact/Degradation. Platforms affected: ALL Operating System environments. Functional Area affected: ESS/GNR. Customer Impact: High Importance. IJ09385
- GPFS assert going off: exp((threadFlags & 0x0002) == 0) in file tscomm.C. Work Around: None. Problem trigger: When ESS uses RDMA to write to the log tip device and the write fails. Symptom: Abend/Crash. Platforms affected: ALL Operating System environments. Functional Area affected: ESS/GNR. Customer Impact: High Importance. IJ08714
- Windows ACLs of children folders and files could display incorrect inheritance flags. When a parent directory has both inherited as well as explicit ACLs set on it, a newly created folder/file under this parent will be correctly assigned any inheritable ACEs from the parent. However, the inheritance flags on these inherited ACEs could become inconsistent resulting in the Windows Explorer Security interface displaying wrong inheritance behavior. Also, the root directory of a GPFS drive is incorrectly allowing its ACLs to be changed. Work Around: None. Problem trigger: Usage of complex ACLs in a deeply nested directory structure, wherein an intermediate parent folder has both inherited and explicit ACLs. Attempting ACL modification of GPFS root directory from a Windows node. Symptom: Unexpected Results/Behavior. Platforms affected: Windows only. Functional Area affected: Windows ACLs/Inheritance. Customer Impact: Moderate/Suggested. IJ08573
- Data upload to an existing Salesforce ticket via the command mmcallhome run SendFile --file <file> --pmr <ticket> fails if a proxy is used. Work Around: If it is possible to bypass the proxy and contact the IBM Service server directly, disable the proxy usage by running the command mmcallhome proxy disable, then rerun the mmcallhome run SendFile --file <file> --pmr <ticket> command. If it is not possible to connect to the IBM Service server directly, store the file on a different PC that has access to the IBM Service web portal and upload it there. Problem trigger: This issue affects customers running any IBM Spectrum Scale release with a proxy server configured to be used. Symptom: Error output/message: "Could not create an ECuRep session for the upload". Platforms affected: ALL Operating System environments. Functional Area affected: Callhome. Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability. IJ09457
- "mmhealth node show -a -v -Y" crashes with a message "IndexError: string index out of range" in /var/adm/ras/mmsysmonitor.log. Work Around: Do not run "mmhealth node show" without explicitly specifying a component. Problem trigger: Happens always when the specified command is executed. Symptom: Abend/Crash. Platforms affected: ALL Linux OS environments. Functional Area affected: System Health. Customer Impact: Suggested. IJ09393
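  A sketch of the documented workaround (the file name and ticket number are hypothetical placeholders):
      mmcallhome proxy disable
      mmcallhome run SendFile --file /tmp/collected_debug_data.tar --pmr <ticket>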
- Under rare circumstances it can happen that mm-commands take unexpectedly long (up to 2 minutes), caused by slow CCR RPCs between the CCR server and client. Work around: None. Problem trigger: The CCR server expects a final RPC handshake that the client does not provide. Symptom: Performance Impact/Degradation. Platforms affected: Only seen on a Linux OS environment (RHEL). Functional Area affected: Admin Commands and CCR. Customer Impact: High Importance. IJ09231
- The filesystem check module of the Python monitor reported a code error. Work Around: Add 'import os' in the header area of the file /usr/lpp/mmfs/lib/mmsysmon/GpfsUtils.py and restart the monitor to solve the issue. Problem trigger: Repeating issue, since the monitor cycle runs every few seconds. The issue shows up if /var/mmfs/etc/ignoreAnyMount* or /var/mmfs/etc/ignoreStartupMount* files are declared to control the mount behavior of filesystems. Symptom: Error output/message. Platforms affected: ALL Operating System environments. Functional Area affected: System Health. Customer Impact: Suggested: medium impact. The filesystem monitoring has gaps considering the mount control files. IJ09239
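  A sketch of the documented workaround, assuming the first line of the module is safe to prepend to and that mmsysmoncontrol restarts this monitor:
      sed -i '1i import os' /usr/lpp/mmfs/lib/mmsysmon/GpfsUtils.py   # add the missing import
      mmsysmoncontrol restart                                         # restart the monitor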
- mmchfirmware failing in GPFS 4.2.3. Running mmchcarrier rg_ss2siteio10-ib --replace --pdisk 'e2d2s04' produces: mmchcarrier : [I] Preparing a new pdisk for use may take many minutes. /usr/lpp/mmfs/bin/tspreparenewpdiskforuse[105]: checkGNRDriveBlocksize: not found [No such file or directory] Resuming pdisk e2d2s04 of RG rg_ss2siteio10-ib. Carrier resumed. This fix applies to GSS/ESS customers. IJ09230
- Erroneous display of the event "ib_rdma_port_width_low". Work Around: On each affected node, edit the /var/mmfs/mmsysmon/mmsysmonitor.conf file, add "ib_rdma_monitor_portstate = false" to the "[network]" section, and restart monitoring with "mmsysmoncontrol restart". Problem trigger: Running Spectrum Scale > 5.0.1 and an IB driver which causes ibportstate to report a LinkWidth of "undefined (19)". Symptom: Unexpected Results/Behavior. Platforms affected: ALL Linux OS environments. Functional Area affected: System Health. Customer Impact: Suggested: has little or no impact on customer operation. IJ09587
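  The documented workaround, as a sketch to apply on each affected node:
      # in /var/mmfs/mmsysmon/mmsysmonitor.conf
      [network]
      ib_rdma_monitor_portstate = false
      # then restart monitoring
      mmsysmoncontrol restart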
- This update addresses the following APARs: IJ07845 IJ08214 IJ08232 IJ08445 IJ08518 IJ08519 IJ08521 IJ08523 IJ08524 IJ08525 IJ08529 IJ08573 IJ08643 IJ08644 IJ08645 IJ08684 IJ08716 IJ08720 IJ08741 IJ09023 IJ09024 IJ09197 IJ09206 IJ09209 IJ09215 IJ09221 IJ09225 IJ09229 IJ09230 IJ09231 IJ09236 IJ09239 IJ09310 IJ09313 IJ09317 IJ09318 IJ09333 IJ09347 IJ09372 IJ09385 IJ09392 IJ09393 IJ09394 IJ09457 IJ09545 IJ09587.
Problems fixed in IBM Spectrum Scale 4.2.3.10 [July 27, 2018]
- Fix an assert exp(de.getNameP()[0] != 0) in line 654 of file /project/sprelbmd1/build/rbmd11628b/src/avs/fs/mmfs/ts/fs/direct.C which can occur during fsck. IJ07096
- Fix an assert: !"search long and hard in getSnapP" which can occur during fsck. IJ07096
- Fix an assert exp(readRepIndex == -1 || (readRepIndex >= 0 && readRepIndex < 4 && daArr[readRepIndex] != (*DiskAddr::invalidDiskAddrP))) in line 8570 of file /project/sprelttn/build/rttn1632c/src/avs/fs/mmfs/ts/pfsck/cache.C which can occur during fsck. IJ07096
- Fix a sig8 on FsckDirCache::readBlockDA which can occur during fsck. IJ07096
- Fix an Assert exp(!"Assert on Structure Error") in line 362 of file /project/sprelttn424/build/rttn4241730c/src/avs/fs/mmfs/ts/logger/Logger.C which can occur after running fsck. IJ07096
- fix an Assert exp(!"Assert on Structure Error") in line 365 of file /project/sprelttn423/build/rttn423s005a/src/avs/fs/mmfs/ts/logger/Logger.C which can occur during a very stressful file system restore. IJ07096
- Fix corruption that can occur during very stressful create, list, delete snapshots and filesets. IJ07096
- Fix an assert exp(!"Assert on Structure Error, called from the kernel") in line 693 of file /project/spreltac500/build/rtac5001746a/src/avs/fs/mmfs/ts/logger/Logger.C. This can happen after a node failure. IJ07096
- Fix an ("mallocSize < SEGSIZE" assert) that can happen on AIX if a very large ACL file exceeds SEGSIZE. IJ07096
- Fix a problem in which "mmfsadm dump nsd" shows incorrect data. This can happen when there is a heavy workload on the file system and it's NSDs have multiple servers and the primary server is failed. IJ07175
- Fix an ESS and GSS deadlock that can occur during an RG failover. IJ07096
- Fix a problem in which a quorum node cannot join a cluster. This can occur when there are very many log asserts on the quorum nodes. IJ07096
- Fix an issue in the AFM environment where a gateway node crashes if a remote is not responding. IJ07096
- Fix an mmlsquota endless loop that can occur if the mmlsquota command is invoked with a syntax error. IJ07176
- Fix an Assert exp(!synchedStale) in line 2770 of file bufdesc.C. This can occur during an uncompress failure. IJ07096
- Fix the assert exp(secSendCoalBuf != __null && secSendCoalBufLen > 0) that can occur doing secure sending. IJ07096
- Fix deadlock FileBlockWriteFetchHandlerThread: on ThCond. This can occur when there is a remote mount. IJ07096
- Fix a problem in which mmrestoreconfig failed because subcommand mmcheckquota failed with E_NODEV. IJ07096
- Fix logAssertFailed: thisSnap.isSnapOkay() || thisSnap.isSnapEmptying() || thisSnap.getSnapId() == sgP->getEaUpgradeSnapId(), which can occur when deleting a snapshot.
- Fix a hang, Waiting 10397.9894 sec since 19:32:18, monitored, thread 28099 CommandMsgHandlerThread: on ThCond. This can occur when the file system is being suspended. IJ07096
- Fix a kernel crash Oops: Kernel access of bad area, sig: 11 which can occur when a GPFS filesystem is exported through NFS and there is a heavy locking work load. IJ07096
- Fix "Assert exp(!synchedStale)" that can happen during access of compressed files. IJ07096
- Fix Assert !addrDirty OR synchedStale OR allDirty bufdesc.C 7416. This can occur when compression is involved. IJ07096
- Fix a deadlock SGExceptionAdjustServeTMThread on(MsgRecordCondvar) which can occur during the unmount of the file system because of a Stripe Group panic. IJ07096
- Fix assert exp(Remote ASSERT from node <node>: SGNotQuiesced snap 9/0 ino 2851912 reason 1 code 0) in line 3447 of file /project/spreltac501/build/rtac5011814e/src/avs/fs/mmfs/ts/cfgmgr/sgmrpc.C. This can happen taking snapshots of AFM filesets. IJ07096
- Fixed a hang condition on Linux when mmfsd is executed from a shell. Msg handler sgmMsgTMServe: on ThCond. IJ07096
- Fix an unexpected deadlock/breakup which occurred after long waiters disappeared. IJ07096
- Fix a deadlock with long waiters stuck in either 'makeFreeLogSpace wait for log wrap' or 'Waiting for UpdateLogger data update or read to complete'. This could only occur on HAWC enabled file system. IJ07096
- Fix GPFS assert "logAssertFailed: !isRead" happened when doing data prefetch. IJ07178
- Fix Assert exp(slotsFree + slotsUsed == totalSlots) in line 1112 of file /project/spreltac501/build/rtac5011816d/src/avs/fs/mmfs/ts/pfsck/checkacl.C that can happen during mmfsck. IJ07096
- Fix a rare case logAssert "Assert:(indIndex & 0xFF00000000000000ULL)==0 IndDesc.h" which can happen when writing beyond EOF of a file which has lots of EA entries. IJ07096
- Fix a problem in which, when the NSD type is updated via the mmchconfig updateNsdtype command, the NSD type of the tiebreaker disks in the CCR cluster was not being updated. IJ07096
- Fix a Sig 11 in tmMonitorStorageLevelThread. IJ07180
- Fix a problem where GPFS can potentially get stuck on dumping kernel thread stack during file system panic. IJ07177
- Fix a problem in which certain large values for mmchattr --compact are incorrectly rejected. IJ07096
- Fix LOGASSERT(getPhase() == snapCmdDone) which can happen if more than one request to delete the same snapshot runs concurrently and the fs SG panics during the delsnapshot process. IJ07096
- This fix makes pCacheStaleCheckTimeout configurable through mmfsadm afm staleCheckTimeout. IJ07096
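  A hedged sketch of the new tunable (the argument form and the value of 120 seconds are assumptions; the entry only confirms the "mmfsadm afm staleCheckTimeout" path):
      mmfsadm afm staleCheckTimeout 120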
- Fix a problem in which prefetch doesn't emit a list of the failed files that were missed being pulled to cache from home. IJ07181
- Address an issue in Prefetch (migration) where filenames containing '\\' and '\\n' characters need to be handled better. Also address an issue in tsbuhelper to generate list files better at home when filenames contain '\\n' character in them. IJ07096
- Fix a problem in which mmchfirmware prints extraneous output if a vendor-supplied firmware loader is sending output to stdout. This fix applies to Lenovo GSS/DSS customers. IJ07096
- Fix a timing problem where sometimes the mount of file system on local node fails because the node is leaving a remote cluster. IJ07096
- Fix a bug in "mmkeyserv server update" that may cause encryption policy fail to option the key store file. IJ07096
- Fix a crash of the sysmonitor. This can happen if gpfs.callhome is uninstalled. IJ07096
- Fix a problem where mmchdisk incorrectly requires disks in the 'system.log' pool to be 'dataOnly'. IJ07096
- Code fix to solve a problem where adding a DA to an existing recovery group results in the DA being stuck in an in-transition state until the daemon is restarted. IJ07096
- The manpage for the 'mmchconfig' command ('subnets' section) has been updated to describe the limitation on the number of subnets a given node may be part of. IJ07096
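  For example, a minimal subnets setting (addresses and cluster name are hypothetical; see the mmchconfig manpage for the full syntax and the per-node subnet limit):
      mmchconfig subnets="192.168.2.0 10.10.0.0/cluster2.example.com"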
- Fix the AIX kernel crash problem happening during I/O against inconsistent compressed files. IJ07182
- Fix an issue in the AFM environment where, if the root user has a supplementary GID greater than 42949676, replication might fail and messages are requeued. IJ07183
- Fix a problem in which recovery is being triggered on a fileset that is in a stopped state. IJ07096
- Fix a viInUse assert that can occur during an NFS workload. IJ07184
- Fix a problem in which AFM recovery fails with error code 2. This can happen if a directory has special characters (like "?{?J?X?`?W?b?Y"). IJ07417
- Fix a problem in which on RHEL7.5 file operations like functions readdir and iterate fail and cause EBADHANDLE (521) errno for some kernel NFS scenarios. IJ07096
- This fix is for customers that use a mixed cluster with a minimum release level lower than 4.2.2-0. It fixes the machine-readable output of the mmhealth node show command, which also causes false or inconsistent information in the GUI. IJ07096
- Fix logAssertFailed: "useCount >= 0" in file alloc.h. This can occur if you run mmrestripefile -c repeatedly. J07418
- Fix an issue in the AFM environment where control file setup used for transferring EAs/ACLs might hang if remote is not responding. This causes node to run out of RPC handler threads to handle all the incoming messages. IJ07752
- Fix a problem in which mmapplypolicy -L 3 shows garbage characters. IJ07936
- This update addresses the following APARs: IJ07096 IJ07175 IJ07176 IJ07177 IJ07178 IJ07180 IJ07181 IJ07182 IJ07183 IJ07184 IJ07269 IJ07417 IJ07418 IJ07752 IJ07936.
Problems fixed in IBM Spectrum Scale 4.2.3.9 [June 8, 2018]
- Fix a deadlock that can occur on a rapid-repair-enabled fs when there are down disks with data replicas saved on them and the mnode of the file is undergoing a takeover. IJ06045
- Remove an assert for the code paths of striped recovery logs. IJ06015
- Fix critical command failures that can occur during thread pool exhaustion. IJ06046
- Fix a structure error that can occur when getting block disk addresses for snapshot files. This issue could only happen when GPFS APIs are being used to access the file's data. IJ06015
- Fix "Assert exp(0)" that can occur when running mmfsck in non-verbose mode and mmfsck detects corruption but does not run to completion and errors out in between due to an error like the participating node failing. IJ06015
- Fix the error "AFM: Cannot find snapshot link directory name for exported file system at home" for a file system. IJ06015
- Fix assert exp(inodeFlushFlag) openinst-vfs.C 1560. This can occur creating and deleting filesets while taking snapshots. IJ06015
- Fix an issue in the AFM environment where a gateway node crashes due to the race between threads doing lookup and revalidation of files in the same directory. IJ06100
- Address a problem where race between two threads can cause the afmctl file FD to be invalidated and hence causing a daemon crash on the gateway node. This can occur running many setXattr operations on the fileset when the secondary fileset is stale. IJ06015
- Fix an issue in the AFM environment where resync/changeSecondary commands might not copy the dirty data from the cache or primary to home or secondary. IJ06015
- This fix sets "pmmonitor=N" in ZimonCollector.cfg. IJ06015
- Fix the potential data corruption issue when compression and LROC feature are being used. IJ06015
- This fix improves gpfs.snap performance on large clusters.
- Fix assert exp(!addrDirty || synchedStale || allDirty) bufdesc.C 7350. This was seen in a Ganesha/NFS environment. IJ06047
- This fix adds support for the new configuration option "grantOwnerDeletePermission". IJ06049
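  A hedged sketch, assuming the new option is set like other daemon configuration attributes and accepts a yes/no value (the entry confirms neither):
      mmchconfig grantOwnerDeletePermission=yes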
- Fix a problem in which a hanging NFS process (Ganesha) was not clearly detected. IJ06015
- Fix a kernel crash issue due to the assert "err != E_NOT_METANODE" that can occur when mmap reading a compressed file. IJ06015
- Spectrum Scale CLI was fixed to allow setting 'desired' for smb option "smb encrypt". IJ06015
- Address a problem where STOP command on the fileset (mmafmctl stop) can cause deadlock when there's a parallel Write in the queue taking the SplitWrite path. IJ06015
- This fix will change the mmhealth node show -Y output so that the GUI is able to process specific health events again, that weren't in the right format in a mixed cluster environment. This affects only clusters with a cluster minRelease level lower than 4.2.0 and nodes higher than or equal to 4.2.0. Affected events are: for example, all pool_ and pool- events of the FILESYSTEM component. IJ06218
- Package gpfs.gskit updated to version 8.0.50.86. IJ05666
- As clone is not supported for DR filesets, mmclone snap will now return an error on use. IJ06015
- Fix false positive mmhealth event ads_failed on POWER systems under high load. IJ06015
- Fix an issue in the AFM DR environment where deleted files are not copied to the RPO snapshot at secondary if primary is running recovery out of the recovery+RPO snapshot. IJ06015
- Fix an issue in the AFM environment where failover/resync runs slower for write operations due to connecting the file dentry to the parent. IJ06015
- Fix an issue in the AFM environment where usage of functions dm_read_invis() and dm_write_invis() on AFM filesets results in data corruption. IJ06015
- Fix a problem in which "mmcallhome group auto" only created 1 group and then returned an error. This can occur on SLES nodes. IJ06015
- Fix a problem in which call home stopped sending daily data on Ubuntu after an upgrade. IJ06015
- Fix an issue in the AFM environment where AFM reads the file from home incorrectly if the data replication factor at cache is greater than one. IJ06269
- Fix AFM to be able to migrate EAs and ACLs from a read-only export at home during AFM local-updates mode when the fileset is read-only. IJ06364
- Fix a problem in which mmcrfs --profile returns an error when both defaultMetadataReplicas and maxMetadataReplicas are specified in the profile. IJ06207
- Fix an mmimgbackup assert when there is a symbolic link and the full pathname length is 1023 bytes. IJ06224
- Fix a FSErrBadCompressBlock structure error. This problem only happens on small compressed files after expanding them, when the compressed file has holes. IJ06015
- Fix a problem in which running with an invalid gpfs.smb package within a CES cluster can cause issues, up to data corruption, for files accessed via the SMB protocol. IJ06015
- This fix introduces a second default encryption configuration string to improve performance. IJ06015
- Fix IO hangs that can occur after rebooting a node. IJ06015
- Fix a signal 11 in PaxosServer::handleCheck. IJ06015
- Fix error in `/usr/lpp/mmfs/bin/mmfsd': double free or corruption (!prev): signal 6 that can occur running mmfsck. IJ06015
- Fix "mmsysmonc: error: no such option: -1" for callhome callbacks. IJ06015
- Fix a problem in which ACL changes at the server were not reflected to the NFS client. There was no upcall generated for the ACL change by mmputacl/mmeditacl, so Ganesha is not able to revalidate the inode and returns the old cached ACLs. IJ06015
- Fix a mmfsd shutdown that is caused by running out of memory. This can occur on a very heavy load of fsync and stat calls. IJ06544
- Fix a problem in which gpfs_set_share was being improperly rejected. IJ06513
- Fix translation of POSIX ACLs applied on a GPFS Unix node to access permissions on a GPFS Windows node. IJ06545
- Fix a problem in which "mmhealth cluster show nodes -v" output was wrong. This can happen when the cluster state manager and the cluster manager is the same node and the node does a shutdown. IJ06015
- Fix a problem in which mmlsdisk failed. This can occur when /var/mmfs/etc/ignoreAnyMount exists. IJ06514
- ESS command enhancements: Added an optional gather entry specifier: ess = {all | ess-only | not-on-ess} with default = all. Added ESS-specific commands to daily.conf and weekly.conf. Deleted DefaulsDaily.ess.conf and DefaultWeekly.ess.conf. Deleted CallHome/src/doc/gpfsCallHome_GatherInfo.xls. IJ06015
- Fix a problem of tracing self starting when CCR is disabled, adminMode=allToAll, and mmsdrservPort=0. IJ06753
- Correct the free space reported by the 'df' command, which was including fragmented disk space. 'df' is supposed to report only free full disk blocks.
- This fix corrects an error in the mmlsenclosure command. This fix applies to GSS/ESS customers that have DCS3700 storage enclosures. mmlsenclosure is not displaying DCS3700 drawer control. IJ06015
- This fix disables writing protocol tracing debug messages to mmfs.log, since they were irrelevant to the user and inconsistently formatted. IJ06015
- Fix a problem in which all NFS activity is stopped. This can occur if node affinity is enabled and ces ips have node affinity tags. IJ06015
- Fix a logAssertFailed: !"Assert on Structure Error" and an unexpected file system unmount that can occur during a log wrapping while there is directory expansion. IJ06864
- This fix increases the maximum supported number of extra IP addresses to 64. IJ06770
- Fix the issue that the extra IP addresses cannot be propagated to other nodes. IJ06015
- Fix "mmcallhome info change" overwriting callhome settings, introduced by another mmcallhome commands, if several callhome commands were executed simultaneously. Fix "mmcallhome group add" creating dummy local group settings, which made the call home setup very confusing (some settings are global and some local). Fix "mmcallhome group add" allowing to add nodes without call home installed to call home groups. Fix "mmcallhome group add" allowing to set nodes without ECuRep connectivity as call home nodes. IJ06015
- Work around a GNR VCD (vdisk configuration data) inconsistency issue in which two vtrack tracks may map to the same physical location in very rare cases when recovering free ptracks, which causes RG recovery to fail with an error like "[E] Vdisk xxx recoverFreePTracks failure: Error 214 code 2063". With this fix, the RG can be recovered with minimal data loss versus losing the whole RG. IJ06857
- Fix the potential data corruption issue when compression and LROC feature are being used. IJ06252
- gpfs.snap: improve performance on large clusters. IJ06362
- This update addresses the following APARs: IJ05666 IJ06015 IJ06045 IJ06046 IJ06047 IJ06049 IJ06100 IJ06207 IJ06218 IJ06224 IJ06252 IJ06269 IJ06362 IJ06364 IJ06513 IJ06514 IJ06544 IJ06545 IJ06753 IJ06770 IJ06857 IJ06864.
Problems fixed in IBM Spectrum Scale 4.2.3.8 [April 12, 2018]
- Fix an assert with "logAssertFailed: (SGFilesetId)recordNum <= ((SGFilesetId)999999999)" that can occur when NFS clients access the same files in a snapshot of an independent fileset IJ04666.
- Fix a problem in which offline fsck deadlocks when orphaning inodes IJ04520.
- Fix an assert in BufferDesc::flushBuffer Assert exp(!addrDirty || synchedStale || allDirty inode 554192 block 10 addrDirty 1 synchedStale 0 allDirty 0 that can happen during shutdown IJ04520.
- Fix an assert 'reaperThreadStared == 1' that can occur during fsck IJ04520.
- Fix a problem that when recovery fails it does not return the correct error code; it returns error 2 no matter what. This can occur during fileset recovery when a remote mount is stale IJ04520.
- Fix a possible memory corruption that can occur when group quota information is retrieved by multiple clients concurrently IJ04520.
- Fix a failbacktoprimary --start from old primary failure. This can occur when there are RPO snapshot mismatches between the acting and old primary IJ04520.
- Fix a problem in which mmbackup would produce an empty shadow database file that only contained the file header. This can occur when the shadow database file is a binary file IJ04661.
- Fix a rare case of long waiters 'waiting for new SG mgr' which may happen if a file system has no external mounts and the 'tsstatus -m <device>' command runs on a fs manager node in a specific time window IJ04520.
- Fix a problem in which mmlsmount <fs> will always show that the fs is in the internal mount state on the SG mgr node. Also, mmfsadm dump stripe shows the incorrect state for the fs. This can occur when the SG manager node is switched from a manager node to a client node and back IJ04520.
- Fix a "struct error: Invalid XAttr overflow" IJ04520.
- Fix a problem in which the Ibmobjectizer fails to objectize files when there is a significant number of OpenStack projects/accounts (~5000 or more) IJ04660.
- Fix a problem in which the failback command fails but returns error 0 (no error). This can occur when the failback command is run on a non-IW fileset IJ04520.
- Fix an Assert exp(inodeFlushFlag) openinst-vfs.C 1560 which can occur creating and deleting filesets and snapshots during heavy IO IJ04520.
- Fix a problem in which a node is being expelled when there are multiple network reconnects occurring IJ04520.
- Fix a deadlock that can occur during file system repair IJ04520.
- Fix a problem in which mmfileid was unable to list small files IJ04655.
- Fix a mmapplypolicy/tsapolicy core dump: ThreadThing::check mutexthings.C:170. This can occur during certain failure or recovery scenarios IJ04520.
- Fix an issue in which the fileset failed to be recovered and left in needResync cache state. This can occur during recovery with a heavy workload and the file system is unmounted from the gateway node IJ04520.
- This fix adds improvements to the state monitoring code IJ04665.
- Fix a problem in which an IW fileset directory has been modified at home but the cache fails to validate and fetch the new change. This can happen following a recovery or failover that has been run on the IW fileset at the cache site IJ04520.
- Fix a "There is not enough free memory available for use by mmfsck in 192.168.110.35 (c35f2m4n16)." error that can occur running mmfsck in a continuous loop IJ04520.
- Fix a FSSTRUCT error which can occur when there is a race between expanding the first directory block on one node and prefetching of the same block on another node IJ04520.
- Fix a replica mismatch which can occur during restripefs -m or -r and disks are going down and coming up and nodes are going down and coming up IJ04658.
- Fix an assert that can occur initializing certain maintenance commands IJ04520.
- Fix a problem in filesystem recovery where it left the filesystem in a state that might cause other filesystem commands to hang IJ04520.
- Fix a rare timing assertion when the file system is force unmounted at the same time that quota files are being flushed to disk IJ04520.
- Fix corrupt entries being added to the Ganesha exports configuration file, which can occur during the mmnfs export add command if all white space is entered for the --client option's argument IJ04520.
- Fix a problem where on AIX the mmcrnsd call clears out the PVID that was assigned by the OS IJ04657.
- Fix code to avoid unnecessary file system panic and unmount on client nodes during mmchdisk start command. The file system panic/unmount could occur when a disk that has been started became unavailable again in the middle of mmchdisk start command IJ04656.
- Fix code so that mmfsd kill does not give IO error to NFS client IJ04520.
- Fix an issue where recovery was stuck on the local cluster due to Gateway node changes in a remote cluster environment IJ04520.
- Fix an exception in AclDataFile::findAcl() that can occur during a node being expelled IJ04520.
- Fix a "Error validating trailer version..." error that can occur when a RG resigns IJ04520.
- Fix an assert like "logAssertFailed: OWNED_BY_CALLER(lockWordCopy, lockWordCopy)" when trying to revive a defective pdisk in RGCK IJ04520.
- Fix a problem where reading symbolic links pointing to nothing at home can cause Assert at the cache site IJ04659.
- Fix assert "exp(isAllocListChanging())" which may occur during a SGPanic IJ04520.
- Fix a crash on a Gateway node that can occur while updating the policy attribute from home IJ04668.
- Fix an assert that can occur in an AFM environment while running mmunlinkfileset IJ04970.
- Fix the inode indirection level assert that can happen during the failure process of clone file creation IJ04520.
- Fix a "logAssertFailed: *nReservedP == 0, 5932, vbufmgr2.C" assert that can occur while vtrack data corruptions are being detected and fixed IJ04520.
- Fix a problem where a psnap creation on a gateway node, also serving as the FS manager, can deadlock when the fileset in question is in need of a recovery IJ04520.
- This fix enables buffer dirty bits debug data to be collected under "debugDataControl heavy" and trace level "fs 4" IJ04662.
- Fix a signal 11 that can happen when there is a race condition between daemon startup, file system mount and snapshot quiesce rpc handling IJ04520.
- Fix a deadlock that can occur if 2 or 3 filesets are in recovery and a gateway node is involved IJ04520.
- Fix a potential assert when a compressed file is updated in the last data block causing a COW to the snapshot that was recently read IJ04520.
- Fix a performance issue in the AFM environment: small file replication is improved over high-latency networks. The feature can be enabled by setting afmIOFlags=4 IJ04667.
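  A hedged sketch of enabling the feature (assuming afmIOFlags is set through mmchconfig like other AFM tunables; the entry only confirms the name and value):
      mmchconfig afmIOFlags=4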
- Fix a problem where mmapplypolicy crashes or loops indefinitely after apparently completing all the work it should have done. Should be unusual, as it only applies when there was a failure of a "helper" during the execution phase IJ04520.
- Fix a problem in which the PaxosChallengeCheck thread reported as a long waiter in the GPFS log and/or the dump file IJ04663.
- Fix a problem in which gpfs_igetattrs with 1M bufferSize fails with ENOMEM IJ04664.
- Fix a problem in which Cron implementation on Ubuntu skips the files in /etc/cron.d with dots in file names IJ04520.
- Fix an assert that can occur when the data copy offset plus the copy length exceeds the file size IJ04520.
- Fix a potential assert when a compressed file is extended due to a truncation operation beyond its original file size IJ04520.
- Fix a problem in which mmshutdown couldn't cleanup bind mounts and mmmount can't umount bind mounts on older kernels IJ04520.
- Fix mmbackup and mmimgbackup failures that can occur when used with IBM Spectrum Protect because they both use incompatible gskit libraries IJ04669.
- Fix a problem in which the mmcrvdisk command fails if the recoverygroup name contains periods. This fix applies to GSS/ESS customers IJ04520.
- Fix a node crash that can occur when the daemon is shutdown during a mmap write IJ04520.
- Fix a "bgP == __null" assert when doing a truncation operation on a compressed file IJ04520.
- Fix a double memory free issue, which may cause assert like "Assert exp(vHoldCount > 0) in vbufmgr.C:280". This can occur when there is heavy IO and pdisk errors IJ04520.
- Fix CCR client code to avoid segmentation fault during backup command IJ04520.
- Fix code to avoid a segmentation fault during PaxosSharedDisk::readDblocks() in the GPFS mmfsd IJ04520.
- Fix a Signal 6 BufferDesc::traceDirtyBits at SmartPtrs.h:1329 during dumping buffer dirty bits IJ04520.
- Fix temporary file system busy state when mounting the file system right after the file system name was changed IJ04520.
- Fix a logAssert "exp(errP != NULL)" which may happen while accessing gpfs snapshots on a nfs client. The log assert is caused by a race of a file access and a snapshot deletion IJ04520.
- Fix a rare deadlock which can happen between command which changes the cluster manager (like mmchmgr, mmexpelnode, mmchnode --nonquorum) and a quorum lost event IJ04520.
- Fix a kernel panic: RIP: cxiPanic, mmfsd at: SendFlock which can occur during a heavy fcntl load and a loss of a node to cause a quorum loss IJ04520.
- Fix a "logAssertFailed: !isCfgMgr()" error which may happen after a node failure event IJ04671.
- Fix recently introduced slow command performance. It affects server-based clusters that disable mmsdrservPort IJ04672.
- Fix code to avoid unexpected GPFS cluster manager changes to other quorum nodes. This can occur on large clusters (i.e., greater than 500 nodes) during heavy IO and heavy CCR activity (i.e., vputs/vgets/fputs/fgets) IJ04673.
- Fix a problem in which the mmhealth monitoring daemon was running some commands twice IJ04520.
- Correct inconsistent behavior of mmnfs export list command when -Y is used IJ04520.
- The fix will prevent the daemon from crashing if a user uses a large value for the number of subblocks IJ04520.
- Fix Python exceptions in the mmnetverify command when the flood operation is used with a target node whose GPFS node number is greater than 255 IJ04862.
- Fix a problem in the mmadquery command not being able to list AD users if the "AD domain name" does not match the "AD domain shortname" IJ04784.
- Address a problem where cleanup on handlers can happen twice (one called from unmount of the file system and the other from a panic on the FS at the same time), which could result in a bad memory access causing a Signal 11 IJ04520.
- Fix an issue in the AFM environment where a daemon asserts at the gateway node when a file is being removed. This happens when a file is deleted immediately after the creation and the file system is already quiesced IJ04520.
- Fix a problem in which AFM orphan entries cannot be cleaned online on an AFM-disabled fileset IJ04520.
- Fix a problem in which the "mmcesnode suspend -N
- " command
failed to suspend all nodes in the list IJ04087.
- Fix a mmfsd core dump which can occur when mmpmonSocket is receiving events IJ05223.
- This patch must be applied for all systems using Spectrum Scale RAID with write caching drives. If slow disk detection has been disabled by setting the nsdRAIDDiskPerformanceMinLimitPct config parameter to zero, it can be re-enabled by restoring this parameter to its default value IJ04864.
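  A hedged sketch of re-enabling slow disk detection (assuming the attribute accepts the DEFAULT keyword, as mmchconfig attributes generally do):
      mmchconfig nsdRAIDDiskPerformanceMinLimitPct=DEFAULT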
- Fix the false ENOENT error when operating on files in an AFM fileset IJ04520.
- Fix a problem that when pool usage exceeds the warning threshold configured by mmhealth, the message in /var/log/messages says "metadata" but should say "data" IJ04863.
- Fix waiting 1786.523031929 seconds, ProbeClusterThread: on ThCond 0x116D2F80 (0x116D2F80) (StripeGroupTableCondvar), reason 'waiting for SG cleanup'. This can occur during a heavy workload while nodes are being expelled IJ04520.
- Fix a mmfsd assert at: Assert exp(mdiBlockP != __null) ts/vdisk/mdIndex.C 2299. This can occur creating a very large vdisk while a repair is in progress IJ04520.
- Fix a log recovery error and file system umounts on all nodes that can occur during heavy directory create, delete, rename work load IJ05483.
- Fix a problem in which mmnfs export list -Y might list more exports than expected and is inconsistent with output when omitting option -Y IJ04520.
- Change to not mark disks down and unmount the file system when adding new disk paths for disks being used by the file system IJ05258.
- Fix a problem in which the node cannot be started and you see this error: /usr/lpp/mmfs/bin/runmmfs[336]: .[213]: loadKernelExt[674]: InsModWrapper[95]: eval: line 1: 18672: Memory fault. This can occur on SLES12 SP1 after upgrading to kernel 3.12.74-60.64.82-default IJ05073.
- This update addresses the following APARs: IJ04087 IJ04520 IJ04655 IJ04656 IJ04657 IJ04658 IJ04659 IJ04660 IJ04661 IJ04662 IJ04663 IJ04664 IJ04665 IJ04666 IJ04667 IJ04668 IJ04669 IJ04671 IJ04672 IJ04673 IJ04784 IJ04862 IJ04863 IJ04864 IJ04865 IJ04970 IJ05073 IJ05223 IJ05258 IJ05483.
Problems fixed in IBM Spectrum Scale 4.2.3.7 [February 16, 2018]
- Fix a problem in which if inode expansion is interrupted, it may leave nAllocatedInodes inconsistent between sg descriptor and fileset metadata file IJ03086.
- Fix a problem to avoid filling up /var/adm/ras/ with ever growing mmsysmon.log on AIX IJ02566.
- Fix a problem reading files from a snapshot with mmap on AIX IJ02566.
- Fix an issue in the AFM environment where daemon asserts during the AIO writes on AFM filesets IJ02566.
- Fix code to recover corrupted files in the CCR's committed directory during GPFS startup, which were causing other components to fail IJ03085.
- Fix E_VALIDATE errors on ACL blocks after disk outage IJ02566.
- Fix a problem where a user cannot make changes to the afmTarget once the fileset has been created with a wrong mapping name or host name in the afmTarget field IJ02566.
- Fix a race condition where we skip generating a dmapi close event IJ02628.
- Fix a problem in which a one-way reconnect misses sending the cleanup RPC, which results in performance degradation IJ02627.
- Fix node hangs due to the consumption of DMAPI event mailboxes IJ03083.
- Fix a log assert which may happen during mmdelsnapshot if the file in snapshot has DITTO xattr overflow block address IJ02566.
- If HAWC is enabled for a file system for which log recovery has failed, the recovery log is no longer dumped, because the recovery log may contain user data. Also, dump files are now created with more restricted permissions IJ02566.
- Fix an issue in the AFM environment where the gateway node hangs under stress due to mailbox unavailability IJ02566.
- Fix an issue in the AFM environment where daemon asserts while mounting the fileset target path. This happens when AFM is not enabled at home IJ02566.
- This fix adds a SG Panic check for message MBHashFetch and MBHashFetchAsync IJ02566.
- This fix stops the monitor immediately when a node leaves a cluster, to avoid unexpected behavior when the node is re-added to the cluster IJ02566.
- Fix the potential duplicated RPC issue while doing network reconnect IJ02566.
- This fix optimizes closing a file system to enable mmfsck process to start ASAP IJ02566.
- This fix stops recovery of a fileset, if the recovery fails with error 78(TIMEDOUT) IJ02566.
- This fix avoids gpfsReserveDelegation exceptions for kworker return of nfs4 leases IJ02566.
- This fix adds a SG Panic check in data block flush IJ02566.
- Fix an issue in the AFM environment where a gateway node runs into a soft lockup issue with UID remapping enabled. This happens when cache and home are running on different architectures IJ02566.
- Fix a problem where we need to treat NFS bad filehandle errors as STALE during AFM failover/DR changesecondary and thereby continuing replication without having to drop the queue IJ02566.
- This fix increases the default value of the socketMaxListenConnections configuration variable to 8192 on Linux IJ02566.
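  For example, to verify the effective value after the fix (assuming the attribute is reported by mmdiag --config):
      mmdiag --config | grep -i socketMaxListenConnections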
- Fix a problem in which the mmlsquota and the mmsetquota manpages were missing a reference to mmrepquota -t IJ02630.
- Fix the sample script filehist to avoid failing with a divide-by-zero error. This can occur if the file system has different block sizes. It can also occur if there is no /dev IJ03084.
- The command mmkeyserv tenant show now also shows the RKM Id information IJ02566.
- Fix a pdisk state transition issue when the disk drive power-on procedure takes longer than expected. When bringing up this kind of pdisk from the power-off state, sometimes mmchpdisk --revive may report an error and the pdisk goes missing instead of becoming ok IJ02566.
- Fix a problem where a deadlock can happen between application IO to the AFM fileset when the home/secondary site fileset has gone stale IJ02566.
- Fix assert exp(!"oldDiskAddrFound.compAddr(*oldDiskAddrP)") which may happens when preallocating data in an inode file. Note, fallocate() on GPFS file system or write()/fallocate() on a FPO file system can trigger preallocation IJ03162.
- Fix an issue in the AFM environment where some files are moved to .ptrash directory intermittently over GPFS backend IJ03095.
- Fix a problem in which the RGCM keeps causing the RG to be resigned and recovery to fail when the primary node is down IJ03248.
- Fix a problem with Receive Worker threads going CPU bound IJ03087.
- Fix a deadlock involving failed "mmfsctl resume" command, SG panic and disk issues IJ02566.
- Fix the pending log file migration assert that can happen when doing file system restripe operation or adding/deleting/changing the file system disks IJ02566.
- Fix erratic inode expansion behavior and spurious 'Expanded inode space' log messages under multi-node create workloads IJ02566.
- Fix a rare case that truncate() does not set file size correctly. The file size is set to full block boundary incorrectly and the fragment is lost IJ03091.
- Fix a problem in which GNR deadlocks with mmdiag --waiters showing many threads in state "wait for log buffers". This occurs some time after errors were encountered on log tip devices IJ02566.
- Fix an issue in the AFM environment where already existing uncached files are not prefetched correctly IJ02631.
- Fix log assert "Assert exp(totalSteps >= 0) in file workthread.C". It happens when running mmlsfileset -r or mmlsfileset -i command against a file system which has huge inode number or lots independent filesets IJ03234.
- Fix a problem where pmsensor service crashes because there are NULL entries returned from mmpmon for AFM filesets IJ03101.
- Fix a problem in which we see mismatched replicas in the aclFile after disks go down IJ02566.
- Fix a problem that occurs when the command "mmhealth cluster show" is run with an invalid component name IJ02566.
- Fix a problem in which we did not show filesystem information in mmhealth output for AIX IJ02566.
- This fix improves the monitor for handling of failed disks IJ02566.
- Fix a problem in which the installer fails with error "Not all services available on specified nodes" IJ02566.
- This fix improves the output of mmhealth to be more helpful for problem determination IJ02566.
- Fix unnecessary failovers caused by starting multiple instances of the same monitor IJ02566.
- Fix a problem in which the mmsysmon.log was not being rotated daily on RHEL 6/SLES 11 IJ02566.
- Fix a problem in which the 'mmnfs export list' command cannot be piped to 'head' IJ02566.
- Fix an issue in the AFM environment where files are moved to .ptrash during the rename on independent-writer mode filesets IJ02566.
- Fix a mount failure that occurs when someone holds an environment lock IJ02566.
- Fix fcntl performance issue IJ03096.
- Fix a problem with mmsmb exportacl list producing wrong output IJ03166.
- Fix an issue in the AFM environment where gateway node crashes intermittently. Also fix an issue where lookup returns incorrect results IJ02566.
- Fix display issues on the GUI for health data. This can occur if entities are named similarly to health's main components (e.g. "ces" or "gpfs") IJ02664.
- Fix a problem in which network monitoring can leave components in checking state and exceptions in the trace log. This can occur when the name is like loop@loop IJ02566.
- This fix ensures that server reachability is accurately reported for multiple servers with the CES stack configured for LDAP authentication IJ03097.
- Fix a problem where, when the recovery policy fails with an error 2, the policy needs to be rerun with a higher debug level IJ02566.
- This fix adds a protection to prevent a compressed fragment from being expanded without being uncompressed first in some unexpected conditions of having inconsistent compression flags. This fix also replaces an assert with an IO error to minimize the user impact IJ03102.
- Fix an issue in which remote NSD clients drop into a long retry loop during an ESS outage. This can occur if there are multiple ESS building blocks and GPFS replication is enabled in the cluster. When shutting down both servers of an ESS building block simultaneously, remote NSD clients can experience a long retry loop like 'waiting for stateful NSD server error takeover (1)' IJ03098.
- Fix a problem where recovery keeps failing with an error 2 because the AFM recovery script wasn't able to handle directory names in the fileset that had trailing spaces in them IJ03103.
- This fix changes the output of the command "mmcallhome" to be identical to the 5.0.0.0 output IJ02566.
- Fix a problem on Ubuntu 14.04.5 in which mmsysmonitor failed to start IJ02566.
- This fix prevents time outs that can occur when using mmapplypolicy with TCT (mmcloudgateway) when large instance counts are used for migrations on a given cluster node IJ03194.
- Fix a segmentation fault that can occur when you have very long file path names being read into policy generated files IJ02566.
- Fix a mount failure and mmchdisk/mmrestripefs performance issues IJ03636.
- Fix code to call the tiebreakerCheck user exit script even if the CCR is enabled IJ02566.
- Fix an issue in the AFM environment where file listing during a readdir fails for dirty files in local-updates mode. This problem happens with the ganesha NFS server having AFM local-updates mode fileset exports IJ03424.
- This fix improves the trace logging of node-to-node health event propagation IJ03689.
- Fix a problem that when open(O_TRUNC) fails (due to a share conflict), the data gets lost IJ03608.
- Fix a problem in which gpfs.snap stops with an error message when it stores (TARs) log files IJ03898.
- Fix a problem in which gpfs.snap: dmesg -T option doesn't work on SLES11 IJ02566.
- This update addresses the following APARs: IJ02566 IJ02627 IJ02628 IJ02630 IJ02631 IJ02664 IJ03083 IJ03084 IJ03085 IJ03086 IJ03087 IJ03091 IJ03095 IJ03096 IJ03097 IJ03098 IJ03101 IJ03102 IJ03103 IJ03162 IJ03166 IJ03194 IJ03229 IJ03234 IJ03235 IJ03248 IJ03353 IJ03367 IJ03424 IJ03608 IJ03636 IJ03689 IJ03898.
Problems fixed in IBM Spectrum Scale 4.2.3.6 [November 30, 2017]
- Fix potential issue with cesiplist file updates in ccr that can result in messages like "cesiplistLocalSerial is not numeric: ()" IJ02158.
- This fix excludes hidden CCR files from the scheduled callhome data collection IJ00977.
- Fix code to avoid long running CCR synods on different quorum nodes causing long running GPFS 'mmgetstate -a' command IJ00977.
- Fix an issue in the AFM environment where AFM prefetch causes daemon assert if the directories are deleted after the prefetch queueing IJ00977.
- This fix respects the mmdelsnapshot -N option when resuming a DeleteRequired snapshot IJ02220.
- Fix a rare assert that can occur during metanode takeover due to a stalled indirect block left in the cache IJ00977.
- Fix a deadlock caused by the allocation region requests handler. Users would see long waiters on allocation manager cursors when deadlock happens IJ01063.
- Address a problem where the failbackToPrimary --start command tries to delete any snapshots later than the latest snapshot present at the old primary end. When that snapshot is not present at the acting primary, the error needs to be handled rather than ignored, before continuing the failback command IJ00977.
- Fix an inode count leak problem which may happen when the gpfs_iwrite/gpfs_iwritex API fails with ESTALE; the tsrestorefileset utility uses this GPFS API IJ00977.
- Fix hangs and timeouts that occur during snapshot commands in rare failure cases IJ01335.
- Fix a problem where the close issued on a remote NFS mount can get stuck and causes an unlink of an AFM fileset to get stuck IJ00977.
- Fix the assert issue on generation number when flushing or writing indirect blocks. This issue only happens when the clone files were used and deleted IJ00977.
- Fix a disk address assert that can occur when a thread reads a compressed region of a file at the same time when a different node uncompresses the same region IJ00977.
- Fix a vectored DIO (writev/readv) deadlock which may happen if the filesystem is being quiesced IJ00977.
- Fix a potential infinite loop when reading a compressed file with alternating compressed and uncompressed regions IJ00977.
- Add specific handling of SKLM error messages in case a required configuration parameter in the SKLMConfig.properties file is missing. Add a more detailed error message from mmkeyserv command in case a configuration parameter is wrong or missing IJ00977.
- Fix a potential snapshot file data corruption that can be caused by a crash occurring when a compressed file is being deleted from the active file system or a snapshot IJ00977.
- Fix a NULL buffer pointer dereference problem by adding synchronization for accessing the buffer pointer IJ00977.
- Fix a problem in disk verification that wrongly calculated the on-disk stripe group descriptor checksum IJ01325.
- Fix a potential E_HOSTDOWN (80) error when a compressed file is being appended while the node is in the process of becoming a metanode at the same time IJ00977.
- Fix an issue in the AFM environment where recovery fails with error 112 while checking for the deleted directories IJ00977.
- Fix the no space issue when running mmchdisk start command. A similar issue can happen on normal writes IJ01065.
- Add more provision to catch a case where a Queue item is becoming NULL when IO is happening to the fileset and the queue is being flushed IJ00977.
- Fix an assert that can occur when xattrs are heavily used and there is an unusual block size setting IJ00977.
- An inaccurate and unnecessary assert in buffer bitmap processing is removed IJ00977.
- Fix wrong fs struct error format IJ00977.
- Fix a deadlock situation involving lock conflict while stealing a buffer for file system metadata repair IJ00977.
- Fix a logic bug which may cause a log recovery to fail with E_RECOV_INCOMPLETE (code 234), this problem can happen on PARALLEL_LOGRECOVERY enabled builds (since GPFS 4.2.1) if log file size is bigger than 16MB IJ00977.
- Add a debugging utility to calculate checksum values of disk data IJ00977.
- Fix a problem in which mmgetstate -s may not display the correct number of quorum nodes defined in the cluster IJ01064.
- Fix a signal 7 that can occur when a compressed file is expanded in hyper allocation mode IJ00977.
- Address a problem where re-applyupdates should not invoke failbacktoprimary --start when failbacktoprimary --stop has failed due to changes that are detected at the acting primary IJ00977.
- Fix assert "'false' failed" in paxosserver.C:3129 in the GPFS daemon (CCR) that happened during GPFS startup IJ00977.
- Fix an issue in the AFM environment where deleting directories from .ptrash directory fails with directory not empty error. This issue happens when the directory is deleted from home before readdir is performed at cache IJ01066.
- In a dump file, the dump directory size was incorrectly reported as PB when the unit is TB. The problem is now fixed IJ00977.
- Fix a problem in which AIX/NFS servers deadlock trying to recover a client's fcntl lock following the loss of another node in the same cluster IJ01068.
- Fix a problem in which the prefetch command did not work on a file named with special characters (--) IJ01067.
- When the read or write of a log vdisk fails during rebuild operation, use the IO error code to trigger resignation, as opposed to using E_OK IJ00977.
- Fix a potential assert that can occur when a compressed file is being closed after having been deleted and any compressed compression group within the file was partially copied (COW) before the file deletion IJ00977.
- This fix supports callbacks with long list of parameters IJ01069.
- With this fix the output of mmsmb export list -Y and mmsmb config list -Y is changed. It now has an additional colon at the end of the output lines IJ00977.
- Fix an issue in the AFM environment where the cached bit is not set after reading the entire file. This causes the eviction failures and also performance degradation during the write operations IJ00977.
- Address a problem where changing the backend from NFS to GPFS (or vice versa) can cause bad filehandle errors IJ00977.
- Update directory code to avoid excessive recursion that could lead to stack overflow. Stack overflow could cause GPFS daemon to either crash with Signal 11 or get stuck in a signal handler IJ01070.
- Fix a DBGASSERT exp(bytesLeftInStride > 0) which may happen if multiple threads access the same file and (at least) one of them access the file with stride access pattern IJ01114.
- Fix service_running appearing in the mmhealth eventlog without a reason IJ00977.
- This fix is recommended if you see file system hangs requiring reboots to recover IJ00977.
- Fix a GPFS daemon assert that can occur when the inode0 file grows to more than 4B blocks IJ01087.
- Fix a deadlock scenario involving starting a disk at the time of recovery IJ01328.
- Fix an issue in the AFM environment where a prefetch can cause a filesystem quiesce not to happen when home is not responding. This will cause a deadlock at cache cluster until home starts responding IJ00977.
- This fix prevents a segmentation fault in tslspdisk. This fix applies to GSS/ESS customers IJ01327.
- Fix an issue in the AFM environment where ACLs are not updated properly in the cache with directory inheritance. This happens when users do not have permission to update the ACLs IJ00977.
- Fix an assert - exp(ioDataUpdateInProgress == 0 OR DaemonShuttingDown) which may happen if the application does IO with fuzzySequential access pattern IJ00977.
- Fix for gpfs.snap to collect CES address marker files (node-affinity information) IJ01072.
- Fix code to speed-up GPFS CCR read requests and mm-commands when reading from the CCR IJ01086.
- This fix allows customers to run the mmchenclosure command to confirm that a storage enclosure fan is reporting a failure and can be replaced. This fix applies to GSS/ESS customers IJ01330.
- In a mixed cluster where the HSM session manager runs on a 4.2.x node, the access to HSM migrated files from a 4.1.x node now works fine IJ01115.
- Fix a problem in the GPFS mmap code that can result in negative mmap counters. When a file gets memory mapped by a child process, GPFS skipped incrementing the mmap counters when it failed to verify its credentials because the number of group IDs exceeded the limit, but it still decremented the mmap counters at close time. This caused a node to crash because of the negative mmap counter IJ01913.
- Fix a performance problem in mmsmb and a minor problem with its machine readable output IJ01073.
- Fix a problem that can result in a flood of handle_network_problem_info mmhealth events. This can cause the GUI to crash IJ02010.
- Fix the high CPU usage issue on Windows due to a busy loop in a receiver thread when there are some network errors IJ01863.
- This update addresses the following APARs: IJ00977 IJ01063 IJ01064 IJ01065 IJ01066 IJ01067 IJ01068 IJ01069 IJ01070 IJ01072 IJ01073 IJ01086 IJ01087 IJ01114 IJ01115 IJ01325 IJ01327 IJ01328 IJ01330 IJ01335 IJ01863 IJ01913 IJ02010 IJ02158 IJ02220.
Problems fixed in IBM Spectrum Scale 4.2.3.5 [October 12, 2017]
- Fix a log assert "Unable to find cached PG map entry for pg X in vIndex Y". This fix will produce a GNR event log when unable to fix a media error IV99611.
- Fix a problem where an NSD was deleted and created again, and the node tried to reread the disk configuration to update the NSD information, but a network issue caused that to fail; the node then had stale NSD info that led to a mount failure IV99675.
- Fix a mutex locking order problem, which can lead to a deadlock when the file system is being closed IV99611.
- Fix the use count leak on a stripe group to resolve a stripe group cleanup pending issue IV99611.
- Fix an assert 'iter->second' in the GPFS daemon (CCR) that can occur during mmshutdown IV99611.
- Fix a problem in which the CTIME is not updated correctly on files with Ganesha IV99677.
- Fix the 93 seconds delay always seen during GPFS daemon startup on the current cluster manager node IV99611.
- Fix GPFS (CCR) logic to close used socket file descriptors just one time avoiding failed GPFS remote procedure calls IV99611.
- Fix generating unnecessary recalls when truncating migrated files IV99676.
- Fix a problem in which a file system unmount will fail if FileHeat is enabled and snapshots are present IV99611.
- Fix a problem in which the mmnfs export list command fails in an unpredictable manner IV99611.
- Fix log assert when a Windows node is added into a cluster that has an encrypted fs IV99611.
- Fix a ofP->inodeLk.get_lock_state() & (0x2000000000000000LL | 0x4000000000000000LL) assert that can occur when FileHeat is enabled and snapshots are present IV99611.
- Fix a problem in which offline fsck does not repair all indirect block replicas in reserved files, which can lead to more corruption during file system use IJ00397.
- The fix affects customers using mmhealth THRESHOLD SERVICE. Fix a problem in which the mmhealth THRESHOLD state for some nodes never changes from CHECKING. This is for all platforms IV99611.
- Fix a problem in which the default grace period on a Ganesha system is not displayed correctly IV99611.
- Fix for blocked cesFailoverLock (cesFailoverLock: failed with rc 99) IV99611.
- Fix a problem in which accessing a TCT migrated file can result in a hang when thumbnail support is used IV99611.
- Fix a ("delay forever" == "completed") daemon assert IV99678.
- Fix an issue in the AFM environment where incorrect filtering under certain workloads causes writes to be dropped. This prevents replication from completing fully and causes a data mismatch between cache/primary and home/secondary IV99796.
- The fix affects customers that have renamed the cluster and are using the mmhealth THRESHOLD SERVICE. Fix a problem in which the SYSTEM HEALTH eventlog contains unhealthy alerts for pool_data, pool_metadata, and inode components even though they don't have capacity utilization problems. This can occur on any platform IV99611.
- Fix a segmentation fault that happens when the file system rebalancing fails to open the file system IV99611.
- Fix an issue in the AFM environment where incorrect entries in the prefetch list file (for example, '.' and '..') cause directory block corruption, because AFM permits a file named '.' to be created without validating the input IV99679.
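  As a hedged illustration, a prefetch list file can be sanitized before use so it
  contains no bare '.' or '..' entries (the file system, fileset, and path names
  below are placeholders):
    # Drop any '.' or '..' lines from the raw list
    grep -vE '^\.{1,2}$' /tmp/raw.list > /tmp/prefetch.list
    # Prefetch the listed objects into the cache fileset
    mmafmctl fs1 prefetch -j cacheFileset --list-file /tmp/prefetch.list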
- Fix a performance degradation problem when running tar from an NFS client IV99709.
- Correct the %filesetName that is passed to the callback command for the usageUnderSoftQuota callback event. This can only occur on FILESET quota types IV99680.
- Fix quota usage accounting in a file system with strict allocation set to "whenpossible", when not all data replicas can be allocated due to a lack of space or failure groups IJ00031.
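  For context, strict replica allocation is controlled by the file system -K
  attribute; a minimal sketch (fs1 is a placeholder device name):
    # Show the current strict-replication setting
    mmlsfs fs1 -K
    # Allow allocation to proceed when not all replicas can be placed
    mmchfs fs1 -K whenpossible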
- Fix a deadlock seen while using NFS with TCT IJ00094.
- Fix possible file system corruption caused by a network reconnect IJ00398.
- This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IJ00398 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796.
Problems fixed in IBM Spectrum Scale 4.2.3.4 [August 24, 2017]
- Avoid an erroneous FSSTRUCT error in rare cases after an SG panic.
- Fix a problem in which ESS server node deadlocks with many threads showing 'wait for log buffers'.
- Fix a memory corruption issue that can occur during/after a reconnect.
- Fix a logAssert "!IsMemoryMappingFree" which is caused by a race between mmshutdown and 'tsctl nqStatus'.
- Fix a possible GPFS daemon assert that can occur while running the mmdelsnapshot command. The assert can happen when prefetch is reading from a snapshot that is being deleted.
- Fix an issue in the AFM ADR environment where a secondary mount failure causes a kernel crash.
- Ensure a copy of the keystone and auth config is created when the Object protocol is disabled and uninstalled from the protocol environment.
- Fix an issue in the AFM environment where an unresponsive remote mount causes synchronous operations such as reads to fail intermittently.
- Address a problem in resync/failover/changeSecondary where recreating a deleted file at home/secondary might cause an invalid memory access and crash the daemon.
- Fix a condition where cNFS on SLES12 or later fails to restart statd.
- Fix deadlock that can occur during inode cleanup in Linux kernel 3.13 and later.
- Address a problem to ensure that applyupdates is not started until failback --start has completed successfully.
- Fix an incorrect assertion which can go off when the file system manager is brought down while running one of the following commands: mmrestripefs, mmdeldisk, or mmrpldisk.
- Fix an issue in the AFM environment where a fileset unlink or an unresponsive remote mount can cause a deadlock.
- Improve snapshot command error reporting when batching is used.
- Fix a problem with the GPFS file system metadata scanning function in IBM Spectrum Scale 4.2.3.0 - 4.2.3.3 which may result in file system data or metadata corruption on certain failures, such as running out of GPFS pagepool memory while running mmrestripefs, mmdeldisk, mmrpldisk, or mmadddisk -r.
- Return ESTALE instead of ENOENT when the file referenced by a FileHandle is deleted.
- Fix a kernel assert caused by missing buffer lock checking.
- Address a problem where applyUpdates generates operations on files/directories that were removed from the primary but never played to the secondary, and applyUpdates later fails to pull such files/directories back.
- Fix a problem in which GPFS was returning EBADF when Ganesha provided an fd which is not a GPFS fd.
- Fix an mmfsd daemon crash that occurs after upgrading the GNR code level and issuing mmchconfig release=LATEST.
- Fix an issue in the AFM environment where the mmgetstate or mmdiag commands cause a daemon crash if handlers are being disabled.
- Fix mmsetquota to work with a non-standard username.
- Fix a deadlock issue in the AFM environment when a new gateway node joins the cluster and takes over the fileset from an existing gateway node where the workload is running.
- Fix a problem in which a change in lockd behavior on SLES12 SP2 may cause it to reboot while recovering another cNFS node.
- Fix mmlsquota -u to work with a non-standard username.
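  For example, querying and setting quota for a username that contains a dot (the
  names and limits below are placeholders):
    # Report user quota on file system fs1
    mmlsquota -u first.last fs1
    # The corresponding mmsetquota form (see the related mmsetquota fix above)
    mmsetquota fs1 --user first.last --block 10G:12G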
- Fix for missing arping command path declaration for CentOS.
- Fix mmlscluster to correctly output the "Remote shell command" line.
- This fix reduces the snap processing time for clusters which have the Object protocol deployed with the Unified File and Object feature enabled.
- Fix a problem in which mmsmb exportacl list shows SID instead of user name for all users but the first.
- Fix a memory leak in GNR that may cause mmfsd heap memory usage to increase over time, particularly when the workload does many small writes. The problem occurs in an ESS or GSS environment.
- Fix a daemon crash in the AFM environment where a replication error or a fileset unlink causes a memory handler to be accessed incorrectly.
- Fix a bug where a failure to execute the mmdevdiscover script resulted in all pdisks temporarily losing their I/O paths. This caused the workload to pause while the paths were recovered. Sometimes it caused the recovery group to fail over to the backup node. In a few instances, it resulted in the unmount of the file system, requiring manual intervention to restore service.
- This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063.
Problems fixed in IBM Spectrum Scale 4.2.3.3 [July 27, 2017]
- Fix an issue in the AFM environment where lookup and metadata operations on the same file from different nodes cause a daemon assert.
- Allow a user to specify the afmRPO interval in weeks (W), days (D), or hours (H).
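  A hedged example of setting the RPO interval with the new unit suffixes (the file
  system and fileset names are placeholders):
    # Set a 24-hour recovery point objective on an AFM DR primary fileset
    mmchfileset fs1 primary1 -p afmRPO=24H
    # Equivalent forms in days or weeks
    mmchfileset fs1 primary1 -p afmRPO=1D
    mmchfileset fs1 primary1 -p afmRPO=1W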
- Fix a fcntl revoke handler exception that can occur after an EIO error.
- Fix a pdisk firmware version attribute issue. After the pdisk firmware level is changed, mmlspdisk still shows the old version which may mislead the administrator.
- Change EIO to ESTALE for the open operation on a file that was deleted.
- Fix a race condition that causes a command to fail with "invalid version on put".
- Don't allow kernel modules to be cleaned up when removing gpfs.gpl if gpfs.gplbin is currently installed.
- Fix an Assert: exp(dm != inv) L-813 in ../fs/fsop.C which can occur when trying to resend a read event.
- Fix an mmfsd crash due to a signal 6 (abort). This can happen when removing socket connections in a CCR environment.
- Fix a very rare fault that can occur during heavy directory update workloads.
- Fix a problem in which mmchfirmware --type storage-enclosure fails if running in adminMode=central where only one node has ssh privileges. This fix applies to GSS/ESS customers.
- The mmlsmount command has been changed on all platforms. The change affects the output format of the -Y argument only when an IPv6 address is used.
- This change does not allow mmchcluster -p LATEST on a CCR enabled cluster.
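  Whether a cluster uses CCR can be checked before attempting mmchcluster -p; a
  minimal sketch (output wording may vary by release):
    # A CCR-enabled cluster reports "Repository type: CCR"
    mmlscluster | grep -i "repository type"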
- This fixes a problem in which unnecessary allocation manager cursors are being consumed.
- Fix an assert "ioStatsP == __null" that can occur when creating a file system with "-v yes" option, after the "fastest" readReplicaPolicy is enabled.
- Fix an err 112 rename failure that can occur during recovery for an IW fileset.
- Fix an AIX kernel crash due to assert "freeing vnode not on gnode list".
- Fix an issue in the AFM environment where an unresponsive target causes a queue to be dropped during the attribute setting.
- Fix a problem where a verbsRdmaSend-enabled node sent excessive nsdMsgRdmaPrepare messages to an AIX node.
- Fix a replica mismatch problem caused by mmrestripefs -b wrongly resetting the missupdate flag.
- Fix CCR and/or mmsdrcli-RPC request errors that can occur during authentication of the incoming socket connections.
- Address a problem where renames across directories do not reset the dirty bit, which over time leads to a large list of dirty directories; as a result, recovery on AFM filesets might take longer to scan.
- Fix an issue in the AFM environment where UIDs in ACLs are not remapped during replication over the NSD protocol when UID remapping is enabled.
- When an AFM fileset is to be converted to a regular independent fileset, first check for incomplete directories, uncached files, and orphans. If any are found, inform the user to run prefetch prior to the conversion.
- Fix a condition where mmautoload may hang.
- Fix a mmlogsort failure that can occur when mmlogsort attempts to query the time zone information on a node that is down.
- Fix a gpfs.snap command failure in a sudo wrapper environment if the legacy log timestamp format is used.
- Fix a divide by zero problem, when running mmrestripefs, which is specific to a file system using directly attached disks only with no NSD servers defined.
- Fix a dynassert 'mmapFlushSXLock.isLockedShared' which may fail as a secondary failure while daemon is shutting down.
- Fix a problem in which a CES IP could not be removed from a node. This can occur when problems arise during a CES IP move or failover (e.g. network issues, CCR issues, quorum loss). Subsequent runs of mmcesnetworkmonitor did not fix this, and the IP remained active on a node where it should not be.
- This fix adds more group memberships (up to 2048) on AIX.
- Fix a "freeSpace != __null" assert. This issue could only happen when doing file system rebalance after suspending some disks.
- Fix a problem in which you get an "Unable to create file in fileset" error even if the inode limit is not reached, which is most likely to occur if the user fills up the fileset from a single node.
- Fix mmcommon test scpwrap.
- This update addresses the following APARs: IV97601 IV97676 IV97677 IV97678 IV97680 IV97681 IV97682 IV97683 IV97685 IV97693 IV97808 IV97836 IV98052 IV98053 IV98054 IV98058.
Problems fixed in IBM Spectrum Scale 4.2.3.2 [June 21, 2017]
- Address a problem where AFM recovery stalls on a read of an IW fileset when it waits to fetch the file from home after the recovery completes.
- Improve a conditional CCR update for the CES IP list file.
- Fix a problem that causes a RenameHandler long waiter. This can occur if PIT is in progress.
- Fix a Ganesha crash that can occur when the user enters a string which contains a colon in any mmnfs command that requires a client option or client list string.
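  As an illustration, a client string containing colons (an IPv6 network) of the
  kind that previously triggered the crash; the export path and address below are
  placeholders:
    # Add an NFS export restricted to an IPv6 client network
    mmnfs export add /gpfs/fs1/exp1 --client "fd00:1234::/64(Access_Type=RO)"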
- Fix an assert that can occur when changing gateway nodes to non-gateway nodes while operations are being performed on an IW fileset and then the non-gateway nodes are turned back into gateway nodes.
- Fix a rare deadlock that can occur between a thread handling mmap and a thread handling a memory map pagefault.
- Fix an E_STALE failure that can occur during a DMAPI dm_read_invis.
- Fix long waiters that can occur on a very busy system doing background snapshot deletion.
- Fix a case where GPFS skipped shrinking the last data block, which causes excess space to be consumed.
- Improve the mmsetquota error message that occurs when a block limit is specified in the 'T' unit and a value larger than 909T is given.
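  For reference, a block limit specified with the 'T' suffix looks like this (the
  device, fileset, and limits are placeholders):
    # Set fileset block quota: 800T soft limit, 950T hard limit
    mmsetquota fs1:fset1 --block 800T:950T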
- Fix an assert that can occur when mmcheckquota and mmrepquota are passed fileset ids from deleted filesets.
- Fix an issue in the AFM environment where the afmHardMemThreshold configuration value is not honored and more memory is used than specified.
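  afmHardMemThreshold caps the memory a gateway node may use for queued AFM
  operations; a minimal sketch of adjusting it (the value is only illustrative):
    # Raise the AFM hard memory threshold to 8 GiB
    mmchconfig afmHardMemThreshold=8G
    # Verify the resulting value
    mmlsconfig afmHardMemThreshold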
- Increase the wait time for commands to execute before failing.
- Correct formatting of large call counts reported by "mmfsadm vfsstats".
- Fix long waiters that can occur after a file system panic on a very busy system.
- Fix a problem in which inodes become Busy after unmount with NFS and immutable files.
- mmkeyserv: Make it possible to set certain attributes to the default with the use of the "delete" or "default" keyword.
- Fix a problem in which mkdir, create, and resync operations can fail during revalidation from cache/primary to home/secondary on newer kernels (3.18 or later).
- Fix a problem in which GPFS cannot handle errors that occur when a DM application was unable to retrieve data due to an offline tape.
- Suppress repeated message "Expanded ... inode space N from X to Y inodes" in mmfs.log.
- Fix a rare quota management deadlock caused by error conditions such as out of disk space.
- Fix an issue in the AFM+HSM environment where the resync/failover/changeSecondary commands fail to replicate migrated files.
- Fix an issue in the AFM environment where a fileset force unlink could cause the daemon to crash.
- Address a problem where a gateway node can assert/crash when having more than 1024 active fileset operations occurring across different filesets on a single gateway node.
- Fix a problem in which gpfs.snap may not gather mmfs logs on AIX nodes correctly.
- Fix a clone parent file deletion performance issue.
- Fix a problem in which fsetxattr failed with ENOENT when using an fd of an unlinked file.
- Fix the Assert exp(fileLockHeld != LkObj::nl) in fetchBufferM() that can occur when compression is being used.
- Fix a problem where a DMAPI invis read/write fails with an err 22 when called from a non-session node.
- Fix a problem in which the mmnetverify command does not correctly verify remote addresses when running many tests in parallel.
- Fix a very rare deadlock that can occur after running snapshot commands.
- Fix a policy problem which causes the LOWDISKSPACE callback to not trigger after a fs manager takeover when the old fs manager fails because of an abort or a lost connection.
- This update addresses the following APARs: IV96355 IV96416 IV96417 IV96418 IV96419 IV96420 IV96425 IV96426 IV96429 IV96472 IV96473 IV96474 IV96476 IV96482 IV96483 IV96487 IV96488 IV96585 IV96761 IV96762 IV96763 IV96764 IV96783 IV96786 IV96791.
Problems fixed in GPFS 4.2.3.1 [May 16, 2017]
- Fix a Ganesha crash caused by an applyUpdate.
- Fix a ccrio initialization failure (err 811) when changing the daemon-interface.
- Fix a rare segmentation fault in the mmgetstatus command.
- Fix a SIGBUS error that can occur during a mmap read on a snapshot file.
- Fix a problem in which we see a flood of "failed to scrub vdisk" log messages when a GNR node experiences quorum loss. This is for ESS/GSS.
- Fix a rare race between unlink, lookup and token revoke which causes kernel crash in d_revalidate.
- This fix makes sure a Ganesha request references a valid GPFS file system.
- Fix a system hang that can occur when a file system is suspended while doing a mmap.
- This fix rejects unreasonably large requests to preallocate inodes immediately with ENOSPC.
- Fix a directory rename issue with IW filesets that can occur if the rename target is an existing directory.
- Fix a fault that can occur when restripe runs while the SG is not mounted on all NSD nodes.
- This fix restricts the afmMaxParallelRecoveries config value to the range 0 to 128.
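  A hedged example of setting the now-validated parameter within its accepted
  range:
    # Limit parallel AFM fileset recoveries to 4 (valid range: 0-128)
    mmchconfig afmMaxParallelRecoveries=4
    mmlsconfig afmMaxParallelRecoveries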
- This fix removes the unnecessary error message "cannot open /proc/net/tcp6" when shutting down GPFS.
- Fix a problem with not properly handling quotas in an AFM environment. This can occur when you have very large hard and soft limit values.
- Fix a "exp(!sgP->isSGMgr())" assert that can occur when you delete a file system and then create a new file system with the same name at the same time.
- Fix an err 112 that can show up in the mmfs logs when mmchnode --gateway is executed.
- Fix a kernel crash that can occur while attempting to mount a loop device backed by a file in a GPFS file system or while using a GPFS file system file as a LIO backend.
- Address a problem where applyUpdates continues to run even if the fileset at the old primary is unlinked or the mmfs daemon has been shutdown.
- Fix an outband resync failure that can occur if a recovery is triggered by deleting some files in a directory and the directory itself. This is an AFM/DR environment.
- Fix rename conflicts that can occur in SW/DR filesets.
- Update log code to prevent a log recovery error when the log file becomes ill-replicated. This could happen on a file system with -K set to no when there is not enough disk space for full replication.
- This fix uses a new interface that reduces multiple retries each time a lock is freed and there are multiple waiters for the lock.
- Fix an assert that can occur with a DR fileset and the file system is suspended.
- Fix a bug that required a large amount of free space in /var/mmfs to run change commands.
- Fix recovery failure err 17 when psnap0 deletion fails.
- Fix a daemon assert that can occur in an AFM environment where the mmfsd daemon fails to start repeatedly with a DMAPI enabled filesystem at a gateway node.
- Address a problem where trying to queue a writeSplit message to the helper gateway's queue can fail with an error 28 (ENOSPC).
- Fix an issue which returns EACCES (errno 13) while running mmapplypolicy when there is a mounted NFS file system that has the same name as a GPFS file system.
- A fastpath optimization defect can result in an internal error being returned to the user when it is safe to continue without entering the fast path.
- Install this fix if you suffer from mmapplypolicy/tspolicy hanging after otherwise finishing all work.
- cNFS: fix a problem with /usr/sbin/rpcinfo not found in SLES12 or later.
- Fix a failure in Object Authentication configuration with Active Directory or LDAP. This fix is only required if Object is being configured with Active Directory or LDAP and the DN of the Swift service user (specified in --ks-swift-user) is more than 79 characters.
- Fix a problem with ESS disk replacement in which the mmchcarrier command may wipe out the pdisk location code. The problem will prevent a subsequent mmchcarrier command from proceeding without a valid location code.
- Fix a problem in which a GPFS command may wrongly terminate another process.
- Fix a rare deadlock problem caused by stream write (enableRepWriteStream=yes).
- Update log recovery code to avoid GPFS daemon assert after detecting invalid directory block during log recovery. Code has been changed to log a FSSTRUCT error and fail the log recovery so offline mmfsck can be run on the file system.
- Fix an mmfsd crash (incompleteOk assertion) that occurs when the number of files in the committed directory doesn't match the number of files in the CCR file list during a new CCR file update request.
- This update addresses the following APARs: IV94991 IV94992 IV94994 IV94995 IV94996 IV94997 IV94998 IV95015 IV95021 IV95230 IV95557 IV95643 IV95925 IV96037 IV96163.