Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 4.2.x applies to all supported platforms.
Problems fixed in IBM Spectrum Scale 4.2.3.10 [July 27, 2018]
- Fix an assert exp(de.getNameP()[0] != 0) in line 654 of file /project/sprelbmd1/build/rbmd11628b/src/avs/fs/mmfs/ts/fs/direct.C which can occur during fsck. IJ07096
- Fix an assert: !"search long and hard in getSnapP" which can occur during fsck. IJ07096
- Fix an assert exp(readRepIndex == -1 || (readRepIndex >= 0 && readRepIndex < 4 && daArr[readRepIndex] != (*DiskAddr::invalidDiskAddrP))) in line 8570 of file /project/sprelttn/build/rttn1632c/src/avs/fs/mmfs/ts/pfsck/cache.C which can occur during fsck. IJ07096
- Fix a sig8 on FsckDirCache::readBlockDA which can occur during fsck. IJ07096
- Fix an Assert exp(!"Assert on Structure Error") in line 362 of file /project/sprelttn424/build/rttn4241730c/src/avs/fs/mmfs/ts/logger/Logger.C which can occur after running fsck. IJ07096
- Fix an Assert exp(!"Assert on Structure Error") in line 365 of file /project/sprelttn423/build/rttn423s005a/src/avs/fs/mmfs/ts/logger/Logger.C which can occur during a very stressful file system restore. IJ07096
- Fix corruption that can occur during very stressful create, list, delete snapshots and filesets. IJ07096
- Fix an assert ("Assert on Structure Error, called from the kernel") in line 693 of file /project/spreltac500/build/rtac5001746a/src/avs/fs/mmfs/ts/logger/Logger. This can happen after a node failure. IJ07096
- Fix a "mallocSize < SEGSIZE" assert that can happen on AIX if a very large ACL file exceeds SEGSIZE. IJ07096
- Fix a problem in which "mmfsadm dump nsd" shows incorrect data. This can happen when there is a heavy workload on the file system, its NSDs have multiple servers, and the primary server has failed. IJ07175
- Fix an ESS and GSS deadlock that can occur during an RG failover. IJ07096
- Fix a problem in which a quorum node cannot join a cluster. This can occur when there are very many log asserts on the quorum nodes. IJ07096
- Fix an issue in the AFM environment where a gateway node crashes if a remote is not responding. IJ07096
- Fix an mmlsquota endless loop that can occur if the mmlsquota command has a syntax error. IJ07176
- Fix an Assert exp(!synchedStale) in line 2770 of file bufdesc. This can occur during an uncompress failure. IJ07096
- Fix the assert exp(secSendCoalBuf != __null && secSendCoalBufLen > 0) that can occur during secure sending. IJ07096
- Fix deadlock FileBlockWriteFetchHandlerThread: on ThCond. This can occur when there is a remote mount. IJ07096
- Fix a problem in which mmrestoreconfig failed because subcommand mmcheckquota failed with E_NODEV. IJ07096
- Fix logAssertFailed: thisSnap.isSnapOkay() || thisSnap.isSnapEmptying() || thisSnap.getSnapId() == sgP->getEaUpgradeSnapId() which can occur deleting a snapshot.
- Fix a hang, Waiting 10397.9894 sec since 19:32:18, monitored, thread 28099 CommandMsgHandlerThread: on ThCond. This can occur when the file system is being suspended. IJ07096
- Fix a kernel crash Oops: Kernel access of bad area, sig: 11 which can occur when a GPFS filesystem is exported through NFS and there is a heavy locking work load. IJ07096
- Fix "Assert exp(!synchedStale)" that can happen during access of compressed files. IJ07096
- Fix Assert !addrDirty OR synchedStale OR allDirty bufdesc.C 7416. This can occur when compression is involved. IJ07096
- Fix a deadlock SGExceptionAdjustServeTMThread on(MsgRecordCondvar) which can occur during the unmount of the file system because of a Stripe Group panic. IJ07096
- Fix assert exp(Remote ASSERT from node : SGNotQuiesced snap 9/0 ino 2851912 reason 1 code 0) in line 3447 of file /project/spreltac501/build/rtac5011814e/src/avs/fs/mmfs/ts/cfgmgr/sgmrpc.C. This can happen taking snapshots of AFM file sets. IJ07096
- Fixed hang condition on Linux when mmfsd is executed from a shell. Msg handler sgmMsgTMServe: on ThCond. IJ07096
- Fix an unexpected deadlock/breakup which occurred after long waiters disappeared. IJ07096
- Fix a deadlock with long waiters stuck in either 'makeFreeLogSpace wait for log wrap' or 'Waiting for UpdateLogger data update or read to complete'. This could only occur on a HAWC-enabled file system. IJ07096
- Fix a GPFS assert "logAssertFailed: !isRead" that can happen when doing data prefetch. IJ07178
- Fix Assert exp(slotsFree + slotsUsed == totalSlots) in line 1112 of file /project/spreltac501/build/rtac5011816d/src/avs/fs/mmfs/ts/pfsck/checkacl.C that can happen during mmfsck. IJ07096
- Fix a rare case logAssert "Assert:(indIndex & 0xFF00000000000000ULL)==0 IndDesc.h" which can happen when writing beyond EOF of a file which has many EA entries. IJ07096
- Fix a problem in which, when the NSD type is updated via the mmchconfig updateNsdtype command, the NSD type of the tiebreakerdisks in the CCR cluster was not being updated. IJ07096
- Fix a Sig 11 in tmMonitorStorageLevelThread. IJ07180
- Fix a problem where GPFS can potentially get stuck on dumping kernel thread stack during file system panic. IJ07177
- Fix a problem in which certain large values for mmchattr --compact are incorrectly rejected. IJ07096
- Fix LOGASSERT(getPhase() == snapCmdDone) which can happen if more than one request to delete the same snapshot run concurrently and the fs SGPanic during the delsnapshot process. IJ07096
- This fix makes pCacheStaleCheckTimeout configurable through mmfsadm afm staleCheckTimeout. IJ07096
- Fix a problem in which Prefetch does not emit a list of the files that failed to be pulled to cache from home. IJ07181
- Address an issue in Prefetch (migration) where filenames containing '\\' and '\\n' characters need to be handled better. Also address an issue in tsbuhelper to generate list files better at home when filenames contain '\\n' character in them. IJ07096
- Fix a problem in which mmchfirmware prints extraneous output if a vendor-supplied firmware loader sends output to stdout. This fix applies to Lenovo GSS/DSS customers. IJ07096
- Fix a timing problem where sometimes the mount of file system on local node fails because the node is leaving a remote cluster. IJ07096
- Fix a bug in "mmkeyserv server update" that may cause the encryption policy to fail to open the key store file. IJ07096
- Fix a crash of the sysmonitor. This can happen if gpfs.callhome is uninstalled. IJ07096
- Fix a problem where mmchdisk incorrectly requires disks in the 'system.log' pool to be 'dataOnly'. IJ07096
- Code fix to solve a problem where adding a DA to an existing recovery group results in the DA being stuck in the in-transition state until the daemon is restarted. IJ07096
- Manpage for the 'mmchconfig' command ('subnets' section) has been updated to describe the limitation in the number of subnets a given node may be part of. IJ07096
- Fix the AIX kernel crash problem happening during I/O against inconsistent compressed files. IJ07182
- Fix an issue in the AFM environment where, if the root user has a supplementary GID greater than 42949676, replication might fail and messages are requeued. IJ07183
- Fix a problem in which recovery is being triggered on a fileset that is in a stopped state. IJ07096
- Fix a viInUse assert that can occur during an NFS workload. IJ07184
- Fix a problem in which AFM recovery fails with error code 2. This can happen if a directory has special characters (like "?{?J?X?`?W?b?Y"). IJ07417
- Fix a problem in which on RHEL7.5 file operations like functions readdir and iterate fail and cause EBADHANDLE (521) errno for some kernel NFS scenarios. IJ07096
- This fix is for customers that use a mixed cluster with a minimum release level lower than 4.2.2-0. It fixes the machine-readable output of the mmhealth node show command; the incorrect output also caused false or inconsistent information in the GUI. IJ07096
- Fix logAssertFailed: "useCount >= 0" in file alloc.h. This can occur if you run mmrestripefile -c repeatedly. IJ07418
- Fix an issue in the AFM environment where control file setup used for transferring EAs/ACLs might hang if remote is not responding. This causes node to run out of RPC handler threads to handle all the incoming messages. IJ07752
- Fix a problem in which mmapplypolicy -L 3 shows garbage characters. IJ07936
- This update addresses the following APARs: IJ07096 IJ07175 IJ07176 IJ07177 IJ07178 IJ07180 IJ07181 IJ07182 IJ07183 IJ07184 IJ07269 IJ07417 IJ07418 IJ07752 IJ07936.
Problems fixed in IBM Spectrum Scale 4.2.3.9 [June 8, 2018]
- Fix a deadlock that can occur on a rapid-repair-enabled file system when there are down disks, data replicas are saved on the down disks, and the mnode of the file is undergoing a takeover. IJ06045
- Remove an assert for the code paths of striped recovery logs. IJ06015
- Fix critical command failures that can occur during thread pool exhaustion. IJ06046
- Fix a structure error that can occur when getting block disk addresses for snapshot files. This issue could only happen when GPFS APIs are being used to access the file's data. IJ06015
- Fix "Assert exp(0)" that can occur when running mmfsck in non-verbose mode and mmfsck detects corruption but does not run to completion and errors out in between due to an error like the participating node failing. IJ06015
- Fix the error "AFM: Cannot find snapshot link directory name for exported file system at home for file system". IJ06015
- Fix assert exp(inodeFlushFlag) openinst-vfs.C 1560. This can occur creating and deleting filesets while taking snapshots. IJ06015
- Fix an issue in the AFM environment where a gateway node crashes due to the race between threads doing lookup and revalidation of files in the same directory. IJ06100
- Address a problem where a race between two threads can cause the afmctl file FD to be invalidated, causing a daemon crash on the gateway node. This can occur when running many setXattr operations on the fileset when the secondary fileset is stale. IJ06015
- Fix an issue in the AFM environment where resync/changeSecondary commands might not copy the dirty data from the cache or primary to home or secondary. IJ06015
- This fix sets "pmmonitor=N" in ZimonCollector.cfg. IJ06015
- Fix the potential data corruption issue when compression and LROC feature are being used. IJ06015
- This fix improves the performance of gpfs.snap on large clusters.
- Fix assert exp(!addrdirty or syncedstale or alldirt bufdesc.c 7350. This was seen in a Ganesha/NFS environment. IJ06047
- This fix adds support for the new configuration option "grantOwnerDeletePermission". IJ06049
- Fix a problem in which a hanging NFS process (Ganesha) was not clearly detected. IJ06015
- Fix a kernel crash issue due to the assert "err != E_NOT_METANODE" that can occur when mmap reading a compressed file. IJ06015
- Spectrum Scale CLI was fixed to allow setting 'desired' for smb option "smb encrypt". IJ06015
- Address a problem where STOP command on the fileset (mmafmctl stop) can cause deadlock when there's a parallel Write in the queue taking the SplitWrite path. IJ06015
- This fix changes the mmhealth node show -Y output so that the GUI is again able to process specific health events that were not in the right format in a mixed cluster environment. This affects only clusters with a cluster minRelease level lower than 4.2.0 and nodes higher than or equal to 4.2.0. Affected events include, for example, all pool_ and pool- events of the FILESYSTEM component. IJ06218
- Package gpfs.gskit updated to version 8.0.50.86. IJ05666
- As clone is not supported for DR fileset, mmclone snap now will return an error on use. IJ06015
- Fix false positive mmhealth event ads_failed on POWER systems under high load. IJ06015
- Fix an issue in the AFM DR environment where deleted files are not copied to the RPO snapshot at secondary if primary is running recovery out of the recovery+RPO snapshot. IJ06015
- Fix an issue in the AFM environment where failover/resync runs slower for write operations due to connecting the file dentry to the parent. IJ06015
- Fix an issue in the AFM environment where usage of functions dm_read_invis() and dm_write_invis() on AFM filesets results in data corruption. IJ06015
- Fix a problem in which "mmcallhome group auto" only created 1 group and then returned an error. This can occur on SLES nodes. IJ06015
- Fix a problem in which call home stopped sending daily data on Ubuntu after an upgrade. IJ06015
- Fix an issue in the AFM environment where AFM reads the file from home incorrectly if the data replication factor at cache is greater than one. IJ06269
- Fix AFM to be able to migrate EAs and ACLs from a read-only export at home during AFM local-updates and the fileset is read-only. IJ06364
- Fix a problem in which mmcrfs --profile returns an error when both defaultMetadataReplicas and maxMetadataReplicas are specified in the profile. IJ06207
- Fix an mmimgbackup assert that can occur when there is a symbolic link and the full pathname length is 1023 bytes. IJ06224
- Fix a FSErrBadCompressBlock structure error. This problem only happens on small compressed files after expanding it and the compressed file has holes. IJ06015
- Fix a problem in which running with an invalid gpfs.smb package within a CES cluster can cause issues up to and including data corruption for files accessed via the SMB protocol. IJ06015
- This fix introduces a second default encryption configuration string to improve performance. IJ06015
- Fix IO hangs that can occur after rebooting a node. IJ06015
- Fix a signal 11 in PaxosServer::handleCheck. IJ06015
- Fix error in `/usr/lpp/mmfs/bin/mmfsd': double free or corruption (!prev): signal 6 that can occur running mmfsck. IJ06015
- Fix "mmsysmonc: error: no such option: -1" for callhome callbacks. IJ06015
- Fix a problem in which ACL changes at the server were not reflected to the NFS client. No upcall was generated for ACL changes made by mmputacl/mmeditacl, so Ganesha was not able to revalidate the inode and returned the old cached ACLs. IJ06015
- Fix a mmfsd shutdown that is caused by running out of memory. This can occur on a very heavy load of fsync and stat calls. IJ06544
- Fix a problem in which gpfs_set_share was being improperly rejected. IJ06513
- Fix translation of POSIX ACLs applied on a GPFS Unix node to access permissions on a GPFS Windows node. IJ06545
- Fix a problem in which "mmhealth cluster show nodes -v" output was wrong. This can happen when the cluster state manager and the cluster manager are the same node and the node does a shutdown. IJ06015
- Fix a problem in which mmlsdisk failed. This can occur when /var/mmfs/etc/ignoreAnyMount exists. IJ06514
- ESS command enhancements: Added an optional gather entry specifier: ess = {all | ess-only | not-on-ess} with default = all. Added ESS-specific commands to daily.conf and weekly.conf. Deleted DefaulsDaily.ess.conf and DefaultWeekly.ess.conf. Deleted CallHome/src/doc/gpfsCallHome_GatherInfo.xls. IJ06015
- Fix a problem of tracing self starting when CCR is disabled, adminMode=allToAll, and mmsdrservPort=0. IJ06753
- Correct the free space reported in 'df' command that was including fragmented disk space. 'df' is supposed to only report free full disk blocks.
- This fix corrects an error in the mmlsenclosure command. This fix applies to GSS/ESS customers that have DCS3700 storage enclosures. mmlsenclosure is not displaying DCS3700 drawer control. IJ06015
- This fix disables writing protocol tracing debug messages to mmfs.log, since they were irrelevant to the user and inconsistently formatted. IJ06015
- Fix a problem in which all NFS activity is stopped. This can occur if node affinity is enabled and CES IPs have node affinity tags. IJ06015
- Fix a logAssertFailed: !"Assert on Structure Error" and an unexpected file system unmount that can occur during a log wrapping while there is directory expansion. IJ06864
- This fix increases the maximum supported number of extra IP addresses to 64. IJ06770
- Fix the issue that the extra IP addresses cannot be propagated to other nodes. IJ06015
- Fix "mmcallhome info change" overwriting callhome settings, introduced by other mmcallhome commands, if several callhome commands were executed simultaneously. Fix "mmcallhome group add" creating dummy local group settings, which made the call home setup very confusing (some settings are global and some local). Fix "mmcallhome group add" allowing to add nodes without call home installed to call home groups. Fix "mmcallhome group add" allowing to set nodes without ECuRep connectivity as call home nodes. IJ06015
- Work around a GNR VCD (vdisk configuration data) inconsistency issue in which two vtrack tracks may map to the same physical location in very rare cases when recovering free ptracks, which causes RG recovery to fail with an error like "[E] Vdisk xxx recoverFreePTracks failure: Error 214 code 2063". With this fix, the RG can be recovered with minimal data loss instead of losing the whole RG. IJ06857
- Fix the potential data corruption issue when compression and LROC feature are being used. IJ06252
- gpfs.snap: improve performance on large cluster. IJ06362
- This update addresses the following APARs: IJ05666 IJ06015 IJ06045 IJ06046 IJ06047 IJ06049 IJ06100 IJ06207 IJ06218 IJ06224 IJ06252 IJ06269 IJ06362 IJ06364 IJ06513 IJ06514 IJ06544 IJ06545 IJ06753 IJ06770 IJ06857 IJ06864.
Problems fixed in IBM Spectrum Scale 4.2.3.8 [April 12, 2018]
- Fix an assert with "logAssertFailed: (SGFilesetId)recordNum <= ((SGFilesetId)999999999)" that can occur when NFS clients access the same files in a snapshot of an independent fileset IJ04666.
- Fix a problem in which offline fsck deadlocks when orphaning inodes IJ04520.
- Fix an assert in BufferDesc::flushBuffer Assert exp(!addrDirty || synchedStale || allDirty inode 554192 block 10 addrDirty 1 synchedStale 0 allDirty 0 that can happen during shutdown IJ04520.
- Fix an assert 'reaperThreadStared == 1' that can occur during fsck IJ04520.
- Fix a problem in which a failed recovery does not return the correct error code; it returns error 2 regardless of the cause. This can occur during fileset recovery when a remote mount is stale IJ04520.
- Fix a possible memory corruption that can occur when group quota information is retrieved by multiple clients concurrently IJ04520.
- Fix a failbacktoprimary --start from old primary failure. This can occur when there is an RPO snapshot mismatch between the acting and old primary IJ04520.
- Fix a problem in which mmbackup would produce an empty shadow database file that only contained the file header. This can occur when the shadow database file is a binary file IJ04661.
- Fix a rare case of long waiters 'waiting for new SG mgr' which may happen if a file system has no external mounts and the 'tsstatus -m' command runs on a fs manager node in a specific time window IJ04520.
- Fix a problem in which mmlsmount fs will always show that the fs is in the internal mount state on the SG mgr node. Also mmfsadm dump strip shows the incorrect state for the fs. This can occur when the SG manager node is switched from manager node to client node and back IJ04520.
- Fix a "struct error: Invalid XAttr overflow" IJ04520.
- Fix a problem in which the ibmobjectizer fails to objectize files when there are a significant number of OpenStack projects/accounts (~5000 or more) IJ04660.
- Fix a problem in which the failback command fails but returns error 0 (no error). This can occur when the failback command is run on a non-IW fileset IJ04520.
- Fix an Assert exp(inodeFlushFlag) openinst-vfs.C 1560 which can occur creating and deleting filesets and snapshots during heavy IO IJ04520.
- Fix a problem in which a node is expelled when multiple network reconnects occur IJ04520.
- Fix a deadlock that can occur during file system repair IJ04520.
- Fix a problem in which mmfileid was unable to list small files IJ04655.
- Fix a mmapplypolicy/tsapolicy core dump: ThreadThing::check mutexthings.C:170. This can occur during certain failure or recovery scenarios IJ04520.
- Fix an issue in which the fileset failed to be recovered and left in needResync cache state. This can occur during recovery with a heavy workload and the file system is unmounted from the gateway node IJ04520.
- This fix adds improvements to the state monitoring code IJ04665.
- Fix a problem in which an IW fileset directory has been modified at home but the cache fails to validate and fetch the new change. This can happen following a recovery or failover that has been run on the IW fileset at the cache site IJ04520.
- Fix a "There is not enough free memory available for use by mmfsck in 192.168.110.35 (c35f2m4n16)." error that can occur running mmfsck in a continuous loop IJ04520.
- Fix a FSSTRUCT error which can occur when there is a race between expanding the first directory block on one node and prefetching of the same block on another node IJ04520.
- Fix a replica mismatch which can occur during restripefs -m or -r and disks are going down and coming up and nodes are going down and coming up IJ04658.
- Fix an assert that can occur initializing certain maintenance commands IJ04520.
- Fix a problem in filesystem recovery where it left the filesystem in a state that might cause other filesystem cmds to hang IJ04520.
- Fix a rare timing assertion when the file system is force unmounted at the same time that quota files are being flushed to disk IJ04520.
- Fix corrupt entries being added to the Ganesha exports configuration file which can occur during the mmnfs export add command if all white space is entered for the --client option's argument IJ04520.
- Fix a problem where on AIX the mmcrnsd call clears out the PVID that was assigned by the OS IJ04657.
- Fix code to avoid unnecessary file system panic and unmount on client nodes during mmchdisk start command. The file system panic/unmount could occur when a disk that has been started became unavailable again in the middle of mmchdisk start command IJ04656.
- Fix code so that mmfsd kill does not give IO error to NFS client IJ04520.
- Fix an issue where recovery was stuck on the local cluster due to Gateway node changes in a remote cluster environment IJ04520.
- Fix an exception in AclDataFile::findAcl() that can occur during a node being expelled IJ04520.
- Fix a "Error validating trailer version..." error that can occur when a RG resigns IJ04520.
- Fix an assert like "logAssertFailed: OWNED_BY_CALLER(lockWordCopy, lockWordCopy)" when trying to revive a defective pdisk in RGCK IJ04520.
- Fix a problem where reading symbolic links pointing to nothing at home can cause Assert at the cache site IJ04659.
- Fix assert "exp(isAllocListChanging())" which may occur during a SGPanic IJ04520.
- Fix a crash that can occur on a Gateway node while updating the policy attribute from home IJ04668.
- Fix an assert that can occur in an AFM environment while running mmunlinkfileset IJ04970.
- Fix the inode indirection level assert that can happen during the failure process of clone file creation IJ04520.
- Fix a "logAssertFailed: *nReservedP == 0, 5932, vbufmgr2.C" assert that can occur while vtrack data corruptions are being detected and fixed IJ04520.
- Fix a problem where a psnap creation on a gateway node that is also serving as the FS manager can deadlock when the fileset in question is in need of a recovery IJ04520.
- This fix enables buffer dirty bits debug data to be collected under "debugDataControl heavy" and trace level "fs 4" IJ04662.
- Fix a signal 11 that can happen when there is a race condition between daemon startup, file system mount and snapshot quiesce rpc handling IJ04520.
- Fix a deadlock that can occur if 2 or 3 filesets are in recovery and a gateway node is involved IJ04520.
- Fix a potential assert when a compressed file is updated in the last data block causing a COW to the snapshot that was recently read IJ04520.
- Improve performance in the AFM environment for replication of small files over high-latency networks. The feature can be enabled by setting afmIOFlags=4 IJ04667.
- Fix a problem where mmapplypolicy crashes or loops indefinitely after apparently completing all the work it should have done. This should be unusual, as it only applies when there was a failure of a "helper" during the execution phase IJ04520.
- Fix a problem in which the PaxosChallengeCheck thread is reported as a long waiter in the GPFS log and/or the dump file IJ04663.
- Fix a problem in which gpfs_igetattrs with 1M bufferSize fails with ENOMEM IJ04664.
- Fix a problem in which Cron implementation on Ubuntu skips the files in /etc/cron.d with dots in file names IJ04520.
- Fix an assert that can occur when the data copy offset plus the copy length exceeds the file size IJ04520.
- Fix a potential assert when a compressed file is extended due to a truncation operation beyond its original file size IJ04520.
- Fix a problem in which mmshutdown couldn't clean up bind mounts and mmmount can't umount bind mounts on older kernels IJ04520.
- Fix mmbackup and mmimgbackup failures that can occur when used with IBM Spectrum Protect because they both use incompatible gskit libraries IJ04669.
- Fix a problem in which the mmcrvdisk command fails if the recoverygroup name contains periods. This fix applies to GSS/ESS customers IJ04520.
- Fix a node crash that can occur when the daemon is shutdown during a mmap write IJ04520.
- Fix a "bgP == __null" assert when doing a truncation operation on a compressed file IJ04520.
- Fix a double memory free issue, which may cause assert like "Assert exp(vHoldCount > 0) in vbufmgr.C:280". This can occur when there is heavy IO and pdisk errors IJ04520.
- Fix CCR client code to avoid segmentation fault during backup command IJ04520.
- Fix code to avoid segmentation fault during PaxosSharedDisk::readDblocks() in GPFS mmfsdi IJ04520.
- Fix a Signal 6 BufferDesc::traceDirtyBits at SmartPtrs.h:1329 during dumping buffer dirty bits IJ04520.
- Fix temporary file system busy state when mounting the file system right after the file system name was changed IJ04520.
- Fix a logAssert "exp(errP != NULL)" which may happen while accessing gpfs snapshots on a nfs client. The log assert is caused by a race of a file access and a snapshot deletion IJ04520.
- Fix a rare deadlock which can happen between command which changes the cluster manager (like mmchmgr, mmexpelnode, mmchnode --nonquorum) and a quorum lost event IJ04520.
- Fix a kernel panic: RIP: cxiPanic, mmfsd at: SendFlock which can occur during a heavy fcntl load and a loss of a node to cause a quorum loss IJ04520.
- Fix a "logAssertFailed: !isCfgMgr()" error which may happen after a node failure event IJ04671.
- Fix recently introduced slow command performance. It affects server-based clusters that disable mmsdrservPort IJ04672.
- Fix code to avoid unexpected GPFS cluster manager changes to other quorum nodes. This can occur on large clusters (greater than 500 nodes) during heavy IO and heavy CCR activity (vputs/vgets/fputs/fgets) IJ04673.
- Fix a problem in which the mmhealth monitoring daemon was running some commands twice IJ04520.
- Correct inconsistent behavior of mmnfs export list command when -Y is used IJ04520.
- This fix prevents the daemon from crashing if a user specifies a large value for the number of subblocks IJ04520.
- Fix Python exceptions in the mmnetverify command when the flood operation is used with a target node whose GPFS node number is greater than 255 IJ04862.
- Fix a problem in the mmadquery command not being able to list AD users if the "AD domain name" does not match the "AD domain shortname" IJ04784.
- Address a problem where cleanup on handlers can happen twice (1 called from unmount of the file system and other from a panic on the FS at the same time), and this could result in a bad memory access causing a Signal 11 IJ04520.
- Fix an issue in the AFM environment where a daemon asserts at the gateway node when a file is being removed. This happens when a file is deleted immediately after the creation and the file system is already quiesced IJ04520.
- Fix a problem in which AFM orphan entries cannot be cleaned online on an AFM disabled fileset IJ04520.
- Fix a problem in which the "mmcesnode suspend -N" command failed to suspend all nodes in the list IJ04087.
- Fix a mmfsd core dump which can occur when mmpmonSocket is receiving events IJ05223.
- This patch must be applied for all systems using Spectrum Scale RAID with write caching drives. If slow disk detection has been disabled by setting the nsdRAIDDiskPerformanceMinLimitPct config parameter to zero, it can be re-enabled by restoring this parameter to its default value IJ04864.
- Fix the false ENOENT error when operating on files in an AFM fileset IJ04520.
- Fix a problem that when pool usage exceeds the warning threshold configured by mmhealth, the message in /var/log/messages talks about "metadata" but should be "data" IJ04863.
- Fix waiting 1786.523031929 seconds, ProbeClusterThread: on ThCond 0x116D2F80 (0x116D2F80) (StripeGroupTableCondvar), reason 'waiting for SG cleanup'. This can occur during a heavy workload while nodes are being expelled IJ04520.
- Fix a mmfsd assert at: Assert exp(mdiBlockP != __null) ts/vdisk/mdIndex.C 2299. This can occur creating a very large vdisk while a repair is in progress IJ04520.
- Fix a log recovery error and file system unmounts on all nodes that can occur during a heavy directory create, delete, rename workload IJ05483.
- Fix a problem in which mmnfs export list -Y might list more exports than expected and is inconsistent with output when omitting option -Y IJ04520.
- Change to not mark disk down and unmount file system when adding new disk paths for disks being used by the file system IJ05258.
- Fix a problem in which the node cannot be started and you see this error: /usr/lpp/mmfs/bin/runmmfs[336]: .[213]: loadKernelExt[674]: InsModWrapper[95]: eval: line 1: 18672: Memory fault. This can occur on SLES12 SP1 after upgrading the kernel to kernel 3.12.74-60.64.82-default IJ05073.
- This update addresses the following APARs: IJ04087 IJ04520 IJ04655 IJ04656 IJ04657 IJ04658 IJ04659 IJ04660 IJ04661 IJ04662 IJ04663 IJ04664 IJ04665 IJ04666 IJ04667 IJ04668 IJ04669 IJ04671 IJ04672 IJ04673 IJ04784 IJ04862 IJ04863 IJ04864 IJ04865 IJ04970 IJ05073 IJ05223 IJ05258 IJ05483.
Problems fixed in IBM Spectrum Scale 4.2.3.7 [February 16, 2018]
- Fix a problem in which if inode expansion is interrupted, it may leave nAllocatedInodes inconsistent between sg descriptor and fileset metadata file IJ03086.
- Fix a problem to avoid filling up /var/adm/ras/ with ever growing mmsysmon.log on AIX IJ02566.
- Fix a problem reading files from a snapshot with mmap on AIX IJ02566.
- Fix an issue in the AFM environment where daemon asserts during the AIO writes on AFM filesets IJ02566.
- Fix code to recover corrupted files in CCRs committed directory during GPFS startup which was causing other components to fail IJ03085.
- Fix E_VALIDATE errors on ACL blocks after disk outage IJ02566.
- Fix a problem where a user cannot make changes to the afmTarget after creating the fileset with a wrong mapping name or host name in the afmTarget field IJ02566.
- Fix a race condition where we skip generating a dmapi close event IJ02628.
- Fix a problem in which one way reconnect misses sending the cleanup RPC which results in performance degradations IJ02627.
- Fix node hangs due to the consumption of DMAPI event mailboxes IJ03083.
- Fix a log assert which may happen during mmdelsnapshot if the file in snapshot has DITTO xattr overflow block address IJ02566.
- If HAWC is enabled for a file system for which log recovery has failed, the recovery log is no longer dumped, because the recovery log may contain user data. Also, dump files are now created with more restricted permissions IJ02566.
- Fix an issue in the AFM environment where gateway node hangs under the stress due to mailboxes unavailability IJ02566.
- Fix an issue in the AFM environment where daemon asserts while mounting the fileset target path. This happens when AFM is not enabled at home IJ02566.
- This fix adds a SG Panic check for message MBHashFetch and MBHashFetchAsync IJ02566.
- This fix stops the monitor immediately when a node leaves a cluster to avoid unexpected behavior when the node is re-added to the cluster IJ02566.
- Fix the potential duplicated RPC issue while doing network reconnect IJ02566.
- This fix optimizes closing a file system to enable mmfsck process to start ASAP IJ02566.
- This fix stops recovery of a fileset, if the recovery fails with error 78(TIMEDOUT) IJ02566.
- This fix avoids gpfsReserveDelegation exceptions for kworker return of nfs4 leases IJ02566.
- This fix adds a SG Panic check in data block flush IJ02566.
- Fix an issue in the AFM environment where a gateway node runs into a soft lockup issue with UID remapping enabled. This happens when cache and home are running on different architectures IJ02566.
- Treat NFS bad filehandle errors as STALE during AFM failover/DR changesecondary, so that replication continues without having to drop the queue IJ02566.
- This fix increases the default value of the socketMaxListenConnections configuration variable to 8192 on Linux IJ02566.
- Fix a problem in which the mmlsquota and the mmsetquota manpages were missing a reference to mmrepquota -t IJ02630.
- Fix sample script filehist to avoid failing with a divide by zero error. This can occur if the file system has different block sizes. It can also occur if there is no /dev IJ03084.
- The command mmkeyserv tenant show now also shows the RKM Id information IJ02566.
- Fix a pdisk state transition issue that occurs when the disk drive power-on procedure takes longer than expected. When bringing up this kind of pdisk from the power-off state, mmchpdisk --revive may sometimes report an error and the pdisk goes missing instead of becoming ok IJ02566.
- Fix a problem where a deadlock can happen between application IO to the AFM fileset when the home/secondary site fileset has gone stale IJ02566.
- Fix assert exp(!"oldDiskAddrFound.compAddr(*oldDiskAddrP)") which may happen when preallocating data in an inode file. Note that fallocate() on a GPFS file system, or write()/fallocate() on an FPO file system, can trigger preallocation IJ03162.
- Fix an issue in the AFM environment where some files are moved to .ptrash directory intermittently over GPFS backend IJ03095.
- Fix a problem in which the RGCM keeps causing RG to be resigned and recovery failure when primary node is down IJ03248.
- Fix a problem with Receive Worker threads going CPU bound IJ03087.
- Fix a deadlock involving failed "mmfsctl resume" command, SG panic and disk issues IJ02566.
- Fix the pending log file migration assert that can happen when doing file system restripe operation or adding/deleting/changing the file system disks IJ02566.
- Fix erratic inode expansion behavior and spurious 'Expanded inode space' log messages under multi-node create workloads IJ02566.
- Fix a rare case that truncate() does not set file size correctly. The file size is set to full block boundary incorrectly and the fragment is lost IJ03091.
- Fix a problem in which GNR deadlocks with mmdiag --waiters showing many threads in state "wait for log buffers". This occurs some time after errors were encountered on log tip devices IJ02566.
- Fix an issue in the AFM environment where already existing uncached files are not prefetched correctly IJ02631.
- Fix log assert "Assert exp(totalSteps >= 0) in file workthread.C". It happens when running the mmlsfileset -r or mmlsfileset -i command against a file system that has a huge inode number or many independent filesets IJ03234.
- Fix a problem where pmsensor service crashes because there are NULL entries returned from mmpmon for AFM filesets IJ03101.
- Fix a problem in which we see mismatched replicas in the aclFile after down disks IJ02566.
- Fix a problem that occurs when the command "mmhealth cluster show" is run with an invalid component name IJ02566.
- Fix a problem in which we did not show filesystem information in mmhealth output for AIX IJ02566.
- This fix improves the monitor for handling of failed disks IJ02566.
- Fix a problem in which the installer fails with error "Not all services available on specified nodes" IJ02566.
- This fix improves the output of mmhealth to be more helpful for problem determination IJ02566.
- Fix unnecessary failovers caused by starting multiple instances of the same monitor IJ02566.
- Fix a problem in which the mmsysmon.log was not being rotated daily on RHEL 6/SLES 11 IJ02566.
- Fix a problem in which the 'mmnfs export list' command cannot be piped to 'head' IJ02566.
- Fix an issue in the AFM environment where files are moved to .ptrash during the rename on independent-writer mode filesets IJ02566.
- Fix a mount failure that occurs when someone has an environment lock IJ02566.
- Fix fcntl performance issue IJ03096.
- Fix a problem with mmsmb exportacl list producing wrong output IJ03166.
- Fix an issue in the AFM environment where gateway node crashes intermittently. Also fix an issue where lookup returns incorrect results IJ02566.
- Fix display issues on the GUI for health data. This can occur if entities are named similarly to one of health's main components (e.g. "ces" or "gpfs") IJ02664.
- Fix a problem in which network monitoring can leave components in checking state and exceptions in the trace log. This can occur when the name is like loop@loop IJ02566.
- This fix ensures that server reachability is accurately reported for multiple servers with CES stack configured for LDAP authentication IJ03097.
- When the recovery policy fails with an error 2, rerun the policy with a higher debug level IJ02566.
- This fix adds a protection to prevent a compressed fragment from being expanded without being uncompressed first in some unexpected conditions of having inconsistent compression flags. This fix also replaces an assert with an IO error to minimize the user impact IJ03102.
- Fix an issue in which remote NSD clients drop into a long retry loop during an ESS outage. This can occur if there are multiple ESS building blocks and GPFS replication is enabled in the cluster. When shutting down both servers of an ESS building block simultaneously, remote NSD clients can experience a long retry loop like 'waiting for stateful NSD server error takeover (1)' IJ03098.
- Fix a problem where recovery keeps failing with an error 2 because the AFM recovery script wasn't able to handle directory names in the fileset that had trailing spaces in them IJ03103.
- This fix changes the output of the command "mmcallhome" to be identical to the 5.0.0.0 output IJ02566.
- Fix a problem in which mmsysmonitor failed to start on Ubuntu 14.4.5 IJ02566.
- This fix prevents time outs that can occur when using mmapplypolicy with TCT (mmcloudgateway) when large instance counts are used for migrations on a given cluster node IJ03194.
- Fix a segmentation fault that can occur when you have very long file path names being read into policy generated files IJ02566.
- Fix a mount failure and mmchdisk/mmrestripefs performance issues IJ03636.
- Fix code to call the tiebreakerCheck user exit script even if the CCR is enabled IJ02566.
- Fix an issue in the AFM environment where file listing during a readdir fails for dirty files in local-updates mode. This problem happens with the ganesha NFS server having AFM local-updates mode fileset exports IJ03424.
- This fix improves the trace logging of node-to-node health event propagation IJ03689.
- Fix a problem in which data is lost when open(O_TRUNC) fails due to a share conflict IJ03608.
- Fix a problem in which gpfs.snap stops with an error message when it stores (TARs) log files IJ03898.
- Fix a problem in gpfs.snap in which the dmesg -T option doesn't work on SLES11 IJ02566.
- This update addresses the following APARs: IJ02566 IJ02627 IJ02628 IJ02630 IJ02631 IJ02664 IJ03083 IJ03084 IJ03085 IJ03086 IJ03087 IJ03091 IJ03095 IJ03096 IJ03097 IJ03098 IJ03101 IJ03102 IJ03103 IJ03162 IJ03166 IJ03194 IJ03229 IJ03234 IJ03235 IJ03248 IJ03353 IJ03367 IJ03424 IJ03608 IJ03636 IJ03689 IJ03898.
Problems fixed in IBM Spectrum Scale 4.2.3.6 [November 30, 2017]
- Fix a potential issue with cesiplist file updates in CCR that can result in messages like "cesiplistLocalSerial is not numeric: ()" IJ02158.
- This fix excludes hidden CCR files from the scheduled callhome data collection IJ00977.
- Fix code to avoid long running CCR synods on different quorum nodes causing long running GPFS 'mmgetstate -a' command IJ00977.
- Fix an issue in the AFM environment where AFM prefetch causes a daemon assert if the directories are deleted after the prefetch queueing IJ00977.
- This fix respects the mmdelsnapshot -N option when resuming DeleteRequired snapshot IJ02220.
- Fix a rare assert that can occur during metanode takeover due to a stalled indirect block left in the cache IJ00977.
- Fix a deadlock caused by the allocation region requests handler. Users would see long waiters on allocation manager cursors when deadlock happens IJ01063.
- Address a problem where the failbackToPrimary --start command tries to delete any snapshots later than the latest snapshot present at the old primary end. When that snapshot is not present at the acting primary, the error is now handled instead of being ignored while the failback command continues IJ00977.
- Fix an inode count leak problem which may happen when the gpfs_iwrite/gpfs_iwritex API fails with ESTALE. The tsrestorefileset utility uses this GPFS API IJ00977.
- Fix hangs and timeouts that occur during snapshot commands in rare failure cases IJ01335.
- Fix a problem where the close issued on a remote NFS mount can get stuck and cause an unlink of an AFM fileset to get stuck IJ00977.
- Fix the assert issue on generation number when flushing or writing indirect blocks. This issue only happens when the clone files were used and deleted IJ00977.
- Fix a disk address assert that can occur when a thread reads a compressed region of a file at the same time when a different node uncompresses the same region IJ00977.
- Fix a vectored DIO (writev/readv) deadlock which may happen if the file system is being quiesced IJ00977.
- Fix a potential infinite loop when reading a compressed file with alternating compressed and uncompressed regions IJ00977.
- Add specific handling of SKLM error messages in case a required configuration parameter in the SKLMConfig.properties file is missing. Add a more detailed error message from mmkeyserv command in case a configuration parameter is wrong or missing IJ00977.
- Fix a potential snapshot file data corruption that can be caused by a crash occurring when a compressed file is being deleted from the active file system or a snapshot IJ00977.
- Fix a NULL buffer pointer dereference problem by adding synchronization for accessing the buffer pointer IJ00977.
- Fix a problem in disk verification that wrongly calculated the on-disk stripe group descriptor checksum IJ01325.
- Fix a potential E_HOSTDOWN (80) error when a compressed file is being appended while the node is in the process of becoming a metanode at the same time IJ00977.
- Fix an issue in the AFM environment where recovery fails with error 112 while checking for the deleted directories IJ00977.
- Fix the no space issue when running mmchdisk start command. A similar issue can happen on normal writes IJ01065.
- Add more provision to catch a case where a Queue item is becoming NULL when IO is happening to the fileset and the queue is being flushed IJ00977.
- Fix an assert that can occur when xattrs are heavily used and there is an unusual block size setting IJ00977.
- An inaccurate and unnecessary assert in buffer bitmap processing is removed IJ00977.
- Fix wrong fs struct error format IJ00977.
- Fix a deadlock situation involving lock conflict while stealing a buffer for file system metadata repair IJ00977.
- Fix a logic bug which may cause log recovery to fail with E_RECOV_INCOMPLETE (code 234). This problem can happen on PARALLEL_LOGRECOVERY enabled builds (since GPFS 4.2.1) if the log file size is bigger than 16MB IJ00977.
- Add a debugging utility to calculate checksum values of disk data IJ00977.
- Fix a problem in which mmgetstate -s may not display the correct number of quorum nodes defined in the cluster IJ01064.
- Fix a signal 7 that can occur when a compressed file is expanded in hyper allocation mode IJ00977.
- Address a problem where re-applyupdates should not invoke failbacktoprimary --start when failbacktoprimary --stop has failed due to changes that are detected at the acting primary IJ00977.
- Fix assert "'false' failed" in paxosserver.C:3129 in the GPFS daemon (CCR) that happened during GPFS startup IJ00977.
- Fix an issue in the AFM environment where deleting directories from .ptrash directory fails with directory not empty error. This issue happens when the directory is deleted from home before readdir is performed at cache IJ01066.
- In a dump file, the dump directory size was incorrectly reported as PB when the unit is TB. The problem is now fixed IJ00977.
- Fix a problem in which AIX/NFS servers deadlock trying to recover a client's fcntl lock following the loss of another node in the same cluster IJ01068.
- Fix a problem in which the prefetch command did not work on a special character (--) named file IJ01067.
- When the read or write of a log vdisk fails during rebuild operation, use the IO error code to trigger resignation, as opposed to using E_OK IJ00977.
- Fix a potential assert that can occur when a compressed file is being closed after having been deleted and any compressed compression group within the file was partially copied (COW) before the file deletion IJ00977.
- This fix supports callbacks with a long list of parameters IJ01069.
- With this fix the output of mmsmb export list -Y and mmsmb config list -Y is changed. It now has an additional colon at the end of the output lines IJ00977.
- Fix an issue in the AFM environment where the cached bit is not set after reading the entire file. This causes eviction failures and also performance degradation during write operations IJ00977.
- Address a problem where changing the backend from NFS to GPFS (or vice versa) can cause bad filehandle errors IJ00977.
- Update directory code to avoid excessive recursion that could lead to stack overflow. Stack overflow could cause GPFS daemon to either crash with Signal 11 or get stuck in a signal handler IJ01070.
- Fix a DBGASSERT exp(bytesLeftInStride > 0) which may happen if multiple threads access the same file and at least one of them accesses the file with a stride access pattern IJ01114.
- Fix service_running appearing in the mmhealth eventlog without a reason IJ00977.
- This fix is recommended if you see file system hangs requiring reboots to recover IJ00977.
- Fix a GPFS daemon assert that can occur when the inode0 file grows to more than 4B blocks IJ01087.
- Fix a deadlock scenario involving starting a disk at the time of recovery IJ01328.
- Fix an issue in the AFM environment where a prefetch can cause a filesystem quiesce not to happen when home is not responding. This will cause a deadlock at cache cluster until home starts responding IJ00977.
- This fix prevents a segmentation fault in tslspdisk. This fix applies to GSS/ESS customers IJ01327.
- Fix an issue in the AFM environment where ACLs are not updated properly in the cache with directory inheritance. This happens when users do not have permission to update the ACLs IJ00977.
- Fix an assert - exp(ioDataUpdateInProgress == 0 OR DaemonShuttingDown) which may happen if the application does IO with fuzzySequential access pattern IJ00977.
- Fix for gpfs.snap to collect CES address marker files (node-affinity information) IJ01072.
- Fix code to speed-up GPFS CCR read requests and mm-commands when reading from the CCR IJ01086.
- This fix allows customers to run the mmchenclosure command to confirm that a storage enclosure fan is reporting a failure and can be replaced. This fix applies to GSS/ESS customers IJ01330.
- In a mixed cluster where the HSM session manager runs on a 4.2.x node, the access to HSM migrated files from a 4.1.x node now works fine IJ01115.
- Fix a problem in the GPFS mmap code that can result in negative mmap counters. When a file was memory mapped by a child process, GPFS skipped incrementing the mmap counters if it failed to verify the process's credentials because the number of group IDs exceeded the limit, but it still decremented the mmap counters at close time. The resulting negative mmap counter caused the node to crash IJ01913.
- Fix a performance problem in mmsmb and a minor problem with its machine readable output IJ01073.
- Fix a problem that can result in a flood of handle_network_problem_info mmhealth events. This can cause the GUI to crash IJ02010.
- Fix the high CPU usage issue on Windows due to a busy loop in a receiver thread when there are some network errors IJ01863.
- This update addresses the following APARs: IJ00977 IJ01063 IJ01064 IJ01065 IJ01066 IJ01067 IJ01068 IJ01069 IJ01070 IJ01072 IJ01073 IJ01086 IJ01087 IJ01114 IJ01115 IJ01325 IJ01327 IJ01328 IJ01330 IJ01335 IJ01863 IJ01913 IJ02010 IJ02158 IJ02220.
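The 4.2.3.6 change to mmsmb export list -Y and mmsmb config list -Y adds a colon at the end of each output line, so a script that splits these lines on ":" may suddenly see one extra empty field. The following is a minimal Python sketch of a splitter that tolerates both forms. The helper name and the sample lines are hypothetical and are not actual mmsmb output.

```python
def split_y_line(line):
    """Split one colon-delimited -Y line into fields.

    A single trailing colon (the form produced after this fix) is treated
    as a line terminator rather than as an extra empty field.
    """
    line = line.rstrip("\n")
    if line.endswith(":"):
        line = line[:-1]
    return line.split(":")


# Hypothetical sample lines, for illustration only (not real mmsmb output).
old_style = "mmsmb:export:HEADER:name:path"
new_style = "mmsmb:export:HEADER:name:path:"

# Both forms yield the same list of fields.
print(split_y_line(old_style) == split_y_line(new_style))
```

One limitation of this approach: it cannot tell an old-format line whose last field is empty from a new-format line with the terminating colon, so scripts that depend on a fixed field count should also check the header line.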
Problems fixed in IBM Spectrum Scale 4.2.3.5 [October 12, 2017]
- Fix a log assert "Unable to find cached PG map entry for pg X in vIndex Y". This fix will produce a GNR event log when unable to fix a media error IV99611.
- Fix a problem that can lead to a mount failure. After an NSD was deleted and created again, the node tried to reread the disk configuration to update the NSD information, but a network issue caused that to fail and left the node with stale NSD information IV99675.
- Fix a mutex locking order problem, which can lead to a deadlock when the file system is being closed IV99611.
- Fix the use count leak on a stripe group to resolve a stripe group cleanup pending issue IV99611.
- Fix an assert 'iter->second' in the GPFS daemon (CCR) that can occur during mmshutdown IV99611.
- Fix a problem in which the CTIME is not updated correctly on files when using Ganesha IV99677.
- Fix the 93 seconds delay always seen during GPFS daemon startup on the current cluster manager node IV99611.
- Fix GPFS (CCR) logic to close used socket file descriptors just one time avoiding failed GPFS remote procedure calls IV99611.
- Fix generating unnecessary recalls when truncating migrated files IV99676.
- Fix a problem in which a file system unmount will fail if FileHeat is enabled and snapshots are present IV99611.
- Fix a problem in which the mmnfs export list command fails in an unpredictable manner IV99611.
- Fix log assert when a Windows node is added into a cluster that has an encrypted fs IV99611.
- Fix an ofP->inodeLk.get_lock_state() & (0x2000000000000000LL | 0x4000000000000000LL) assert that can occur when FileHeat is enabled and snapshots are present IV99611.
- Fix a problem in which offline fsck does not repair all ind block replicas in reserved files which can lead to more corruptions during fs use IJ00397.
- The fix affects customers using mmhealth THRESHOLD SERVICE. Fix a problem in which the mmhealth THRESHOLD state for some nodes never changes from CHECKING. This is for all platforms IV99611.
- Fix a problem in which the default grace period on a Ganesha system is not displayed correctly IV99611.
- Fix for blocked cesFailoverLock (cesFailoverLock: failed with rc 99) IV99611.
- Fix a problem in which accessing a TCT migrated file can result in a hang when thumbnail support is used IV99611.
- Fix a ("delay forever" == "completed") daemon assert IV99678.
- Fix an issue in the AFM environment where incorrect filtering under certain workloads causes writes to be dropped. This prevents replication from completing fully and causes a data mismatch between cache/primary and home/secondary IV99796.
- This fix affects customers that have renamed the cluster and are using mmhealth THRESHOLD SERVICE. Fix a problem in which the SYSTEM HEALTH eventlog contains unhealthy alerts for pool_data, pool_metadata, and inode components even though they don't have capacity utilization problems. This can occur on any platform IV99611.
- Fix a segmentation fault that happens when the file system rebalancing fails to open the file system IV99611.
- Fix an issue in the AFM environment where incorrect entries in the prefetch list file (e.g. . and ..) cause directory block corruption because AFM permits a file named '.' to be created without validation of the input IV99679.
- Fix a performance degradation problem when running tar from an NFS client IV99709.
- Correct the %filesetName that is passed to the callback command for the usageUnderSoftQuota callback event. This can only occur on FILESET quota types IV99680.
- Fix quota usage accounting, in a file system with strict allocation "whenpossible", when not all data replicas can be allocated due to lack of space or failure groups IJ00031.
- Fix a deadlock seen while using NFS with TCT IJ00094.
- Fix possible file system corruption caused by a network reconnect IJ00398.
- This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IJ00398 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796.
Problems fixed in IBM Spectrum Scale 4.2.3.4 [August 24, 2017]
- Avoid erroneous FSSTRUCT error in rare cases after a SG panic.
- Fix a problem in which ESS server node deadlocks with many threads showing 'wait for log buffers'.
- Fix a memory corruption issue that can occur during or after a reconnect.
- Fix a logAssert "!IsMemoryMappingFree" which is caused by a race between mmshutdown and 'tsctl nqStatus'.
- Fix a possible GPFS daemon assert that can occur while running the mmdelsnapshot command. The assert can happen when prefetch is reading from a snapshot that is being deleted.
- Fix an issue in the AFM ADR environment where a secondary mount failure causes a kernel crash.
- Ensure a copy of the keystone and auth config is created when the Object protocol is disabled and uninstalled from the protocol environment.
- Fix an issue in AFM environment where unresponsive remote mount causes synchronous operations like Reads to fail intermittently.
- Address a problem in resync/failover/changeSecondary in which recreating a deleted file at home/secondary might cause an invalid memory access and crash the daemon.
- Fix a condition where cNFS on SLES12 or later fails to restart statd.
- Fix deadlock that can occur during inode cleanup in Linux kernel 3.13 and later.
- Address a problem where applyupdates should not be started until failback --start is completed successfully.
- Fix an incorrect assertion which can go off when the file system manager is brought down while running one of the following commands: mmrestripefs, mmdeldisk, mmrpldisk.
- Fix an issue in the AFM environment where a fileset unlink or an unresponsive remote mount can cause a deadlock.
- Improve snapshot command error reporting when batching is used.
- Fix a problem with the GPFS file system metadata scanning function in IBM Spectrum Scale 4.2.3.0 - 4.2.3.3 which may result in file system data or metadata corruption on certain failures, such as running out of GPFS pagepool memory while running mmrestripefs, mmdeldisk, mmrpldisk, or mmadddisk -r.
- Return ESTALE when file referenced by FileHandle is deleted instead of ENOENT.
- Fix a kernel assert caused by missing buffer lock checking.
- Address a problem where applyUpdates generates operations on files/dirs that were removed from primary but never played to secondary, and later applyUpdates fails to pull such files/dirs back.
- Fix a problem in which GPFS was returning EBADF when Ganesha provided an fd which is not a GPFS fd.
- Fix mmfsd daemon crashes that occur after upgrading the GNR code level and issuing mmchconfig release=LATEST.
- Fix an issue in the AFM environment where the mmgetstate or mmdiag commands cause a daemon crash if handlers are being disabled.
- Fix mmsetquota to work with non-standard username.
- Fix a deadlock issue in the AFM environment that occurs when a new gateway node joins the cluster and takes over the fileset from an existing gateway node where the workload is running.
- Change in lockd behavior on SLES12 SP2 may cause it to reboot while recovering another cNFS node.
- Fix mmlsquota -u to work with a non-standard username.
- Fix for missing arping command path declaration for CentOS.
- Fix mmlscluster to correctly output the "Remote shell command" line.
- This fix reduces the snap processing time for clusters which have the Object protocol deployed with the Unified File and Object feature enabled.
- Fix a problem in which mmsmb exportacl list shows SID instead of user name for all users but the first.
- Fix a memory leak in GNR that may cause mmfsd heap memory usage to increase over time, particularly when the workload does many small writes. The problem occurs in an ESS or GSS environment.
- Fix a daemon crash in the AFM environment where a replication error or a fileset unlink causes a memory handler to be accessed incorrectly.
- Fix a bug where failure to execute the mmdevdiscover script caused all pdisks to temporarily lose their I/O paths. This caused the workload to pause while paths were recovered. Sometimes it caused the recovery group to fail over to the backup node. In a few instances, it resulted in the unmount of the file system, requiring manual intervention to restore service.
- This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063.
Problems fixed in IBM Spectrum Scale 4.2.3.3 [July 27, 2017]
- Fix an issue in the AFM environment where lookup and metadata operations on the same file from different nodes cause a daemon assert.
- Allow a user to specify the afmrpo interval in weeks[W],days[D] or hours[H].
- Fix a fcntl revoke handler exception that can occur after an EIO error.
- Fix a pdisk firmware version attribute issue. After the pdisk firmware level is changed, mmlspdisk still shows the old version which may mislead the administrator.
- Change EIO to ESTALE for the open operation of a file that was deleted.
- Fix a race condition that causes command to fail with "invalid version on put".
- Don't allow kernel modules to clean up when removing gpfs.gpl if gpfs.gplbin is currently installed.
- Fix an Assert: exp(dm != inv) L-813 in ../fs/fsop.C which can occur when trying to resend a read event.
- Fix an mmfsd crash due to a Signal 6 (abort). This can happen when removing socket connections in a CCR environment.
- Fix a very rare fault that can occur during heavy directory update workloads.
- Fix a problem in which mmchfirmware --type storage-enclosure fails if running in adminMode=central where only one node has ssh privileges. This fix applies to GSS/ESS customers.
- The mmlsmount command has been changed on all platforms. The change affects the output format of the -Y argument only when an IPv6 address is used.
- This change does not allow mmchcluster -p LATEST on a CCR enabled cluster.
- Fix a problem in which unnecessary allocation manager cursors are being consumed.
- Fix an assert "ioStatsP == __null" that can occur when creating a file system with "-v yes" option, after the "fastest" readReplicaPolicy is enabled.
- Fix an err 112 rename failure that can occur during recovery for an IW fileset.
- Fix an AIX kernel crash due to assert "freeing vnode not on gnode list".
- Fix an issue in the AFM environment where an unresponsive target causes a queue to be dropped during the attribute setting.
- Fix a problem where a verbsRdmaSend-enabled node sent excessive nsdMsgRdmaPrepare messages to an AIX node.
- Fix a replicas mismatch problem caused by mmrestripefs -b wrongly resetting the missupdate flag.
- Fix CCR and/or mmsdrcli-RPC request errors that can occur during authentication of the incoming socket connections.
- Address a problem where renames across directories do not reset the dirty bit which in future leads to a big list of dirty directories and hence recovery on AFM filesets might take longer to scan.
- Fix an issue in the AFM environment where UIDs in ACLs are not remapped during replication over NSD protocol when UID remapping is enabled.
- When an AFM fileset is to be converted to a regular independent fileset, first check for incomplete dirs, uncached files and orphans. If any are found, inform the user to run prefetch prior to the conversion.
- Fix a condition where mmautoload may hang.
- Fix a mmlogsort failure that can occur when mmlogsort attempts to query the time zone information on a node that is down.
- Fix a gpfs.snap command failure in a sudo wrapper environment if the legacy log timestamp format is used.
- Fix a divide by zero problem, when running mmrestripefs, which is specific to a file system using directly attached disks only with no NSD servers defined.
- Fix a dynassert 'mmapFlushSXLock.isLockedShared' which may fail as a secondary failure while daemon is shutting down.
- Fix a problem in which a CES IP could not be removed from a node. This can occur when problems arise during a CES IP move or failover (e.g. network issues, CCR issues, quorum loss). Subsequent runs of mmcesnetworkmonitor did not fix this, and the IP remained active on a node where it should not.
- This fix adds more group memberships (up to 2048) on AIX.
- Fix a "freeSpace != __null" assert. This issue could only happen when doing file system rebalance after suspending some disks.
- Fix a problem in which you get an "Unable to create file in fileset" error even if the inode limit is not reached. This is most likely to occur if the user fills up the fileset from a single node.
- Fix mmcommon test scpwrap.
- This update addresses the following APARs: IV97601 IV97676 IV97677 IV97678 IV97680 IV97681 IV97682 IV97683 IV97685 IV97693 IV97808 IV97836 IV98052 IV98053 IV98054 IV98058.
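The afmrpo entry in the 4.2.3.3 list above describes interval values that carry a W (weeks), D (days), or H (hours) suffix. As a rough illustration of how such suffixes relate to one another, the Python sketch below converts a suffixed value to hours. The exact syntax accepted by the command is not stated in this history, so the "<digits><suffix>" form and the helper name are assumptions made for the example.

```python
import re

# Hours represented by each suffix named in the release note.
_HOURS_PER_UNIT = {"H": 1, "D": 24, "W": 24 * 7}


def interval_to_hours(value):
    """Convert a string such as '2W', '3D' or '12H' to a number of hours.

    The '<digits><suffix>' form is an assumption made for this sketch.
    """
    match = re.fullmatch(r"(\d+)([WDH])", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized interval: {value!r}")
    count, unit = match.groups()
    return int(count) * _HOURS_PER_UNIT[unit]


for text in ("12H", "3D", "2W"):
    print(text, "=", interval_to_hours(text), "hours")
```

Values with any other suffix raise ValueError rather than being silently interpreted.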
Problems fixed in IBM Spectrum Scale 4.2.3.2 [June 21, 2017]
- Address a problem where AFM recovery stalls on a read of an IW fileset when it waits to fetch the file from home after the recovery completes.
- Improve a conditional ccr update for CES IPs list file.
- Fix a problem that causes RenameHandler long waiter. This can occur if PIT is in progress.
- Fix a Ganesha crash that can occur when the user enters a string which contains a colon in any mmnfs command that requires a client option or client list string.
- Fix an assert that can occur when changing gateway nodes to non gateway nodes while operations are being performed on an iw fileset and then the non gateway nodes are turned back into gateway nodes.
- Fix a rare deadlock that can occur between a thread handling mmap and a thread handling a memory map pagefault.
- Fix an E_STALE failure that can occur during a DMAPI dm_read_invis.
- Fix long waiters that can occur on a very busy system doing background snapshot deletion.
- Fix a case where GPFS skipped shrinking the last data block, which causes excess space to be consumed.
- Improve the mmsetquota error message that occurs when a block limit larger than 909T is specified in 'T' units.
- Fix an assert that can occur when mmcheckquota and mmrepquota are passed fileset ids from deleted filesets.
- Fix an issue in the AFM environment where afmHardMemThreshold configuration value is not honored and more memory is used than specified.
- Increase the wait time for commands to execute, before failing.
- Correct formatting of large call counts reported by "mmfsadm vfsstats".
- Fix long waiters that can occur after a file system panic on a very busy system.
- Fix a problem in which inodes become Busy after unmount with NFS and immutable files.
- mmkeyserv: Make it possible to set certain attributes to the default with the use of "delete" or "default" keyword.
- Fix a problem in which mkdir, creates, and resync can fail during revalidation from cache/primary to home/secondary in kernels 3.18 or later.
- Fix a problem in which GPFS cannot handle errors that occur when a DM application was unable to retrieve data due to offline tape.
- Suppress repeated message "Expanded ... inode space N from X to Y inodes" in mmfs.log.
- Fix a rare quota management deadlock caused by error conditions such as out of disk space.
- Fix an issue in an AFM+HSM environment where the resync/failover/changeSecondary commands fail to replicate migrated files.
- Fix an issue in the AFM environment where a fileset force unlink could cause the daemon to crash.
- Address a problem where a gateway node can assert/crash when more than 1024 fileset operations are active across different filesets on a single gateway node.
- Fix a problem in which gpfs.snap may not gather mmfs logs on AIX nodes correctly.
- Fix a clone parent file deletion performance issue.
- Fix a problem in which fsetxattr failed with ENOENT when called with the fd of an unlinked file.
- Fix the Assert exp(fileLockHeld != LkObj::nl) in fetchBufferM() that can occur when compression is being used.
- Fix a problem where a DMAPI invis read/write fails with err 22 when called from a non-session node.
- Fix a problem in which the mmnetverify command does not correctly verify remote addresses when running many tests in parallel.
- Fix a very rare deadlock that can occur after running snapshot commands.
- Fix a policy problem which causes the LOWDISKSPACE callback to not trigger after a fs manager takeover when the old fs manager fails because of an abort or a lost connection.
- This update addresses the following APARs: IV96355 IV96416 IV96417 IV96418 IV96419 IV96420 IV96425 IV96426 IV96429 IV96472 IV96473 IV96474 IV96476 IV96482 IV96483 IV96487 IV96488 IV96585 IV96761 IV96762 IV96763 IV96764 IV96783 IV96786 IV96791.
Problems fixed in GPFS 4.2.3.1 [May 16, 2017]
- Fix a Ganesha crash caused by an applyUpdate.
- Fix a ccrio initialization failure (err 811) when changing the daemon-interface.
- Fix a rare segmentation fault in the mmgetstatus command.
- Fix a SIGBUS error that can occur during a mmap read on a snapshot file.
- Fix a problem in which a flood of "failed to scrub vdisk" log messages appears when a GNR node experiences quorum loss. This applies to ESS/GSS.
- Fix a rare race between unlink, lookup and token revoke that causes a kernel crash in d_revalidate.
- This fix makes sure that Ganesha requests reference a valid GPFS file system.
- Fix a system hang that can occur when a file system is suspended while doing a mmap.
- This fix immediately rejects unreasonably large requests to preallocate inodes with ENOSPC.
- Fix a directory rename issue with IW filesets that can occur if the rename target is an existing directory.
- Fix a fault that can occur when restripe runs while the SG is not mounted on all NSD nodes.
- This fix restricts the afmMaxParallelRecoveries config value to the range 0 to 128.
- This fix removes the unnecessary error message "cannot open /proc/net/tcp6" when shutting down GPFS.
- Fix a problem in which quotas are not handled properly in an AFM environment. This can occur when the hard and soft limit values are very large.
- Fix an "exp(!sgP->isSGMgr())" assert that can occur when you delete a file system and then create a new file system with the same name at the same time.
- Fix an err 112 that can show up in the mmfs logs when mmchnode --gateway is executed.
- Fix a kernel crash that can occur while attempting to mount a loop device to a corresponding file in a GPFS file system or while using a GPFS file system file as a LIO backend.
- Address a problem where applyUpdates continues to run even if the fileset at the old primary is unlinked or the mmfs daemon has been shut down.
- Fix an outband resync failure that can occur if a recovery is triggered by deleting some files in a directory and the directory itself. This applies to AFM/DR environments.
- Fix rename conflicts that can occur in SW/DR filesets.
- Update the log code to prevent a log recovery error when the log file becomes illReplicated. This can happen on a file system with -K set to NO when there is not enough disk space for full replication.
- This fix uses a new interface that reduces the multiple retries that occur every time a lock is freed while there are multiple waiters for the lock.
- Fix an assert that can occur with a DR fileset when the file system is suspended.
- Fix a bug that requires a large amount of free space in /var/mmfs to run change commands.
- Fix a recovery failure (err 17) that occurs when psnap0 deletion fails.
- Fix a daemon assert that can occur in an AFM environment where the mmfsd daemon repeatedly fails to start with a DMAPI-enabled file system on a gateway node.
- Address a problem where trying to queue a writeSplit message to the helper gateway's queue can fail with an error 28 (E_NOSPC).
- Fix an issue that returns EACCES (errno = 13) while running mmapplypolicy when a mounted NFS file system has the same name as a GPFS file system.
- Fix a fastpath optimization defect that can cause an internal error to be returned to the user when it is safe to continue without entering the fast path.
- Fix a problem in which mmapplypolicy/tspolicy hangs after otherwise finishing all work.
- cNFS: Fix a problem in which /usr/sbin/rpcinfo is not found on SLES12 or later.
- Fix a failure in Object Authentication configuration with Active Directory or LDAP. This fix is only required if Object is being configured with Active Directory or LDAP and the DN of the Swift service user (specified in --ks-swift-user) is longer than 79 characters.
- Fix a problem with ESS disk replacement in which the mmchcarrier command may wipe out the pdisk location code. The problem prevents the subsequent mmchcarrier command from proceeding without a valid location code.
- Fix a problem in which a GPFS command may wrongly terminate another process.
- Fix a rare deadlock problem caused by stream write (enableRepWriteStream=yes).
- Update the log recovery code to avoid a GPFS daemon assert after an invalid directory block is detected during log recovery. The code now logs an FSSTRUCT error and fails the log recovery so that offline mmfsck can be run on the file system.
- Fix an mmfsd crash (incompleteOk assertion) that occurs when the number of files in the committed directory doesn't match the number of files in CCR's file list during a new CCR file update request.
- This update addresses the following APARs: IV94991 IV94992 IV94994 IV94995 IV94996 IV94997 IV94998 IV95015 IV95021 IV95230 IV95557 IV95643 IV95925 IV96037 IV96163.