Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 4.2.x applies to all supported platforms.
Problems fixed in IBM Spectrum Scale 4.2.3.5 [October 12, 2017]
- Fix a log assert "Unable to find cached PG map entry for pg X in vIndex Y". This fix will produce a GNR event log when unable to fix a media error IV99611.
- Fix a problem where, after an NSD was deleted and created again, a node tried to reread the disk configuration to update its NSD information but failed because of a network issue, leaving the node with stale NSD information that led to a mount failure IV99675.
- Fix a mutex locking order problem, which can lead to a deadlock when the file system is being closed IV99611.
- Fix the use count leak on a stripe group to resolve a stripe group cleanup pending issue IV99611.
- Fix an assert 'iter->second' in the GPFS daemon (CCR) that can occur during mmshutdown IV99611.
- Fix a problem in which the CTIME is not updated correctly on files in a Ganesha environment IV99677.
- Fix the 93-second delay always seen during GPFS daemon startup on the current cluster manager node IV99611.
- Fix GPFS (CCR) logic to close used socket file descriptors just one time avoiding failed GPFS remote procedure calls IV99611.
- Fix generating unnecessary recalls when truncating migrated files IV99676.
- Fix a problem in which a file system unmount will fail if FileHeat is enabled and snapshots are present IV99611.
- Fix a problem in which the mmnfs export list command fails in an unpredictable manner IV99611.
- Fix log assert when a Windows node is added into a cluster that has an encrypted fs IV99611.
- Fix an ofP->inodeLk.get_lock_state() & (0x2000000000000000LL | 0x4000000000000000LL) assert that can occur when FileHeat is enabled and snapshots are present IV99611.
- Fix a problem in which offline fsck does not repair all ind block replicas in reserved files which can lead to more corruptions during fs use IJ00397.
- The fix affects customers using mmhealth THRESHOLD SERVICE. Fix a problem in which the mmhealth THRESHOLD state for some nodes never changes from CHECKING. This is for all platforms IV99611.
- Fix a problem in which the default grace period on a Ganesha system is not displayed correctly IV99611.
- Fix for blocked cesFailoverLock (cesFailoverLock: failed with rc 99) IV99611.
- Fix a problem in which accessing a TCT migrated file can result in a hang when thumbnail support is used IV99611.
- Fix a ("delay forever" == "completed") daemon assert IV99678.
- Fix an issue in the AFM environment where incorrect filtering under certain workloads causes writes to be dropped. As a result, replication does not complete fully and data differs between cache/primary and home/secondary IV99796.
- The fix affects customers that have renamed the cluster and are using mmhealth THRESHOLD SERVICE. Fix a problem in which the SYSTEM HEALTH eventlog contains unhealthy alerts for pool_data, pool_metadata, and inode components even though they don't have capacity utilization problems. This can occur on any platform IV99611.
- Fix a segmentation fault that happens when the file system rebalancing fails to open the file system IV99611.
- Fix an issue in the AFM environment where incorrect entries in the prefetch list file (for example, '.' and '..') cause directory block corruption because AFM permits a file named '.' to be created without validation of the input IV99679.
- Fix a performance degradation problem when running tar from an NFS client IV99709.
- Correct the %filesetName that is passed to the callback command for the usageUnderSoftQuota callback event. This can only occur on FILESET quota types IV99680.
- Fix quota usage accounting, in a file system with strict allocation "whenpossible", when not all data replicas can be allocated due to lack of space or failure groups IJ00031.
- Fix a deadlock seen while using NFS with TCT IJ00094.
- Fix possible file system corruption caused by a network reconnect IJ00398.
- This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IJ00398 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796.
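The CCR socket item above (close used socket file descriptors just one time) describes a general hazard: once a descriptor is closed, its number can be reused, so a second close may tear down an unrelated connection. The following is a minimal sketch of a close-once guard, not the GPFS implementation. The class name `OnceClosedFd` is hypothetical and a pipe stands in for a socket.

```python
import os
import threading


class OnceClosedFd:
    """Hypothetical wrapper that closes a file descriptor at most once."""

    def __init__(self, fd):
        self._fd = fd
        self._lock = threading.Lock()

    def close(self):
        # Take ownership of the descriptor under the lock so that only
        # one caller ever reaches os.close() for it.
        with self._lock:
            fd, self._fd = self._fd, -1
        if fd < 0:
            return False  # already closed; do nothing
        os.close(fd)
        return True


if __name__ == "__main__":
    # A pipe stands in for a socket in this illustration.
    read_end, write_end = os.pipe()
    wrapped = OnceClosedFd(read_end)
    print(wrapped.close())  # first call closes the descriptor
    print(wrapped.close())  # second call is a no-op
    os.close(write_end)
```

The descriptor number is invalidated before the actual close, so a repeated or concurrent call can never act on a number that the operating system has since handed to another connection.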
Problems fixed in IBM Spectrum Scale 4.2.3.4 [August 24, 2017]
- Avoid erroneous FSSTRUCT error in rare cases after a SG panic.
- Fix a problem in which ESS server node deadlocks with many threads showing 'wait for log buffers'.
- Fix a memory corruption issue that can occur during or after a reconnect.
- Fix a logAssert "!IsMemoryMappingFree" which is caused by a race between mmshutdown and 'tsctl nqStatus'.
- Fix a possible GPFS daemon assert that can occur while running the mmdelsnapshot command. The assert can happen when prefetch is reading from a snapshot that is being deleted.
- Fix an issue in the AFM ADR environment where a secondary mount failure causes a kernel crash.
- Ensure a copy of the keystone and auth config is created when the Object protocol is disabled and uninstalled from the protocol environment.
- Fix an issue in AFM environment where unresponsive remote mount causes synchronous operations like Reads to fail intermittently.
- Address a problem in resync/failover/changeSecondary where recreating a deleted file at home/secondary might cause an invalid memory access and crash the daemon.
- Fix a condition where cNFS on SLES12 or later fails to restart statd.
- Fix deadlock that can occur during inode cleanup in Linux kernel 3.13 and later.
- Address a problem where applyupdates should not be started until failback --start is completed successfully.
- Fix an incorrect assertion that can be triggered when the file system manager is brought down while running one of the following commands: mmrestripefs, mmdeldisk, or mmrpldisk.
- Fix an issue in the AFM environment where a fileset unlink or an unresponsive remote mount can cause a deadlock.
- Improve snapshot command error reporting when batching is used.
- Fix a problem with the GPFS file system metadata scanning function in IBM Spectrum Scale 4.2.3.0 - 4.2.3.3 which may result in file system data or metadata corruption on certain failures, such as running out of GPFS pagepool memory while running mmrestripefs, mmdeldisk, mmrpldisk, or mmadddisk -r.
- Return ESTALE when file referenced by FileHandle is deleted instead of ENOENT.
- Fix a kernel assert caused by missing buffer lock checking.
- Address a problem where applyUpdates generates operations on files/dirs that were removed from the primary but never replayed to the secondary, so that a later applyUpdates fails to pull such files/dirs back.
- Fix a problem in which GPFS was returning EBADF when Ganesha provided an fd which is not a GPFS fd.
- Fix an mmfsd daemon crash that occurs after upgrading the GNR code level and issuing mmchconfig release=LATEST.
- Fix an issue in the AFM environment where the mmgetstate or mmdiag commands cause a daemon crash if handlers are being disabled.
- Fix mmsetquota to work with a non-standard username.
- Fix a deadlock issue in the AFM environment that occurs when a new gateway node joins the cluster and takes over the fileset from an existing gateway node where the workload is running.
- Change in lockd behavior on SLES12 SP2 may cause it to reboot while recovering another cNFS node.
- Fix mmlsquota -u to work with a non-standard username.
- Fix for missing arping command path declaration for CentOS.
- Fix mmlscluster to correctly output the "Remote shell command" line.
- This fix reduces the snap processing time for clusters which have the Object protocol deployed with the Unified File and Object feature enabled.
- Fix a problem in which mmsmb exportacl list shows SID instead of user name for all users but the first.
- Fix a memory leak in GNR that may cause mmfsd heap memory usage to increase over time, particularly when the workload does many small writes. The problem occurs in an ESS or GSS environment.
- Fix a daemon crash in the AFM environment where a replication error or a fileset unlink causes a memory handler to be accessed incorrectly.
- Fix a bug where failure to execute the mmdevdiscover script caused all pdisks to temporarily lose their I/O paths. This caused the workload to pause while paths were recovered. Sometimes it caused the recovery group to fail over to the backup node. In a few instances, it resulted in the unmount of the file system, requiring manual intervention to restore service.
- This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063.
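The file handle item above (return ESTALE instead of ENOENT when the file referenced by a FileHandle is deleted) changes what a caller sees. The following is a minimal sketch of how a caller might tell the two errors apart, assuming the error surfaces as a Python `OSError`. The helper name `classify_open_error` is hypothetical, and which layer actually reports the errno (for example an NFS client) is an assumption.

```python
import errno


def classify_open_error(err):
    """Map an OSError from a file-handle operation to a short description."""
    if err.errno == errno.ESTALE:
        # The handle was valid once, but the file behind it is gone.
        return "stale handle: the file behind the handle was deleted"
    if err.errno == errno.ENOENT:
        # A path lookup failed; no handle was involved.
        return "not found: the path does not exist"
    return "other error"


if __name__ == "__main__":
    print(classify_open_error(OSError(errno.ESTALE, "Stale file handle")))
    print(classify_open_error(OSError(errno.ENOENT, "No such file or directory")))
```

Separating the two cases lets a caller drop a cached handle and look the name up again on ESTALE, rather than treating the condition as an ordinary missing path.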
Problems fixed in IBM Spectrum Scale 4.2.3.3 [July 27, 2017]
- Fix an issue in the AFM environment where lookup and metadata operations on the same file from different nodes cause a daemon assert.
- Allow a user to specify the afmrpo interval in weeks [W], days [D], or hours [H].
- Fix a fcntl revoke handler exception that can occur after an EIO error.
- Fix a pdisk firmware version attribute issue. After the pdisk firmware level is changed, mmlspdisk still shows the old version which may mislead the administrator.
- Change EIO to ESTALE for an open operation on a file that was deleted.
- Fix a race condition that causes command to fail with "invalid version on put".
- Don't allow kernel modules to cleanup when removing gpfs.gpl if gpfs.gplbin is currently installed.
- Fix an Assert: exp(dm != inv) L-813 in ../fs/fsop.C which can occur when trying to resend a read event.
- Fix an mmfsd crash caused by Signal 6 (abort). This can happen when removing socket connections in a CCR environment.
- Fix a very rare fault that can occur during heavy directory update workloads.
- Fix a problem in which mmchfirmware --type storage-enclosure fails if running in adminMode=central where only one node has ssh privileges. This fix applies to GSS/ESS customers.
- The mmlsmount command has been changed on all platforms. The change affects only the output format of the -Y argument, and only when an IPv6 address is used.
- This change does not allow mmchcluster -p LATEST on a CCR enabled cluster.
- Fix a problem in which unnecessary allocation manager cursors are being consumed.
- Fix an assert "ioStatsP == __null" that can occur when creating a file system with "-v yes" option, after the "fastest" readReplicaPolicy is enabled.
- Fix an err 112 rename failure that can occur during recovery for an IW fileset.
- Fix an AIX kernel crash due to assert "freeing vnode not on gnode list".
- Fix an issue in the AFM environment where an unresponsive target causes a queue to be dropped during the attribute setting.
- Fix a problem where a verbsRdmaSend-enabled node sent excessive nsdMsgRdmaPrepare messages to an AIX node.
- Fix a replicas mismatch problem caused by mmrestripefs -b wrongly resetting the missupdate flag.
- Fix CCR and/or mmsdrcli-RPC request errors that can occur during authentication of the incoming socket connections.
- Address a problem where renames across directories do not reset the dirty bit, which over time leads to a long list of dirty directories, so that recovery on AFM filesets might take longer to scan.
- Fix an issue in the AFM environment where UIDs in ACLs are not remapped during replication over NSD protocol when UID remapping is enabled.
- When an AFM fileset is to be converted to a regular independent fileset, first check for incomplete dirs, uncached files and orphans. If any are found, inform the user to run prefetch before the conversion.
- Fix a condition where mmautoload may hang.
- Fix a mmlogsort failure that can occur when mmlogsort attempts to query the time zone information on a node that is down.
- Fix a gpfs.snap command failure in a sudo wrapper environment if the legacy log timestamp format is used.
- Fix a divide by zero problem, when running mmrestripefs, which is specific to a file system using directly attached disks only with no NSD servers defined.
- Fix a dynassert 'mmapFlushSXLock.isLockedShared' which may fail as a secondary failure while daemon is shutting down.
- Fix a problem in which a CES IP could not be removed from a node. This can occur when problems arise during a CES IP move or failover (e.g. network issues, CCR issues, quorum loss). Subsequent runs of mmcesnetworkmonitor did not fix this and the IP remained active on a node where it should not be.
- This fix adds more group memberships (up to 2048) on AIX.
- Fix a "freeSpace != __null" assert. This issue could only happen when doing file system rebalance after suspending some disks.
- Fix a problem in which an "Unable to create file in fileset" error is returned even if the inode limit is not reached. This is most likely to occur if the user fills up the fileset from a single node.
- Fix mmcommon test scpwrap.
- This update addresses the following APARs: IV97601 IV97676 IV97677 IV97678 IV97680 IV97681 IV97682 IV97683 IV97685 IV97693 IV97808 IV97836 IV98052 IV98053 IV98054 IV98058.
Problems fixed in IBM Spectrum Scale 4.2.3.2 [June 21, 2017]
- Address a problem where AFM recovery stalls on a read of an IW fileset when it waits to fetch the file from home after the recovery completes.
- Improve a conditional ccr update for CES IPs list file.
- Fix a problem that causes a RenameHandler long waiter. This can occur if PIT is in progress.
- Fix a Ganesha crash that can occur when the user enters a string which contains a colon in any mmnfs command that requires a client option or client list string.
- Fix an assert that can occur when changing gateway nodes to non-gateway nodes while operations are being performed on an IW fileset and then the non-gateway nodes are turned back into gateway nodes.
- Fix a rare deadlock that can occur between a thread handling mmap and a thread handling a memory map pagefault.
- Fix an E_STALE failure that can occur during a DMAPI dm_read_invis.
- Fix long waiters that can occur on a very busy system doing background snapshot deletion.
- Fix a case where GPFS skipped shrinking the last data block, which causes excess space to be consumed.
- Improve the mmsetquota error message that occurs when a block limit larger than 909T is specified in 'T' units.
- Fix an assert that can occur when mmcheckquota and mmrepquota are passed fileset ids from deleted filesets.
- Fix an issue in the AFM environment where afmHardMemThreshold configuration value is not honored and more memory is used than specified.
- Increase the wait time for commands to execute, before failing.
- Correct formatting of large call counts reported by "mmfsadm vfsstats".
- Fix long waiters that can occur after a file system panic on a very busy system.
- Fix a problem in which inodes become Busy after unmount with NFS and immutable files.
- mmkeyserv: Make it possible to set certain attributes to the default with the use of "delete" or "default" keyword.
- Fix a problem in which mkdir, creates, and resync can fail during revalidation from cache/primary to home/secondary in newer kernels 3.18 or later.
- Fix a problem in which GPFS can not handle errors that occur when a DM application was unable to retrieve data due to offline tape.
- Suppress repeated message "Expanded ... inode space N from X to Y inodes" in mmfs.log.
- Fix a rare quota management deadlock caused by error conditions such as out of disk space.
- Fix an issue in the AFM+HSM environment where resync/failover/changeSecondary commands fail to replicate migrated files.
- Fix an issue in the AFM environment where a fileset force unlink could cause the daemon to crash.
- Address a problem where a gateway node can assert/crash when having more than 1024 active fileset operations occurring across different filesets on a single gateway node.
- Fix a problem in which gpfs.snap may not gather mmfs logs on AIX nodes correctly.
- Fix a clone parent file deletion performance issue.
- Fix a problem in which fsetxattr failed with ENOENT using a fd of an unlinked file.
- Fix the Assert exp(fileLockHeld != LkObj::nl) in fetchBufferM() that can occur when compression is being used.
- Fix a problem where DMAPI invis read/write fails with an err 22 when calling from non session node.
- Fix a problem in which the mmnetverify command does not correctly verify remote addresses when running many tests in parallel.
- Fix a deadlock that is very rare and can occur after running snapshot commands.
- Fix a policy problem which causes the LOWDISKSPACE callback to not trigger after a fs manager takeover when the old fs manager fails because of an abort or a lost connection.
- This update addresses the following APARs: IV96355 IV96416 IV96417 IV96418 IV96419 IV96420 IV96425 IV96426 IV96429 IV96472 IV96473 IV96474 IV96476 IV96482 IV96483 IV96487 IV96488 IV96585 IV96761 IV96762 IV96763 IV96764 IV96783 IV96786 IV96791.
Problems fixed in IBM Spectrum Scale 4.2.3.1 [May 16, 2017]
- Fix a Ganesha crash caused by an applyUpdate.
- Fix a ccrio initialization failure (err 811) when changing the daemon-interface.
- Fix a rare segmentation fault in the mmgetstatus command.
- Fix a SIGBUS error that can occur during a mmap read on a snapshot file.
- Fix a problem in which a flood of "failed to scrub vdisk" log messages is seen when a GNR node experiences quorum loss. This is for ESS/GSS.
- Fix a rare race between unlink, lookup and token revoke which causes a kernel crash in d_revalidate.
- This fix makes sure that Ganesha requests reference a valid GPFS file system.
- Fix a system hang that can occur when a file system is suspended while doing a mmap.
- This fix immediately rejects unreasonably large requests to preallocate inodes with ENOSPC.
- Fix a directory rename issue with IW filesets that can occur if the rename target is an existing directory.
- Fix a fault that can occur when restripe runs while the SG is not mounted on all NSD nodes.
- This fix restricts the afmMaxParallelRecoveries config value to the range 0 to 128.
- This fix removes the unnecessary error message "cannot open /proc/net/tcp6" when shutting down GPFS.
- Fix a problem with not properly handling quotas in an AFM environment. This can occur when you have very large hard and soft limit values.
- Fix a "exp(!sgP->isSGMgr())" assert that can occur when you delete a file system and then create a new file system with the same name at the same time.
- Fix an err 112 that can show up in the mmfs logs when mmchnode --gateway is executed.
- Fix a kernel crash that can occur while attempting to mount a loop device to a corresponding file in a GPFS file system or while using a GPFS file system file as a LIO backend.
- Address a problem where applyUpdates continues to run even if the fileset at the old primary is unlinked or the mmfs daemon has been shutdown.
- Fix an outband resync failure that can occur if a recovery is triggered by deleting some files in a directory and the directory itself. This is an AFM/DR environment.
- Fix rename conflicts that can occur in SW/DR filesets.
- Update log code to prevent a log recovery error when the log file becomes illReplicated. This can happen on a file system with -K set to NO when there is not enough disk space for full replication.
- This fix uses a new interface that reduces the repeated retries that occur every time a lock is freed while multiple waiters are waiting for the lock.
- Fix an assert that can occur with a DR fileset when the file system is suspended.
- Fix a bug that requires a large amount of free space in /var/mmfs to run change commands.
- Fix recovery failure err 17 when psnap0 deletion fails.
- Fix a daemon assert that can occur in an AFM environment where the mmfsd daemon fails to start repeatedly with a DMAPI enabled filesystem at a gateway node.
- Address a problem where trying to queue a writeSplit message to the helper gateway's queue can fail with an error 28 (E_NOSPC).
- Fix an issue which returns EACCES (errno = 13) while running mmapplypolicy when there is a mounted NFS file system which has the same name as a GPFS file system.
- Fix a fastpath optimization defect that can result in an internal error being returned to the user when it is safe to continue without entering the fast path.
- Fix a problem in which mmapplypolicy/tspolicy hangs after otherwise finishing all work.
- cNFS: fix a problem with /usr/sbin/rpcinfo not found in SLES12 or later.
- Fix a failure in Object Authentication configuration with Active Directory or LDAP. This fix is only required if Object is being configured with Active Directory or LDAP and the DN of the Swift service user (specified in --ks-swift-user) is longer than 79 characters.
- Fix a problem with ESS disk replacement in which the mmchcarrier command may wipe out the pdisk location code. The problem prevents the subsequent mmchcarrier command from proceeding without a valid location code.
- Fix a problem in which a GPFS command may wrongly terminate another process.
- Fix a rare deadlock problem caused by stream write (enableRepWriteStream=yes).
- Update log recovery code to avoid GPFS daemon assert after detecting invalid directory block during log recovery. Code has been changed to log a FSSTRUCT error and fail the log recovery so offline mmfsck can be run on the file system.
- Fix an mmfsd crash (incompleteOk assertion) that occurs when the number of files in the committed directory doesn't match the number of files in CCR's file list during a new CCR file update request.
- This update addresses the following APARs: IV94991 IV94992 IV94994 IV94995 IV94996 IV94997 IV94998 IV95015 IV95021 IV95230 IV95557 IV95643 IV95925 IV96037 IV96163.