Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 4.1.x applies for all supported platforms.
Problems fixed in IBM Spectrum Scale 4.1.1.21 [September 6, 2018]
- Fix deadlock, waiters look like this: PcacheHandlerThread: on ThCond 0x7FE7DC4ACD70 (MsgRecordCondvar), reason 'RPC wait' for pcacheMsgOpen on node. This is AFM. IJ08065
- Update logging code to prevent possible long waiters where thread can get stuck waiting on 'force wait on active buffer to become stable'. This can happen if file system panic occurs while a thread is actively appending records to log file. IJ08065
- Fix a mmlsquota endless loop. This can happen if the command is entered with a syntax error like: /usr/lpp/mmfs/bin/mmlsquota -u --block-size auto data and you are using a local LDAP name service. IJ08097
- Fix assert exp(secSendCoalBuf != __null && secSendCoalBufLen > 0) that can occur while doing secure sending. IJ08065
- Fix an issue in AFM environment where gateway node crashes due to the race between threads doing lookup and revalidation of files in the same directory. IJ08098
- Fix Assertion `thisSnap.isSnapOkay() || thisSnap.isSnapEmptying() || thisSnap.getSnapId() == sgP->getEaUpgradeSnapId()' that can occur when closing a file in a snapshot that is being deleted. IJ08103
- This fix introduces a second default encryption configuration string to improve performance. IJ08065
- Fix a deadlock between a thread that is closing a file and a filesystem quiesce. The waiters look like this: PIT_workerThread: on ThCond 0x18017923400 (MailboxReplyCondvar), reason 'waiting for mailbox reply' and CloseHandlerThread: on ThCond 0x18017D83220 (MailboxReplyCondvar), reason 'waiting for mailbox reply' IJ08065
- Improve the error message, for debugging purposes only, that is issued when a deleted node is detected to be still up. When a node is detected as belonging to two clusters, the fix asserts the node from the cluster manager instead of shutting down the local node that detected the moved node. IJ08387
- Fix PANIC: "BUG: unable to handle kernel paging request" in gpfsCleanup. IJ08065
- Fix "Assert exp(!synchedStale)" that can happen during an access of a compressed file. IJ08065
- Fix a problem in which mmtrace self starts. This can happen if CCR is disabled, adminMode=allToAll and mmsdrservPort=0. IJ08096
- Enhance AFM to be able to migrate the EAs/ACLs from read-only exports at home in AFM local-updates and read-only fileset modes. IJ08099
- Fix a rare case logAssert "Assert:(indIndex & 0xFF00000000000000ULL)==0 IndDesc.h" which can happen when writing beyond EOF of a file that has many EA entries. IJ08065
- Fix a problem in which mmcrfs --profile returns error when both defaultMetadataReplicas and maxMetadataReplicas are specified in the profile. IJ08100
- Fix translation of POSIX ACLs applied on a GPFS Unix node to access permissions on a GPFS Windows node. IJ08101
- Increase the maximum supported number of extra IP addresses to 64. IJ08102
- Fix an issue in the AFM environment where, if the root user has a supplementary GID greater than 42949676, replication might fail and messages are requeued. IJ08483
- Fix a problem in which the file size could be updated incorrectly, causing the file size to be smaller by 1 block than it should be. This could happen during a node failure. IJ08388
- Fix an issue in the AFM environment where AFM does not set the cached bit on small files even though the file is fully cached. IJ08389
- Windows ACLs of children folders and files could display incorrect inheritance flags. When a parent directory has both inherited as well as explicit ACLs set on it, a newly created folder/file under this parent will be correctly assigned any inheritable ACEs from the parent. However, the inheritance flags on these inherited ACEs could become inconsistent resulting in the Windows Explorer Security interface displaying wrong inheritance behavior. Also, the root directory of a GPFS drive is incorrectly allowing its ACLs to be changed. Trigger: Usage of complex ACLs in a deeply nested directory structure, wherein an intermediate parent folder has both inherited and explicit ACLs. Attempting ACL modification of GPFS root directory from a Windows node. Work Around: None. Symptom: Unexpected Results/Behavior. Platforms affected: Windows only. Functional Area affected: Windows ACLs/Inheritance. Customer Impact: Moderate/Suggested. IJ08573
- This update addresses the following APARs: IJ08065 IJ08096 IJ08097 IJ08098 IJ08099 IJ08100 IJ08101 IJ08102 IJ08103 IJ08387 IJ08388 IJ08389 IJ08573
Problems fixed in IBM Spectrum Scale 4.1.1.20 [June 1, 2018]
- Fix a log file migration assert that can happen when doing file system restripe operations or when adding/deleting/changing the file system disks. IJ05320
- Fix a problem in which an interrupted inode expansion may leave nAllocatedInodes inconsistent between the sg descriptor and the fileset metadata file. IJ05605
- Fix a deadlock issue caused by the allocation region requests handler. Users would see long waiters on the allocation manager cursors when the deadlock occurs. This can happen when file system is just about full. IJ05611
- Fix a problem reading files from a snapshot with mmap from AIX. IJ05320
- Fix an inode count leak problem which may happen when the gpfs_iwrite/gpfs_iwritex API fails with ESTALE. The tsrestorefileset utility uses this GPFS API. This can cause a kernel crash at cxiRefOSNodeInternal during unmounting. IJ05320
- Address a problem where the user cannot make changes to the afmTarget after creating the fileset with a wrong mapping name or host name in the afmTarget field. IJ05320
- Fix a race condition that can cause files not to be closed. IJ05606
- Fix a log assert which may happen during mmdelsnapshot if the file in the snapshot has DITTO xattr overflow block address. IJ05320
- Address a problem where applyUpdates generates operations on files/dirs that were removed from the primary but never played to the secondary, and a later applyUpdates fails to pull such files/dirs back. IJ05320
- Fix a problem in disk verification that wrongly calculated a disk stripe group descriptor checksum. IJ05351
- Fix a problem in which Fsck hits assert 'reaperThreadStared == 1'. IJ05320
- Fix assert exp(err==E_OK&&ibdP!=__null), at 7229 metadata.C which is caused by a SG panic not being handled. IJ05320
- This fix adds a debugging utility to calculate checksum values of on disk data. IJ05320
- Fix assert on Structure Error in ts/logger/Logger.C which can occur during a network reconnect. IJ05320
- Fix possible memory corruption problem when group quota information is retrieved by multiple clients concurrently. IJ05320
- Fix a gpfsReserveDelegation exception that is caused by a kworker returning a nfs4 lease. IJ05320
- Fix a problem with the snapshot copy on write code which allowed copy on write to be skipped. This results in a FSSTRUCT error. The problem could occur when multiple nodes write to the same empty file. IJ05320
- Fix an issue in AFM environment where a gateway node runs into a soft lockup issue with UID remapping enabled. This happens when you have cache and home running on different architectures. IJ05320
- Fix an issue in the AFM environment where ACLs are not updated properly in the cache with directory inheritance. This happens when a user does not have permission to update the ACLs. IJ05320
- This fix increases the default value of the socketMaxListenConnections configuration variable to 8192 on Linux. IJ05320
- When a file was memory mapped by a child process, GPFS skipped incrementing the mmap counter if it failed to verify the process credentials because the number of group IDs exceeded the limit. However, it still decremented the mmap counter at close time. This caused the node to crash. IJ05353
- Fix sample script filehist that may fail with divide by zero error. IJ05352
- Fix a Snapshot EA structure error Invalid XAttr overflow problem. IJ05320
- Fix a problem in which a failback command failed but has a 0 return code. IJ05320
- Fix a problem in which the mmfileid command cannot list small files, those that are smaller than one block. IJ05612
- Fix a rare case that truncate() does not set file size correctly. The file size is set to full block boundary incorrectly and the fragment is lost. IJ05607
- Fix a mmapplypolicy/tsapolicy core dump: ThreadThing::check mutexthings.C:170 which can occur during an improper recovery from helper failure during a directory scan. IJ05320
- Fix a problem in which a fileset failed to execute recovery leaving the fileset in needResync cache state. IJ05320
- Fix a problem in which a directory inside an IW fileset cannot fetch new changes to the directory made at its home counterpart. This can happen following a recovery or failover that has been run on the IW fileset at the cache site. IJ05320
- Fix an issue in the AFM environment where files are moved to .ptrash during the rename on an independent-writer mode fileset. IJ05320
- Fix a locking issue during prefetching of a directory block that can lead to a FSSTRUCT error. This could happen when there is a race between expanding the first directory block on one node and prefetching of the same block on another node. IJ05320
- Fix a problem in rapid repair that could cause a replica mismatch. This can occur when restripefs is run with an option like -r to migrate data off suspended disks. IJ05613
- Fix a problem in filesystem recovery where it left the filesystem in a state that might cause other filesystem cmds to hang. IJ05320
- Fix a rare timing assertion when the file system is force unmounted at the same time that quota files are being flushed to disk. IJ05320
- This fix ensures that server reachability is accurately reported for multiple servers with CES stack configured for LDAP authentication. IJ05759
- Fix a problem where on AIX a mmcrnsd call clears out the PVID that was assigned by the OS. IJ05355
- Fix code so that a mmfsd kill does not give IO errors to NFS clients. IJ05320
- Address an issue where recovery was stuck on the local cluster due to GW node changes on the remote cluster. IJ05320
- Fix a problem where, when the recovery policy fails with an error 2, the policy needs to be rerun with a higher debug level. IJ05320
- Address a problem where recovery keeps failing with an error 2 because the AFM recovery script wasn't able to handle directory names in the fileset that had trailing spaces in them. IJ05760
- Fix an exception in AclDataFile::findAcl() during an expel. IJ05320
- Fix a problem where reading symbolic links pointing to nothing at home can cause an assert at the cache site. IJ05761
- Fix assert "exp(isAllocListChanging())" which may fail during SGPanic. IJ05320
- Fix a crash on a Gateway node that can occur while updating the policy attribute at home. IJ05763
- Fix the inode indirection level assert that can occur during the failure process of clone file creation. IJ05320
- Fix a problem where a psnap creation on a gateway node that also serves as the FS manager can deadlock when the fileset in question is in need of a recovery. IJ05320
- Fix a problem in which gpfs_igetattrs with 1M bufferSize fails with ENOMEM. IJ05765
- Fix mmbackup and mmimgbackup command failures that can occur when used with an IBM Spectrum Protect installation that includes an incompatible gskit library. IJ05762
- Fix a node crash in the mmap write code path that can occur if the daemon was shut down. IJ05320
- Fix a deadlock which happens between commands which change the cluster manager (like mmchmgr, mmexpelnode, mmchnode --nonquorum) and a quorum lost event. IJ05357
- Fix assert (client->state & RecLockMsgFree) != 0 which can occur during high fcntl load during a quorum loss. IJ05320
- Fix a mmfsd core dump which can occur when mmpmonSocket is receiving events. IJ05614
- Fix a false ENOENT error that can occur when operating on files in an AFM fileset. IJ05320
- Fix a deadlock scenario involving starting a disk at the time of recovery. IJ05320
- Fix a problem in which a disk could be marked down or the file system could be unmounted unexpectedly. This can occur when adding new disk paths for disks being used by the file system. IJ05358
- Fix a problem in which Spectrum Scale 4.2.3.7 cannot be started with the following log: /usr/lpp/mmfs/bin/runmmfs[336]: .[213]: loadKernelExt[674]: InsModWrapper[95]: eval: line 1: 18672: Memory fault. This occurs on SLES12 SP1 after upgrading the kernel to 3.12.74-60.64.82-default. IJ05280
- Fix a problem in which mmlspdisk prints an invalid extraneous line of text. This fix applies to GSS/ESS customers. IJ05482
- Fix log assert "Assert exp(totalSteps >= 0) in file workthread.C". It happens when running mmlsfileset -r or mmlsfileset -i command against a file system which has a huge inode number or lots of independent filesets. IJ05320
- Package gpfs.gskit is updated to version 8.0.50.86. IJ05628
- Fix a problem in which mmbackup would produce an empty shadow database file that only contained the file header. IJ05766
- Fix a problem in which the Failbacktoprimary command fails due to the latest RPO snapshot not matching the old primary. IJ05320
- Fix a problem in which if the TSM option SCROLLPROMPT is enabled, the output from dsmc commands will pause awaiting a stdin character such as a CR to continue output. This will stall any mmbackup progress since it expects dsmc to proceed without input. IJ05320
- Fix a deadlock waiting 1786.523031929 seconds, ProbeClusterThread: on ThCond reason 'waiting for SG cleanup'. This can occur in an AFM environment when multiple threads are doing AFM initialization at the same time. IJ05320
- Fix a structure error that can occur when getting block disk addresses of snapshot files. This issue could only happen when GPFS APIs are being used to access the file data. IJ05320
- Fix an issue in the AFM environment where usage of functions dm_read_invis() and dm_write_invis() on AFM filesets can result in data corruption. IJ05320
- Fix an issue in the AFM environment where AFM reads the file from home incorrectly if the data replication factor at cache is greater than one. IJ05320
- Fix mmimgbackup assert problem with symbolic link when full pathname length is 1023 bytes IJ06225.
- Fix a problem in which DMAPI dm_read_invis failed with E_STALE. This can occur during file migrations and recalls when files are being deleted. IJ05320
- Fix the issue that nodes are being expelled while there are multiple network reconnects occurring. IJ05320
- Fix a deadlock that can occur when a file system has rapid repair enabled, there are down disks, and there are replicas saved on the down disks. IJ05942
- Fix an issue in the AFM environment where AFM reads the file from home incorrectly if the data replication factor at cache is greater than one. IJ06264
- This update addresses the following APARs: IJ05280 IJ05320 IJ05350 IJ05351 IJ05352 IJ05353 IJ05355 IJ05357 IJ05358 IJ05482 IJ05605 IJ05606 IJ05607 IJ05611 IJ05612 IJ05613 IJ05614 IJ05628 IJ05759 IJ05760 IJ05761 IJ05762 IJ05763 IJ05765 IJ05766 IJ05942 IJ06225 IJ06264.
Problems fixed in IBM Spectrum Scale 4.1.1.19 [February 24, 2018]
- Fix a rare assert that can occur during metanode takeover due to a stalled indirect block left in the cache IJ02107.
- Fix a vectored DIO (writev/readv) deadlock which may happen if the filesystem is being quiesced IJ02107.
- Avoid an assert when xattrs are heavily used with an unusual block size setting IJ02107.
- Fix a problem in which mmgetstate -s may not display the correct number of quorum nodes defined in the cluster IJ02294.
- Fix an AIX/NFS server deadlock that can occur when trying to recover a client's fcntl lock following the loss of another node in the same cluster IJ02295.
- Address a problem where changing the backend from NFS to GPFS (or vice versa) can cause bad filehandle errors IJ02107.
- Update directory code to avoid excessive recursion that could lead to stack overflow. The stack overflow could cause the GPFS daemon to either crash with a Signal 11 or get stuck in a signal handler IJ02296.
- Fix code to avoid a potential GPFS daemon assert when the inode0 file grows to more than 4B blocks IJ02299.
- Fix the high CPU usage issue on Windows due to busy loop in receiver thread when there are some network errors IJ02298.
- This fix corrects an output error from the mmlspdisk command. This fix applies to GSS/ESS customers IJ02364.
- Fix a fcntl performance issue IJ03082.
- Fix a node crash that can occur when running mmbackup on a filesystem created in big endian format from a node of a different endianness, or vice versa IJ03302.
- Fix the potential data loss when archiving the data-in-inode file with the tar -S command IJ03078.
- Fix node hangs due to consumption of DMAPI event mailboxes IJ03079.
- Fix a problem in which lseek SEEK_DATA and SEEK_HOLE returns a wrong offset. Not AIX IJ03312.
- This update addresses the following APARs: IJ02107 IJ02294 IJ02295 IJ02296 IJ02298 IJ02299 IJ02364 IJ03078 IJ03079 IJ03082 IJ03302 IJ03312 IJ03366.
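The lseek SEEK_DATA and SEEK_HOLE entry in this release concerns the offsets those calls return. As a rough way to inspect them from user space, the sketch below walks a file and lists the byte ranges the file system reports as data. It is not GPFS code: the function name data_extents is hypothetical, and it assumes a Linux node whose Python build exposes os.SEEK_DATA and os.SEEK_HOLE.

```python
import errno
import os


def data_extents(path):
    """Return [(start, end), ...] byte ranges that lseek reports as data."""
    extents = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # Next offset >= offset that contains data.
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno == errno.ENXIO:  # only a hole remains up to EOF
                    break
                raise
            # First hole at or after start (EOF counts as an implicit hole).
            end = os.lseek(fd, start, os.SEEK_HOLE)
            if end <= start:  # defensive guard against a non-advancing offset
                break
            extents.append((start, end))
            offset = end
    finally:
        os.close(fd)
    return extents
```

For a file with no holes, the result is a single range covering the whole file; for sparse files, the ranges depend on how the underlying file system tracks allocation.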
Problems fixed in IBM Spectrum Scale 4.1.1.18 [November 9, 2017]
- Fix a problem where the failbackToPrimary --start command tries to delete any snapshots later than the latest snapshot present at the old primary end. When that snapshot is not present at the acting primary, the error is now handled so that the failback command can continue IJ00573.
- Fix the use count leak on the stripe group to resolve the stripe group cleanup pending issue IJ00573.
- Fix the assert (ibP->ibGeneration == ofP->metadata.getGeneration()) that could occur when flushing or writing indirect blocks. This issue only happens when clone files are used and deleted IJ00573.
- Fix the assert (!hasTokens()|(cacheObjFlags & 0x04)) which can occur during a race between release last hold and lookup thread IJ00573.
- Fix a log assert when a Windows node is added into a cluster that has an encrypted fs IJ00573.
- Address a problem where the first recovery on an AFM fileset fails, and the follow-up recovery is not able to detect a few operations and hence ends up in an out of sync relationship IJ00573.
- Fix a no space issue that can occur when running the mmchdisk start command. A similar issue could happen on normal writes IJ00851.
- Add more provision to catch a case where a Queue item is becoming NULL when IO is happening to the fileset and the queue is being flushed IJ00573.
- Fix apparent deadlock with "dump waiters" showing 'waiting for exclusive ThSXLock for readers to finish' IJ00573.
- Fix an issue in the AFM environment where incorrect entries in the prefetch list file (for example, . and ..) cause directory block corruption because AFM permits a file named '.' to be created without validation of the input IJ00662.
- Correct the %filesetName that is passed to the callback command for the usageUnderSoftQuota callback event. This only affects FILESET quota types IJ00663.
- Apply this fix if mmapplypolicy occasionally hangs during directory scanning in your environment. The underlying bug manifested only rarely, but may be more likely to occur on certain Linux releases and highly multi-cored processors IJ00573.
- Fix quota usage accounting, in a file system with strict allocation "whenpossible", when not all data replicas can be allocated due to lack of space or failure groups IJ00664.
- Fix the issue that resending nsdMsgWriteExt RPC after a network reconnect function may result in a daemon crash with 'logAssertFailed: !"Request and queue size mismatch"' IJ00451.
- Address a problem where re-applyupdates should not invoke failbacktoprimary --start when failbacktoprimary --stop has failed due to changes being detected at the acting primary IJ00573.
- Fix assert "'false' failed" in paxosserver.C:3129 in the GPFS daemon (CCR) happening during GPFS startup IJ00573.
- Fix a problem in which the Prefetch command did not work on a special character (--) named file IJ00665.
- Fix a log recovery problem that could cause log recovery to fail unexpectedly with error 217 when disk replica fails IJ00697.
- This update addresses the following APARs: IJ00451 IJ00573 IJ00662 IJ00663 IJ00664 IJ00665 IJ00697 IJ00851.
Problems fixed in IBM Spectrum Scale 4.1.1.17 [September 29, 2017]
- Fix an issue in the AFM environment where lookup and metadata operations performed on the same file from different nodes can cause the daemon to assert IV99670.
- Reduce the number of rpo miss error messages in mmfslog by dropping them exponentially. That is, only the 1st, 2nd, 4th, 8th, and so on are sent IV99168.
- Fix the memory corruption issue that occurs during/after a reconnect IV99259.
- Fix a logAssert "!IsMemoryMappingFree" which is caused by a race between mmshutdown and 'tsctl nqStatus' IV99258.
- Fix a mount failure caused by stale nsd information. This could happen when an nsd was deleted and created again, and a node's attempt to reread the disk configuration to update the nsd information failed because of network issues IV99672.
- Update prefetch code to prevent a possible GPFS daemon assert during the mmdelsnapshot command. The assert could happen when prefetch is reading from the snapshot that has been deleted IV99168.
- Fix an issue in the AFM ADR environment where secondary mount failure causes a kernel crash IV99168.
- Fix a problem in which fsetxattr did not work on a fd of an unlinked file IV99257.
- Fix a problem where a verbsRdmaSend enabled node sent excessive nsdMsgRdmaPrepare to an AIX node IV99669.
- Fix an issue in the AFM environment where UIDs in ACLs are not remapped during replication over NSD protocol when UID remapping is enabled IV99168.
- Address a problem in resync/failover/changeSecondary where while recreating a deleted file at home/secondary it might cause an invalid memory access and cause the daemon to crash IV99168.
- Fix a condition where cNFS on SLES12 or later fails to restart statd IV99168.
- Fix deadlock in inode cleanup in Linux kernel 3.13 and later IV99253.
- Fix an issue in the AFM environment where a fileset unlink or an unresponsive remote mount causes a deadlock IV99168.
- Fix a kernel assert caused by missing buffer lock checking IV99671.
- Fix a problem in which GPFS was returning EBADF when Ganesha provided an fd which is not a GPFS fd IV99168.
- Fix a problem in which mmsetquota did not work with a non-standard username IV99256.
- Fix a deadlock issue in the AFM environment when a new gateway node joins the cluster and takes over the fileset from an existing gateway node where the workload is running IV99168.
- Fix a problem in which mmlsquota -u did not work with a non-standard username IV99168.
- Address a problem where a bug in maintaining the ping thread to home causes two ping threads to exist and race each other, causing a debug Assert IV99168.
- Address a problem where changeSecondary seems to do a lssnapshot on a non-existent snapshot and print the output to the terminal IV99168.
- Fix a bug where failure to execute the mmdevdiscover script resulted in all pdisks to temporarily lose their I/O paths. This caused the workload to pause while paths were recovered. Sometimes it caused the recovery group to fail over to the backup node. In a few instances, it resulted in an unmount of the file system, requiring manual intervention to restore services IV99255.
- Fix an issue in the AFM environment where incorrect filtering under certain workloads causes the writes to be dropped. This causes the replication not to happen fully and causes the data mismatch between cache/primary and home/secondary IV99764.
- Fix a deadlock that can occur during inode expansion where GPFS tried to reacquire the inodeFileExpandMutex that was already acquired by the caller. This deadlock problem is prone to occur when migrating FS version 2.2 to 3.4 and above that supports multiple inode spaces IV99668.
- This update addresses the following APARs: IV99168 IV99253 IV99254 IV99255 IV99256 IV99257 IV99258 IV99259 IV99668 IV99669 IV99670 IV99671 IV99672 IV99764.
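The rpo miss message entry in this release describes exponential dropping: only the 1st, 2nd, 4th, 8th, and so on occurrence is sent. A minimal sketch of that rule is shown below. It only illustrates the arithmetic and is not the GPFS implementation; the function names should_log and logged_occurrences are hypothetical.

```python
def should_log(occurrence):
    """Return True when the 1-based occurrence count is a power of two."""
    # A positive integer is a power of two when exactly one bit is set.
    return occurrence > 0 and (occurrence & (occurrence - 1)) == 0


def logged_occurrences(total):
    """List which of the first `total` occurrences would be logged."""
    return [n for n in range(1, total + 1) if should_log(n)]


print(logged_occurrences(20))  # [1, 2, 4, 8, 16]
```

With this rule, the number of messages written grows with the logarithm of the number of misses rather than linearly.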
Problems fixed in IBM Spectrum Scale 4.1.1.16 [August 10, 2017]
- Fix an issue in AFM environment where read and eviction on same file causes deadlock.
- Allow a user to specify the afmrpo interval in weeks[W],days[D] or hours[H].
- Fix a deadlock that can occur when changeSecondary of a primary fileset is in progress.
- Fix a problem where a fileset can be left in an intermediate state after a changeSecondary fails, and this prevented a follow up changeSecondary to proceed.
- Fix an Assert ofP->mnodeStatusIs(0x2 OR 0x4) fs/xattr.C line 5925 that can result from a very rare race condition between mailbox handler node failures.
- Fix a rare race between unlink, lookup and token revoke which causes kernel crash in d_revalidate.
- Fix a problem in which mmapplypolicy either hangs or hiccups during directory scan.
- Change EIO to ESTALE for an open operation on a file that was deleted.
- Fix a problem in which a file size change is not committed to disk after a metanode takeover. When the non-metanode merged its local file size change with what the last metanode had previously committed to disk, the file size was updated without setting the inode dirty.
- Don't allow kernel modules to cleanup when removing gpfs.gpl if gpfs.gplbin is currently installed. This will prevent a system crash.
- Fix a signal 11 that can occur after changing gateway nodes to non gateway nodes and then creating an iw fileset and doing operations on it. Then change the non gateway nodes back to gateway nodes.
- The mmlsmount command has been changed on all platforms. The change affects only the output format of the -Y argument, and only when an IPv6 address is used.
- Fix an issue in filesystem quiesce path where background snapshot deletion could cause long waiters and results in quiesce timeouts.
- Fix a DBGASSERT((SGFilesetId)FMFRecNum <= FILESET_MAX_VALID) in getFMFRecord, called by findFilesetById which can occur if mmrepquota or mmcheckquota comes across an inode or quota entry with a negative fileset id.
- Fix log code to close a timing window that can cause long waiters and prevent file system cleanup after the file system fails with panic.
- Configuring AD + RFC2307 based authentication scheme for FILE protocols with AD domain names containing special characters ( eg: - _ % ) is now supported.
- Fix a problem in which, on gateway nodes with kernel versions 3.18 and later, Mkdir/Creates in resync cases can fail with a false E_EXIST.
- Fix an NFS endless loop that can occur if a DM application takes an error trying to retrieve data.
- Fix a rare quota management deadlock caused by error conditions such as out of disk space.
- Fix an issue in the AFM+HSM environment where resync/failover/changeSecondary commands fail to replicate migrated files.
- Fix an issue in AFM environment where fileset force unlink could cause daemon crash.
- Address a problem where a gateway node can assert/crash when having more than 1024 active fileset operations occur across different filesets on a single gateway node.
- Update code to disable an assert that can cause system crash on AIX.
- Fix an issue in the AFM environment where unresponsive target causes queue to be dropped during the attribute setting.
- Fix a replica mismatch problem caused by mmrestripefs -b wrongly resetting the missupdate flag.
- Fix a problem where DMAPI invis read/write fails with an err 22 when calling from a non session node.
- Fix a terminal error that can occur during heavy CCR operation.
- Address a problem where renames across directories do not reset the dirty bit which in future leads to a big list of dirty directories and hence recovery on AFM filesets might take longer to scan. Also a minor addon for changeSecondary/failover/resync to also reset dirty directories.
- The display of mmdiag lroc is fixed for devices whose capacity is greater than 4TB.
- Fix a condition where mmautoload may hang.
- Fix a dynassert 'mmapFlushSXLock.isLockedShared' which may fail as a secondary failure while daemon is shutting down.
- Fix a problem in which gpfs.snap wrongly returns an error when using the db2locssh remote shell, which prints its own failure message if the remote shell exits with a non-zero error code.
- Fix a problem in which you get an "Unable to create file in fileset" error even if the inode limit is not reached. This is most likely to occur if the user fills up the fileset from a single node.
- Fix an assert in ~LLOpenFile when closing a stripe group caused by mmrepquota logic calling EndUse before unlocking the filesets.
- Fix a race problem in gpfs mmap code path that blocked a ps cmd.
- Fix a deadlock in very rare race condition after running snapshot commands.
- Fix a signal 11 error that is caused by a race condition between mmadddisk and block allocation.
- Fix a rare deadlock when accessing snapshot file data.
- Fix a policy problem that causes the LOWDISKSPACE callback not to trigger after an fs manager takeover if the old fs manager fails because of an abort or a lost connection.
- Fix a rare deadlock between thread handling mmap and thread handling memory map pagefault.
- This update addresses the following APARs: IV96776 IV97318 IV97426 IV97427 IV97428 IV97429 IV97431 IV97432 IV97434 IV97514 IV97515 IV97522 IV97526 IV97527 IV97807 IV98029 IV98204 IV98255 IV98488 IV98489 IV98888.
Problems fixed in IBM Spectrum Scale 4.1.1.15 [June 8, 2017]
- Fix a Ganesha crash caused by an applyUpdate.
- Fix a SIGBUS error that can occur during a mmap read on a snapshot file.
- Fix a deadlock that can occur if mmchmgr is run while mmpmon is running.
- Fix a system hang that can occur when a file system is suspended while doing a mmap.
- Fix a kernel crash that can occur while attempting to mount a loop device to a corresponding file in a GPFS file system or while using a GPFS file system file as a LIO backend.
- Address a problem where applyUpdates continues to run even if the fileset at the old primary is unlinked or the mmfs daemon has been shutdown.
- Fix a daemon crash in AFM environment where creating a hardlink for the same file in different directories causes the daemon to crash. This crash happens if the file has more than 500 hardlinks.
- Address a problem where trying to queue a writeSplit message to the helper gateway's queue can fail with an error 28 (E_NOSPC).
- Install if you suffer from mmapplypolicy/tspolicy hanging after otherwise finishing all work.
- Fix a problem with handling of small Linux AIO writes that can cause data loss due to incorrect file size update. This could happen when failures occur (file system panic or node failure) after small AIO writes that increases the file size.
- cNFS: fix a problem with /usr/sbin/rpcinfo not found in SLES12 or later.
- Fix a problem in which a GPFS command may wrongly terminate another process.
- Address a problem where AFM is not able to replicate a preallocate command that is run at the primary site over to the secondary site.
- Fix a rare deadlock problem caused by stream write (enableRepWriteStream=yes).
- Increase the wait time for commands to execute, before failing.
- Fix a problem in which inodes become Busy after unmount with NFS and immutable files.
- Update log recovery code to avoid GPFS daemon assert after detecting invalid directory block during log recovery. Code has been changed to log a FSSTRUCT error and fail the log recovery so offline mmfsck can be run on the file system.
- This update addresses the following APARs: IV95743 IV95926 IV95927 IV95928 IV96285 IV96286 IV96287.
Problems fixed in IBM Spectrum Scale 4.1.1.14 [April 20, 2017]
- Fix an issue where mmlsdisk may get segfault when it receives a SIGTERM.
- Fix a deadlock that can occur when deleting a group of files on a gateway node in an AFM fileset while other AFM operations are in progress.
- Once the vfs open file (/proc/sys/fs/file-max) limit is reached, vfs does not allow you to open a new file. After that the node freezes and it needs to be restarted.
- Fix a gpfs cluster hang that can occur after a mmdiag --threads hangs.
- Fix a rare segmentation fault in mmgetstatus command.
- Fix a problem with the (mmprotocoltrace start smb -c) command. The problem is that include-related errors are showing up in /var/adm/ras/mmprotocoltrace.log.
- Fix a rare case assert "Assert exp(e == E_OK)" which happens while running the mmcrfs command.
- Fix a deadlock in the AFM environment where adding a new gateway node using mmchnode could cause a deadlock when IO is happening.
- Fix a problem in which temporary files that can be used for debug purposes are being lost during resync/changeSecondary/failover failures.
- Fix a problem in which "mmlsquota -g" fails to get gid. This can occur on Linux or AIX if there is a very long line in /etc/passwd or /etc/group.
- Fix a kernel crash that can occur while GPFS is handling the OPENHANDLE_GET_VERIFIER op for Ganesha.
- Fix a file system unmount caused by an SGPanic with error 301. This could happen when there are node failures during the new token manager appointment process.
- Fix a problem where fileset doesn't move out of disconnected state but the home has an Active NFS running.
- Fix a memory leak that can occur on a client file read when checksums are enabled.
- Fix a problem in which socket files are treated as directories which result in an error 20.
- Add support for prefetching empty directories in the AFM environment.
- Fix memory leak in mmfsd, on clusters with cipherList set to AUTHONLY, especially in large clusters or those where nodes join and leave the cluster often.
- Fix mmfsd crashes due to the 'committed == b.committed' assertion. This can occur when tiebreaker disks are in use and two concurrent CCR synods are running on two different nodes and one of them is of type leader update.
- This fix can reduce memory usage for quota on fsmgr. This can occur when customers have a large number of users and/or groups with default quota on.
- Fix a problem with performing Direct IO on GNR/ESS NSD server node that can lead to GPFS daemon crash with signal 11 and/or user data corruption. This only occurs when an application is performing DIO on an active GNR/ESS nsd server node.
- Fix a problem in which applyUpdate does not properly update files. This can occur when the applyUpdate process is killed in the middle of the update and then another applyUpdate is attempted.
- Fix a GPFS daemon assert that can occur when running online mmmigratefs --fastea on a file system with a snapshot that contains small data in inode files.
- Fix code to prevent a potential kernel crash when performing read/write on a very large file. This can occur when the number of prefetched buffers goes over 32767.
- Fix a problem on clusters with no security keys, in which gpfs.snap may create mmfs.cfg files in /var/mmfs/gen/ with no read permission for group and others. This may cause problems for subsequent commands, especially for non-root users.
- Fix a problem with not properly handling quotas in an AFM environment. This can occur when you have very large hard and soft limit values.
- Fix automount on SELinux enabled RHEL7.
- Fix a daemon assert that can occur during a race condition between file deletion and the mmdelfileset command.
- Fix a problem in which an allocated object crosses a segment. This can occur on AIX.
- This update addresses the following APARs: IV94431 IV94556 IV94557 IV94559 IV94560 IV94982 IV95028 IV95029 IV95030 IV95031 IV95032 IV95033 IV95034 IV95036 IV95037 IV95040 IV95163.
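Note on the prefetched-buffer fix above: the threshold it cites, 32767, is the largest value a signed 16-bit integer can hold. The release note does not say how the buffer count is stored, so the following is only a sketch under the assumption that a 16-bit signed field is involved. It shows what such a field reads back once the count passes 32767. The helper name as_int16 is hypothetical and is not part of GPFS.

```python
import ctypes

INT16_MAX = 32767  # largest value a signed 16-bit integer can hold


def as_int16(n: int) -> int:
    """Return n as it would read back from a signed 16-bit field (assumed type)."""
    # ctypes integer types do not check for overflow; they wrap around.
    return ctypes.c_int16(n).value


print(as_int16(INT16_MAX))      # 32767
print(as_int16(INT16_MAX + 1))  # -32768
```

A count that wraps to a negative number is the kind of value that can be misused as an index or size, which would be consistent with the crash described, although the note itself does not confirm the mechanism.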
Problems fixed in IBM Spectrum Scale 4.1.1.13 [March 3, 2017]
- Fix problems with data mismatch and snapshot mismatch between old primary and acting primary during an applyUpdates in the failbackToPrimary procedure.
- Fix a long wait on pcacheListMutex.
- Ensure gpfs.service is enabled after the upgrade on system running systemd version 219 or above.
- When a token manager fails or token manager list changes, GPFS will do token domain recovery. DOMAIN_RECOVERING status may cause token reset request being mishandled and leave a token in COPYSET status forever which makes subsequent requests on that token hang.
- Fix the multipath failure issue on RHEL 6.8, 7.2 and 7.3: "blk_cloned_rq_check_limits: over max size limit".
- Fix a problem in which mmbackup falsely claims skipped files.
- Fix a deadlock when trying to queue transfer to the old gateway that was serving the fileset, when the new gateway node for that fileset is running a recovery for the fileset.
- Fix a rare race condition which can lead to the assert (blockToExpand == ofP->metadata.getLastDataBlock()) in expandLastBlock.
- Fix multiple issues in AFM environment to stabilize replication. Provided option to disable automatic resync.
- Fix a problem in convertToPrimary command where if the command is run with --secondary-snapname option, it reports that the fileset is left in primInitFail state (which should be the case since the secondary will be in acting primary state).
- Fix an E_ACCESS error when trying to write from a CIFS client on SMB exports.
- Fix a problem in which mmrepquota writes many unused quota entries.
- Fix a problem in which the wrong disk size is reported during mmadddisk.
- This fix applies to GSS/ESS customers. Fix a problem in which mmchcarrier --replace fails trying to update firmware.
- Fix a problem in which mmgetstate reports the wrong status.
- This update addresses the following APARs: IV00001 IV00002 IV92216 IV92959 IV92974.
Problems fixed in IBM Spectrum Scale 4.1.1.12 [January 19, 2017]
- CNFS: fix recursive calls during shutdown which may cause LOGASSERT.
- In rare situations, registered quota files are accidentally deallocated or corrupted, which prevents the GPFS file system from mounting. With this fix, a new quota file is populated in this situation and the file system mounts normally.
- Warnings are printed when TLS certificates are used to secure node-to-node connections.
- Fix a race condition that could leave a leftover lock that may hang mmcommand.
- Fix data corruption that can occur writing large files using parallel IO and multiple gateway nodes.
- Fix a "RPOName is not valid" error for SW/IW fileset recovery. This is an AFM specific change.
- Fix snapshot restore when building restore operation lists. This issue only happens when there are files with inode numbers bigger than the maximum value a 32-bit integer can hold.
- Fix "cryptographic library could not be initialized" when changing cipherList in a P8LE environment.
- Fix a problem in which you can get file or directory mismatches between app nodes and gateway nodes.
- Fix a very rare race condition that can lead to a kernel panic. This issue could only occur on a Linux cluster when a mix of AIO and buffered IO is being used to read and write to the same file from multiple nodes.
- Fix a problem with online replica compare code that could lead to GPFS daemon assert when online replica compare is invoked concurrently with command to restart down disk via mmchdisk with start option.
- Fix a problem in which mmbackup returns the wrong number of objects handled. This can occur if NUMBERFORMAT is set incorrectly.
- Fix Kernel BUG: illegal operation locks_wake_up_blocks+0x6c.
- mmlsconfig: may not return correct value due to stale cache.
- Fix an AFM error 17 that can occur during a rename operation on an AFM fileset.
- Fix a remote error 17 creating and deleting hardlinks to files during log recovery.
- This modification changes neither the functionality nor the appearance of GPFS. It improves the speed of the code in certain disk operations and introduces a mechanism (explicitly distinguishing user-space and kernel-space objects) that can be used for implementing other critical parts on the s390x platform.
- Fix mmlsquota -j returning a wrong answer that can occur if there is a special character in the name of the stripe group.
- Fix AIX encryption performance.
- Fix a problem in mmedquota that can occur if there are lines in /etc/group or /etc/passwd that are longer than 200 characters.
- Fix a problem in which mmbackup with -B value > 32768 causes missed files.
- Fix a problem in which gpfs_getacl returns ENOSPC. This can occur when the acl length exceeds the size of the buffer provided.
- Fix a problem in metanode optimization that can occur during directory lookup.
- Fix write timeouts that can occur during very large writes occurring from multiple gateway nodes.
- Fix a problem in which tslsenclslot logs a large number of error messages stating that it failed to gather information. This fix is required for all platforms. The error condition occurs under heavy loads only. The cluster continues to operate correctly without this fix.
- Fix: "Failed to obtain the local environment update lock" error.
- Fix a 112 write error which can occur during a failbackToPrimary in a DR setup.
- Fix a gateway node crash that can occur during calls to gpfsReadAfmDRLastRPOSnapName. This is zlinux only.
- This update addresses the following APARs: IV89895 IV90403 IV91586 IV91587 IV91589 IV91590 IV91599 IV91600.
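Note on the mmedquota fix above (lines in /etc/group or /etc/passwd longer than 200 characters): the release note does not state the cause. One common way a line-length limit like this bites is a reader that fetches a fixed number of characters per call, so that a long line comes back in pieces and the tail is treated as a separate record. The sketch below illustrates that general failure mode only. The buffer size, the function name read_records_fixed, and the sample group data are all hypothetical.

```python
import io

BUF_SIZE = 200  # hypothetical buffer size, chosen to match the limit in the note


def read_records_fixed(stream, size=BUF_SIZE):
    """Mimic a reader that fetches at most `size` characters per call (like C fgets)."""
    records = []
    while True:
        chunk = stream.readline(size)  # returns at most `size` characters
        if not chunk:
            break
        records.append(chunk.rstrip("\n"))
    return records


# A made-up group entry with 40 members: well over 200 characters.
members = ",".join(f"user{i:03d}" for i in range(40))
long_line = f"bigrp:x:5000:{members}\n"
data = long_line + "staff:x:50:alice\n"

records = read_records_fixed(io.StringIO(data))
print(len(records))  # 3: the long entry was split into two pieces
```

The second piece begins in the middle of the member list, so a parser that expects name:password:gid:members on every record would misread it.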
Problems fixed in IBM Spectrum Scale 4.1.1.11 [December 8, 2016]
- Fix FPO allocation code to prevent GPFS daemon assert that could occur when a locality group runs out of disk space.
- Remove a faulty Events Exporter assertion. The faulty assertion could cause the mmfsd daemon to fail in rare instances while generating an internal dump or responding to certain "mmfsadm saferdump" commands.
- Fix assert 'pStepReceiveP->phase == phase' that can occur when fsck is run in a tight loop.
- Fix assert exp(!ebP->replyDone) that can occur when DMAPI is being used.
- Fix slow offline fsck repair. Allow doDeferredDeletions() to clean up AFM pre-destroyed inodes in ex-AFM filesets.
- Fix resync policy to make sure files with setuid bit are selected to be synced to home.
- Fix an issue in AFM environment where HSM migrated files at home are incorrectly brought into cache as fully sparse files. This causes data mismatch between cache and home.
- Fix the issue that 'mmbuildgpl' attempts to run 'depmod' against the currently running kernel instead of the one inside the xCAT postinstall chroot.
- Fix Renames clogging the AFM queue during recovery for IW filesets.
- Fix a deadlock that can occur during high volume file creations and deletions in a multi cluster environment.
- Fix offline fsck hitting assert itemP in pArray.C
- Handle upgrade of Windows node from 3.5 and 4.1 to 4.2 correctly.
- Honor parent folder's DELETE_CHILD right during rename operation from Windows node.
- Enable offline fsck to report replica mismatches.
- Fix a bug where offline fsck did not check for other good directory block replicas if the first replica was corrupt. This fix prevents losing the directory block in such situations and prevents data loss when running offline fsck repair.
- Return the correct errno when the number of open files exceeds the maximum open file setting during open in GPFS API applications.
- Fix resync thread not getting killed by deadlock detection code if it takes a long time to complete.
- Fix an assert on builds with DBGASSERT when dm_get_dmattr is called on files in a .snapshot directory.
- Fix the "FSErrBadDittoAddr" file system struct error that can happen on a clone child file.
- This update addresses the following APARs: IV90574 IV90598 IV90606 IV90608 IV90620 IV90629 IV90679 IV91155 IV91157 IV91328.
Problems fixed in IBM Spectrum Scale 4.1.1.10 [October 11, 2016]
- Fix a problem with choosing only fully cached files for Cache Eviction.
- Fix a SIGSEGV problem when running mmdefrag on an FPO file system with trace enabled.
- Fix a remote mount failure that can happen if the cluster was previously added with incorrect contact nodes.
- Fix a race condition that could cause incorrect file sizes to be reported to the application. This only affects Linux, and only when the application checks the file size right after the file size is increased by writing past the end of the file.
- This fix will make sure the NFSv3 files are closed when using Ganesha.
- Fix a mmcrsnapshot command hang. This would only happen if a snapshot exists before and after the inode space expanded on a problem file system or fileset.
- Make GPFS work on Intel Broadwell CPU with kernel 3.7 or later.
- Fix an issue in AFM environment where fileset recovery causes the gateway node to crash if recovery detects that a symlink has attribute changes (for example, chown -h user:name).
- Fix a problem in which a recovery group failback to the primary node fails.
- This fix will make sure all different Ganesha up-calls are not dropped.
- When a large number of users and groups are combined with many filesets that have --perfileset-quota enabled, this fix can greatly improve the speed of mmcheckquota.
- Fix a problem in which fcntl lock waiters are not resumed when a Linux holder downgrades.
- If some physical disks in GPFS Native RAID (GNR) suffer a large number of timeouts (very slow IOs), they should be declared "slow", drained and scheduled for replacement. This defect may have prevented them from being declared slow, meaning they would continue to lower performance and risk outages.
- Fix the "Assert on Structure Error on FileMetadata::getSnapDataBlockDiskAddr" when reading a clone child file from a snapshot file system. This issue only occurs on clone child files and only when the data of the clone child file is modified in the root file system after creating the snapshot.
- Fix the punch hole failure error number 760 on clone child files. This issue would only occur on clone child files.
- Fix a deadlock in the AFM environment during queue transfer when new gateway node joins the cluster.
- Fix a rare assert "isSGPanicked" in the Asynchronous Direct I/O code path.
- Fix code to prevent a potential GPFS daemon assert during file system restripe.
- Reduce ThreadStateMutex hold time in deadlock detection and waiter related functions to make GPFS run more smoothly.
- Fix a minor problem by not sending a delete snapshot request to secondary.
- Fix a problem with removes on silly renamed NFS files.
- Fix the "Node name is not valid" failure for mmrestorefs command.
- Fix a problem with re-prefetching immutable/appendOnly files.
- Fix the way error codes are being decoded and converted to AFM range from GPFS or system range.
- Fix a problem in which AFM does not honor async delay if softQMem threshold is set to 0.
- This update addresses the following APARs: IV89295 IV89297 IV89732 IV88301 IV88678 IV89280 IV89283 IV89860.
Problems fixed in IBM Spectrum Scale 4.1.1.9 [August 30, 2016]
- Remove a race condition which can cause a certain IO counter not to get updated and thus block file system panic processing which results in a hang.
- Fix code for handling mmfsctl suspend-write operations to prevent blocking of a close operation after reading a file or directory. This issue only affects nodes running AIX.
- Fix a deadlock that can occur in the .snapshots directory of a fileset.
- Fix LROC related code to prevent possible GPFS daemon assert and false FSSTRUCT error.
- Correct missing entries in ".snapshots" directory of dependent fileset.
- Fix an assert that can occur during a Stripe Group panic.
- A new parameter has been added to the /var/mmfs/etc/expelnode callback script, which is invoked by the cluster manager when about to "expel" a node when one node cannot communicate with another. The parameter is set to "dryrun" when the script is not being invoked to make a decision on which node to expel. The parameter is set to "expel" when the exit value of the script is used to make the decision on which node to expel.
- Fix an assert that can occur during rapid repair.
- Fix an assert "fsdaP->getNValidAddrs() == nAllocated" which happens on FPO file system while doing fragment block allocation.
- Fix an assert that can occur during stripe group final cleanup while automated deadlock detection is active.
- Fix signal 11 caused by a very rare race between RG resign/relinquish and pdisk I/O activities.
- Fix race between fetch and log wrap thread that caused kernel assert.
- Fix a GNR server crash that can occur when a bunch of disks are being discovered.
- Avoid assert in fileset creation following an earlier failure due to low disk space.
- Allow GPFS on Windows to fetch IMU/RFC2307 mappings from a single alternate trusted domain.
- Fix an offline fsck repair assert that can occur when a replicated file system has a disk in 'removing refs' state.
- Fix an issue in AFM environment where stress on fileset might cause memory usage to reach afmHardMemThreshold and subsequent queue drop might cause deadlock with incoming messages.
- Fix code that can cause GPFS daemon to report structure error assert on a FPO file system while doing snapshot data copy on write.
- Fix code to prevent possible GPFS daemon assert when adding new block to low level system file. This is most likely to occur when creating a new independent fileset using mmcrfileset command.
- Fix a problem that could result in a long waiter on gpdMutex. This can occur on an unstable cluster in which many nodes are being expelled.
- Fix a deadlock that can occur when prefetch recovery is happening and applications are trying to access the files in the fileset.
- Add support to allow mmchattr --delete-attr to be able to remove gpfs.BGF, gpfs.WAD or gpfs.WADFG.
- Fix a problem that occurs when maxblocksize and scatterBufferSize are equal. The daemon asserts when the buffer memory usage is high.
- Fix an assert that was hit when running offline fsck for a file system that has one or more files with multiple trailing duplicate blocks.
- Fix informational messages from mmapplypolicy which contain "hit" counts for EXCLUDE rules, which were incorrectly reporting 0 when -N and -g options are used.
- Fix the clone parent files restore failure by revising the restore logic. This issue would only happen when the clone files need to be restored and the clone parent files have "immutable" attribute in restoring snapshot.
- Fix output for mmgetstate with -Y flag so that it matches GPFS documentation.
- Fix a problem in which GPFS commands may fail when public keys are expired.
- Fix a possible segmentation fault in offline fsck during multipass dir scan. Multipass dir scan happens when there is insufficient memory to hold the fsck data structures required for dir scan phase.
- mmexportfs: Fix an issue where the file system is exported but the output data file is missing.
- Fix the long waiter "InodeDeleteThread, 'waiting for XW lock'" caused by a self-deadlock issue, by correcting the release order of the GPFS file lock and the Linux inode. This issue occurs on Linux only.
- Fix an issue in AFM environment where prefetch recovery could deadlock with an already running management program (ex. create snapshot).
- Add undocumented config parameter dataCollectionPendingDelay which controls how long we try to preserve the state of the world while collecting expel debug data. The code path is also optimized in general.
- Fix an issue in AFM environment where cache might read newly updated file incorrectly from home if the file size is more than 128MB and home is enabled for AFM. This happens when metadata of newly updated file is not yet committed to disk.
- Fix a problem in which resync is not able to push the file at cache to home, if the same file is modified at home over gpfs backend.
- Update log recovery code to better handle certain invalid data that could cause GPFS daemon to die with Signal 11. This change will allow offline mmfsck to run and fix the problem.
- Fix a mmbackup --rebuild failure to recreate a shadow DB on AIX 7.2.
- Fix an assert that can occur during a nfs read error of sparse info of a file using ctl interface.
- Fix a file authentication failure for AD with RFC2307 based ID mapping scheme with an AD domain having "-" in its name.
- Prevent commands issued with the disk balance option from running if the number of NSDs is more than 31 and the total number of PIT worker threads is more than 31. Affected commands are mmrestripefs -b, mmadddisk -r and mmdeldisk -b.
- Fix an issue in AFM environment where application might get incorrect data if eviction and read is happening on same partially cached file.
- Fix a bug where mmchcluster may fail when run on a non-configure server to disable CCR.
- This fix will not show open files or directories that are renamed while they are open.
- This update addresses the following APARs: IV83274 IV83476 IV83899 IV85090 IV86157 IV87372 IV87385 IV87566 IV87567 IV87568 IV87569 IV87572 IV87573 IV87574 IV87601 IV87603 IV88299.
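Note on the disk balance restriction above: the condition under which mmrestripefs -b, mmadddisk -r and mmdeldisk -b are prevented from running is a conjunction of two thresholds, so either count on its own may exceed 31. A minimal sketch of that check as the release note words it follows. The function and parameter names are hypothetical and do not reflect how GPFS implements the check.

```python
LIMIT = 31  # threshold quoted in the release note


def balance_blocked(num_nsds: int, pit_worker_threads: int, limit: int = LIMIT) -> bool:
    """True when BOTH counts exceed the limit, i.e. the balance option is refused."""
    return num_nsds > limit and pit_worker_threads > limit


print(balance_blocked(40, 16))  # False: only the NSD count exceeds 31
print(balance_blocked(40, 48))  # True: both counts exceed 31
print(balance_blocked(31, 48))  # False: 31 is not "more than 31"
```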
Problems fixed in IBM Spectrum Scale 4.1.1.8 [July 10, 2016]
- Fix assert "hasLoggedUpdate()" in repUpdate.C, line 1284 under stress workload.
- Fix problem where stress workload doing appends to a small file could cause kernel panic due to illegal pointer dereference.
- Fix a problem in which all CPUs hung on a file_lock_lock spinlock.
- Fix a data corruption issue that can occur after a successful mmrestorefs command completion.
- Fix an issue of getting incorrect data from the gpfs_ireadx() API. This issue only happens when using the mmrestorefs command and the gpfs_ireadx() API at the same time. This can occur on AIX and Linux systems.
- Fix Linux kernel asserts BUG_ON(page_mapped(page)) for GPFS file mmap.
- This issue is specific to GNR environment. It happens when a vdisk dump is issued before the RG is fully recovered after a mmfsd startup.
- Fix a problem in which the gateway node crashes when unmounting the FS. When this occurs the gateway node has to force unmount the FS.
- Fix a segmentation fault that can occur during file system panic processing.
- Fix an issue in the AFM DR environment where file lookup might cause the daemon to assert when the filesystem is already quiesced or suspended.
- Fix corruption which can occur when hawc is enabled and node failure is happening.
- Fix mmap page fault performance regression.
- Fix an issue in the AFM environment during gateway node startup where DR fileset activation for RPO snapshots might cause a deadlock.
- Fix a problem in buffer flushes that can cause a stale data buffer to be used for reads and writes after a Linux AIO write request was processed via a buffered I/O. This can only occur with AIO on Linux that is using the "io_submit" interface.
- Fix an issue in the AFM environment where recovery, resync and prefetch operations can fail because of large number of files to be queued.
- This fix is an improvement for mmap read SMP scalability.
- This fix will try to force through log recovery even when all stripes of a log home vdisk are marked stale (logically unreadable) in the metadata. This will only occur when run under debug control. This applies to GSS & ESS installations.
- This fix improves failure response when running helper node with too small a pagepool.
- Fix deadlock that can occur while calling fcntl with argument F_SETLEASE on Ubuntu 14.04.03.
- This fix correctly handles a write failure in the rare case where the number of pdisk faults exceed the fault-tolerance of the vdisk. This is seen in GSS & ESS installations.
- This fix will not let GPFS internal return codes be returned to Ganesha, it will be converted to an EIO rc. This will prevent a Ganesha crash.
- Fix a restore failure at the file moving phase when there is a very long file name in the file system and fileset.
- Fix mmgetacl on Windows to show valid instead of random ACL flags.
- Fix a segmentation fault that can occur when running the mmsetquota command. This issue would only happen when GPFS overwrite tracing is enabled on Linux.
- Fix a problem in which too much data is dumped when collecting data for deadlocks and expels. This was causing performance issues.
- Fix an issue in the AFM environment where cached bit is not set on files after reading from the home. This issue happens when the file modification times are not in sync between cache and home.
- Fix a problem in which old tiebreaker disks cannot be removed from the system.
- Fix a bug in mmremote to allow the mmchconfig pagepool -i option to take effect immediately.
- Fix problem reading clone child via NFS fast read path.
- Fix a daemon crash that can occur while trying to execute a pcache command with maxThrottle set.
- Fix network communication problems that can occur when mmdiag --iohist and overload detection happen at the same time.
- Fix an alloc segment steal problem that can lead to more than 22 minutes of searching for a free buffer assert.
- Fix an issue in the AFM environment where filesets are moved to disconnected state because of a large number of filesets. This issue happens when socket descriptor values for the home connection exceed FD_SETSIZE (1024).
- Fix the random memory corruption and kernel crashes in the AFM environment which are likely to happen while deleting the non empty directory at home or secondary clusters.
- Fix for a spurious NSD RPC checksum error in GNR environments when processing a DIO workload with unstable IO buffers.
- GNR avoids reading from failing pdisks by trying to reconstruct using parity/mirror. If reconstruction is not possible, then as a last resort, GNR reads from the failing pdisk. This will result in far fewer IO errors.
- Fix a problem in which the wrong errno was returned from the dm_read_invis and dm_write_invis library functions in the failure case.
- This fix enables UDEV_SUPPORT on all distributions.
- Fix a problem in which make Autoconfig fails after installing IBM Spectrum Scale on the BlueGene IO node.
- Fix an assert on P7IH systems on which the recovery group was originally created under GPFS 3.4 when they try to upgrade to the current version. GSS and ESS customers are not affected by this change.
- Fix a mmdiscoverycomp failure that can occur if the cluster is configured to use different admin and daemon node names.
- Fix abnormal shutdown that can occur when trying to add back a node that has just been deleted.
- Fix a problem in QOS where skimperm bit calculation is incorrect when _skimf < 0.
- This update addresses the following APARs: IV83743 IV85083 IV85385 IV85409 IV85411 IV85418 IV85420 IV85421 IV85422 IV85426 IV85428 IV85429 IV85430 IV85432 IV85589 IV85590 IV85790 IV85862 IV85865 IV85866 IV86144 IV86153 IV86689 IV86701.
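Note on the FD_SETSIZE fix above: select()-style descriptor sets can only represent descriptor values below FD_SETSIZE, so a process with many open sockets can be handed a descriptor that select() cannot watch. The sketch below illustrates that general limit from Python on a POSIX system. It is not GPFS code, the helper name fits_in_select is hypothetical, and the value 5000 is just an arbitrary number above 1024.

```python
import os
import select


def fits_in_select(fd: int) -> bool:
    """Return False if select() rejects fd as out of range (fd >= FD_SETSIZE)."""
    try:
        select.select([fd], [], [], 0)  # zero timeout: check once, never block
    except ValueError:
        return False  # CPython on POSIX raises this for out-of-range descriptors
    except OSError:
        return True   # value is in range, but the descriptor is not open
    return True


read_end, write_end = os.pipe()
print(fits_in_select(read_end))  # True: a low-numbered, open descriptor
print(fits_in_select(5000))      # False where FD_SETSIZE is 1024
os.close(read_end)
os.close(write_end)
```

Interfaces such as poll() take a list of descriptors rather than a fixed-size bitmap and are not bound by this limit. Whether the fix switched interfaces is not stated in the note.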
Problems fixed in IBM Spectrum Scale 4.1.1.7 [May 11, 2016]
- Fix fsck duplicate fragment problem report to be in a neater tabular format.
- Fix a rare fsck deadlock that can occur during fsck termination.
- Fix a bug that prevented offline fsck from reporting replica mismatches.
- Enable fsck to detect and repair duplicate sub-directory entries in a directory.
- Fix a problem with restripefs -R which was incorrectly setting the currentDataReplicas of logfiles.
- Fix a deadlock that can occur when FPO is enabled and a node's local stripe is panicked.
- Fix possible deadlock which occurs when a node loses quorum (cluster membership) because of a network adapter or network outage.
- Fix fsck handling of a compressed disk address.
- Fix a problem in which online replica compare reports mismatches on the last block of an inode allocation map and a block allocation map file.
- Fix a hang that can happen when unmounting the filesystem.
- Fix a GPFS daemon abort that can occur when a GNR backup is performed and the server is down.
- Prevent GPFS daemon from asserting on Windows when collecting debug data on waiters.
- Fix a problem in which disabling cluster CCR left an authorized_ccr_keys file behind, which may cause a startup problem if cipherList and/or nistCompliance are changed.
- Fix an assert that can occur during online replica compare when the filesystem has different data / metadata buffer sizes.
- Fix a bug where offline fsck was not repairing some inode problems detected during dir scan phase.
- Fix a deadlock that can occur when NFS server/remote mount did not respond after doing AFM internal mount.
- Offline fsck in read-only mode will now warn about unavailable disks before scanning the file system.
- Allow mmchconfig to delete an empty nodeclass from the GPFS configuration node.
- Fix a daemon assert that can occur while doing prefetch reads along with readdir and lookup commands when the application nodes and the gateway nodes are the same.
- Fix a node crash that can occur during the recovery of another failed node if an EventsExporter "get nodes" request is issued at the same time.
- Fix an E_NOATTR link failure that can occur on a SW fileset while writing to a file and droppending and resync are being run.
- Fix autoload issues where GPFS may not come up on configure servers in a SERVER based cluster if files in /var/mmfs/gen/nodeFiles are missing.
- Fix a problem in which tsgescsiinfo reports invalid ESM information. This fix is required for all platforms. The condition seems to be SAS fabric related.
- Make a stuck tslsenclslot easier to diagnose.
- Change mmbackup behavior when policy scan fails. Permit operation in a reduced-capacity to do backup and not expire when directory scan results are incomplete. When this happens, no expirations should be processed, just backup. Shadow DB lines for removed files should be left alone.
- Fix an issue where CES clients fail to connect after failover on Juniper switches.
- Fix unexpected empty CES IP configuration file.
- Lift restrictions on -B, --max-backup-count, and --max-expire-count.
- The change ensures that the online replica compare does not throw false positive replica mismatches on files with last block being a sub-block.
- This fix disables QOS startup. Install at your convenience, and especially if you have tried unsupported QOS features on a 4.1.1 system. To use the QOS feature, upgrade to 4.2 or higher.
- Fix a kernel panic that can occur under a heavy write work load and a dying mmkproc thread.
- Fix a problem in which an amber disk "fault" light may remain on after temporary disk unavailability. The problem is specific to the 60 disk NetApp disk enclosure only.
- Fix a memory tracking issue in AFM environment where gateway node memory usage appears to grow without any real memory leak. This causes replication to stop.
- Fix an issue in AFM environment where random writes to the same file cause a memory leak after replication.
- Fix an issue in AFM environment where incorrect dependency causes resync to fail.
- If asynchronous NFS/NLM locking is used this fix will prevent potential kernel crash.
- Fix a problem in which modification of ACLs via mmputacl or equivalent can render the ACL as missing on a GPFS Windows node.
- This update addresses the following APARs: IV81870 IV81877 IV83264 IV83271 IV84206 IV84251 IV84252 IV84253 IV84254 IV84255 IV84270 IV84428 IV84573 IV84574 IV84576.
Problems fixed in IBM Spectrum Scale 4.1.1.6 [March 31, 2016]
- Fix for an assert caused by an NSD being deleted and then quickly recreated.
- Fix a cluster not being able to start when a node hosting LROC disks is not available.
- Fix a problem in which fsck incorrectly reports not enough memory available.
- Fix a problem in which fsck patchfile apply fails when it encounters a corrupted inode.
- Fix asserts on didEmpty and Signal 11 faults in delSnapshotEmpty that can occur during snapshot deletion.
- Fix AFM errors that can occur when writing to a large file during failover/resync.
- Fix a mmrestorefs assert which can occur during the delete clone file phase. The clone was left in a bad state during a force unlink of a fileset.
- Fix ENOENT failures that can occur during a snapshot restore and during iopen64 API calls.
- Fix a problem which may result in a daemon assert when running the mmcheckquota command and a snapshot is corrupted.
- Fix an assert exp(ibdP->llfileP == this) that can occur during an offline fsck.
- Fix a daemon assert: (poolId != ((SGPoolId) -1)) in line 683 of FSTypes.h. The daemon assert could occur during mmrestripefile or mmchattr with -I yes after a storage pool is deleted as part of running mmdeldisk with the -p or -c option.
- Fix fsck repair of inode fullblocks field.
- Fix fsck handling of corrupt inode filesetId.
- Fix a deadlock that can occur during a failover while a HSM application is running.
- Fix assert exp(synched.isNULL()) that can occur during a high work load on a LROC disk.
- Fix a problem in which the GPFS/gskit installation process or various mm* administration commands can fail if Windows OS environment variables are changed in such a way that they do not exactly match the Windows installation directory name.
- Fix a problem in which the primary RG server cannot take back the RG after pdisk paths are restored, for example after a cable pull.
- Fix a problem in AFM environment where prefetch overwrites dirty files in local updates mode.
- Fix an MD5sum mismatch in data after a resync operation, which can occur if a resync, a touch, and a write all happen at the same time.
- Fix a problem in which fsck wrongly reports holes in an ACL file.
- Fix a problem in AFM environment where large ACLs cannot be replicated because of a buffer allocation issue.
- Fix a problem where gpfs_getacl returns a bad ACL entry when called with the GPFS_GETACL_STRUCT flag and acl_level GPFS_ACL_LEVEL_V4FLAGS.
- Fix an E_ROFS write error that can occur when you write over a clone file and make it a clone parent and then run recovery.
- To prevent confusion in messages between GNR, GSS, ESS products, and the GPFS file system metadata, the word "metadata" was removed from all GNR errors and log messages.
- Fix a deadlock in AFM environment where peer snapshot creation could deadlock with synchronous messages (such as Lookup, Open, and Read). This can only occur if peer snapshots are enabled.
- Allow snapshots to be created while snapshots are being deleted.
- Fix a problem in AFM environment where replication would stop because of an error while replaying a Rename operation. The AFM queue would become stuck while replaying the Rename operation, and no new operations would be replicated.
- Fix an unexpected CES IP assignment and movement of CES nodes which are not ready to host CES IPs when the address distribution policy node-affinity is selected.
- Fix the deadlock in AFM environment where readdir results in deadlock under heavy stress over GPFS backend.
- Fix a gpfs.snap hang on an AFM node with stale NFS mounts.
- Fix an mmapplypolicy command failure that occurs when multiple commands are issued nearly simultaneously and tscCmdPortRange has been configured in a SONAS environment.
- Fix a problem which stops autorecovery from being triggered if a node which has only dataAndMetadata disks is down.
- Fix a problem in which a Windows client loses its view of ACLs in a mixed Linux cluster.
- This update addresses the following APARs: IV78971 IV81342 IV81344 IV81347 IV81686 IV81873 IV81879 IV82179 IV82181 IV82182 IV82184 IV82238 IV82610 IV82637 IV83046 IV83110.
Problems fixed in IBM Spectrum Scale 4.1.1.5 [February 16, 2016]
- Fix an issue that could cause the GPFS daemon to abnormally terminate or that could cause the reporting of incorrect performance data when GPFS SNMP subagent, mmpmon, or zimon are utilized.
- Fix the code to build remote attributes during recovery when there is a version mismatch.
- Fix the SFSLink failure that can occur when files are created during failbacks.
- The performance of the daemon has been improved in the cases where the cipherList is set to a value other than 'empty' or AUTHONLY.
- Fix a quiesce assert that can occur when files are being recovered.
- Fix a problem that can occur when accessing files with managed regions.
- Fix a problem that can occur during clone operations.
- Fix a problem in which file sizes were set incorrectly on sparse files during failback.
- Fix a problem that can occur on a live file system that results in a deleted file still existing or a created file not existing after a snapshot restore.
- Fix a crash in msgMgrThreadBody that can occur during unmounting and unlinking filesets on a very busy system.
- Fix a problem where mmchfs -z, -Q or --perfileset-quota may fail when multiple mmchfs commands are being performed at the same time.
- Fix a problem in which an incorrect vdisk state is displayed by the mmlsrecoverygroup command during a DA rebuild.
- Fix a problem in which a signal 11 in verbsDisconnect_i is seen on one node when GPFS is shut down on a different node. This problem can occur if the nodes are RDMA connected and are configured with a large fabnum value.
- Fix a problem with GPFS logging code that could cause GPFS daemon to die with signal 11. This problem can only occur on nodes with LROC enabled.
- Fix a problem that could cause an FSSTRUCT error to be logged when reading an EA from a disk. This could only occur when LROC is enabled and the EA does not fit in the inode.
- Fix a daemon assert that can occur during recovery.
- If a GNR system using GSS hardware uses Lenovo-branded disks, this change enables recognizing disk FRU (field replaceable unit) numbers. This simplifies service procedures, and allows disk replacement without error messages.
- Fix an assert that can occur during a fileset delete. perfileset-quota needs to be enabled and the fileset needs to have quota entries.
- Fix an assert that can occur after a small write of data in the middle of a clone child on a system that heavily uses clones.
- Fix a hang that can occur while resync is running on a SW fileset that reaches its hard memory limit.
- Fix a mmfsd daemon crash that can occur when Zimon is used to monitor the node, and a file system is force unmounted due to some unrecoverable file system error.
- Allow changing the daemon interface of a non-quorum node in a CCR enabled cluster.
- Fix a daemon assert that can occur when an NFS server is being stopped and a fileset with expiration enabled exists.
- Restrict the mmchcluster command from disabling CCR in a cluster that has a CES node. Administrator must remove all CES nodes from the cluster or use the --force option to disable CCR.
- Fix an incorrect fileset name being displayed by the mmlsfileset command. This can occur after deleting a dependent fileset when a snapshot exists that contains the fileset as it was before deletion.
- Fix a problem in which orphans cannot be deleted from the ptrash directory.
- Fix a GNR server node crash that can occur during a network failure trying to connect the GNR server pair.
- Fix an assert that can occur when adding pdisks with the --replace option to a cluster and one of the pdisks is in a bad state.
- Fix an assert that can occur during a snapshot restore of a sparse file with a file size close to the maximum file size limit.
- Fix an assert that can occur during a fsck recreate of an ACL file.
- Fix a mmbackup command failure that can occur on an AIX node when the command line arguments are too long.
- Fix a problem in which a fileset is stuck in an unmounted state that can occur if the remote becomes stale and both the application node and the gateway node are the same.
- Fix an assert that can occur during a multi-node fsck on a 16MB block size file system that has more than 16M inodes.
- Fix a node crash that can occur during a rolling upgrade.
- Fix a mmfsd node crash that can occur when NSDRaid is not enabled.
- If a system built on GNR/GSS/ESS servers has been getting IO errors on GPFS file systems (reported all the way to the end user application, not internal disk IO errors on individual physical disks), and those IO errors happened exactly at a time when some pdisks were unreachable (for example due to cabling or connectivity issues), and those pdisks would have been reachable from the backup node of the GNR server, then this fix will prevent the IO errors, by failing the recovery group containing the affected vdisk over to the backup node.
- Fix a problem that can occur in the mmbackup command when /tmp is full.
- Fix a problem in which mmaddnode fails to copy the committed key file to the new node. This only occurs on a CCR disabled cluster and if there are 2 key files.
- Fix quorum loss when the network is broken between two nodes and the cluster is configured with a tiebreaker disk.
- Fix a problem in which the hard memory limit is not honored when the fileset is in a disconnected state.
- Fix command failures in a CCR enabled cluster on nodes that also have non-GPFS gskit packages installed.
- This update addresses the following APARs: IV79340 IV79341 IV79751 IV79756 IV79761 IV79767 IV80404 IV80405 IV80407 IV80789 IV81068 IV81071.
Problems fixed in IBM Spectrum Scale 4.1.1.4 [December 15, 2015]
- Fix a problem counting the number of mmpmon clients; prevent improper double close of a file descriptor.
- Fix GNR AU log long waiters seen in SSD replacement.
- Fix a deadlock when GPFS writes to a memory mapped buffer and the same thread already holds a lock on it.
- Fix a failure that occurs when truncate(2) is used to extend a clone child file.
- Add gpdQuorumLossShutdown as one of the assert conditions.
- Fix a hang in AFM when writing a sparse file to home.
- Fix an issue in log code which can cause log recovery to be incorrectly skipped after a node failure. This could only occur on a 4K aligned file system where GPFS runs into a problem completing a log wrap operation.
- Fix the restore failure when restoring clone children files.
- Fix data mismatch on clone child file after restore.
- Fix data mismatch after restore on a regular file that is not a clone file.
- Fix log writebehind code to prevent writing log record to old disk address while log file is being migrated. This issue will show up as a log recovery error if a node fails shortly after a log record was written to a wrong location.
- Update log recovery code to set the junction bit when replaying the log to recover the directory of a newly created fileset. The missing junction bit can only be detected via offline fsck.
- Fix the failover/resync to support outband trucking.
- Fix the data inconsistency issue between cache and home during resync on appended files.
- Fix the restore failure that occurs in the attributes restore phase.
- Fix deadlock scenario that can occur when deleting a snapshot.
- Fix the ACL/EA mismatch during resync by considering ctime changed option.
- Fix a problem with an NFS backend in which GPFS ignores setattr(ATTR_MTIME_SET) if ATTR_MTIME is not also set, even though ATTR_MTIME_SET implies ATTR_MTIME.
- Fix code to avoid high CPU usage by the mmfsd process under Windows.
- Update locking code to prevent a GPFS daemon assert. The assert could happen when more than MaxFcntlRangesPerFile (default 200) advisory locks were placed on a single file.
- Fix a signal 11 that can occur when trying to delete a pdisk in the middle of an RG failover.
- Fix the dentry count leak by adding the code to call dput in error path.
- Fix out of quota errors that can occur on file systems with a format less than 1400.
- Fix the mtime mismatch between cache and home for zero sized files by copying mtime from openfile to child attributes.
- Apply this fix if you use -B number with number > 2**31-1 in any of your commands or scripts.
- Fix is recommended for all GNR (ESS/GSS) customers. The problem could occur in the event of an actual disk enclosure failure.
- With this feature, a 4K native disk can be added to an existing non-4K aligned file system if the disk is used as dataOnly, the file system data block size is at least 128K, and the file system version is at least 4.1.1.4.
- Allow prefetch to continue if the parent cannot be found for some files.
- Fix the memory mapped read performance issue on AFM filesets.
- Fix the mmrestorefs[479] : daemon command memory fault issue.
- Fix a problem with copying key files in mmsdrrestore where the node that is being restored does not have promptless password access to the issuing node.
- Fix the case where the ESS storage enclosure slot location that is cached in the daemon can get stale and is not getting updated.
- Do not allow the AioWorkerThread to steal a dirty buffer. This prevents a deadlock.
- Fix the mmdiscovercomp command that is failing with "Constraint error" when trying to add servers to the component database.
- Fix code to avoid quorum loss declaration of the current cluster manager, when the network is broken between two nodes.
- Fix the fileset unlink hang by closing the control file before calling unmount.
- If a system built on GNR/GSS/ESS servers has been getting IO errors on GPFS file systems (reported all the way to the end user application, not internal disk IO errors on individual physical disks), and those IO errors happened exactly at a time when some pdisks were unreachable (for example due to cabling or connectivity issues), and those pdisks would have been reachable from the backup node of the GNR server, then this fix will prevent the IO errors, by failing the recovery group containing the affected vdisk over to the backup node.
- Add code to flush data buffers first before setting cached bit.
- Fix the path to the Linux modprobe command that the mmchfirmware command uses when --type adapter is specified.
- Starting with 4.1.1, GPFS changed the contents of the Linux NFS filehandle, compared to earlier versions (while still supporting older filehandles). This means if the AFM home is upgraded to 4.1.1 or later, existing AFM filesets detect a change in export since the filehandle changes and will suspend future synchronization with home. Similarly, a change from knfsd to Ganesha at home also causes a filehandle change even though the export is the same. The only solution is to resync the cache using failover which is expensive. This fix handles upgrades if home is running GPFS by detecting and upgrading cached filehandle when the filehandle changes for an inode.
- Fix the mmdiscovercomp command that is failing when there are multiple building blocks.
- Re-enable online replica compare and repair.
- This update addresses the following APARs: IV76482 IV78653 IV78662 IV78666 IV78669 IV78672 IV78810 IV78910 IV78912 IV78913 IV78914 IV78915 IV78932 IV79336 IV79338 IV79339.
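The 4.1.1.4 note above about adding a 4K native disk to a non-4K aligned file system lists three conditions. As a rough illustration only, they can be written as a single predicate. This is a sketch, not GPFS code: the class, function, and constant names are hypothetical, and only the three conditions themselves come from the note.

```python
from dataclasses import dataclass

# Hypothetical names for illustration; none of these are GPFS APIs.
MIN_BLOCK_SIZE = 128 * 1024      # "at least 128K" from the release note
MIN_FS_VERSION = (4, 1, 1, 4)    # "at least 4.1.1.4" from the release note


@dataclass
class NonAlignedFileSystem:
    """Assumed description of an existing non-4K aligned file system."""
    block_size: int              # data block size in bytes
    version: tuple               # file system version, e.g. (4, 1, 1, 4)


def can_add_4k_native_disk(fs: NonAlignedFileSystem, disk_usage: str) -> bool:
    """Return True only if all three conditions from the note hold."""
    return (
        disk_usage == "dataOnly"
        and fs.block_size >= MIN_BLOCK_SIZE
        and fs.version >= MIN_FS_VERSION   # tuples compare element by element
    )


if __name__ == "__main__":
    fs = NonAlignedFileSystem(block_size=256 * 1024, version=(4, 1, 1, 4))
    print(can_add_4k_native_disk(fs, "dataOnly"))
    print(can_add_4k_native_disk(fs, "dataAndMetadata"))
```

All three conditions must hold together; failing any one of them makes the predicate false.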
Problems fixed in IBM Spectrum Scale 4.1.1.3 [October 29, 2015]
- Fix a problem in a Disaster Recovery (multi-site) environment. If a network outage prevents the two main sites from talking to each other while both sites can still communicate with the tie-breaker (single-node) site, it is possible that the cluster manager may end up moving from the primary to the backup site. That may cause the primary site to lose quorum.
- Fix a PreAlloc log assert which happens when "offset + len" wraps through zero.
- Fix a regression which breaks FPO locality aware restripe.
- Fix the api gpfs_get_fssnaphandle_by_name to return the proper number of bytes, when called from a 32 bit application, so that the heap is not corrupted.
- Fix a memory-mapped I/O offset issue in which GPFS may not handle I/O properly for very large files.
- Fix the mmrestorefs command failure in the data changes restore phase.
- Handle minquorumNodes correctly in a CCR enabled cluster.
- Fix GPFS SNMP subagent to work with newer Net-SNMP versions. This fix should be applied to any GPFS cluster node given the role of snmp_collector, if it is running RHEL 7.1, or some other Linux version that includes Net-SNMP 5.6 or beyond.
- Do not return AFM-specific internal attributes in gpfs_fgetattrs().
- On Linux kernels 2.6.39 and later, add explicit blk_start_plug/blk_finish_plug calls inside the GPFS I/O submit routine to give the I/O scheduler more chances to merge I/Os into larger requests.
- This update addresses the following APARs: IV77541 IV77542 IV77544 IV78046.
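The 4.1.1.3 PreAlloc note above refers to "offset + len" wrapping through zero. The sketch below illustrates that kind of wrap in general terms. It assumes 64-bit unsigned offsets and lengths, which these notes do not state, and the function names are hypothetical rather than taken from GPFS.

```python
MASK64 = (1 << 64) - 1   # assumption: offsets and lengths are 64-bit unsigned


def end_offset_wrapped(offset: int, length: int) -> int:
    """Compute offset + length the way fixed-width 64-bit arithmetic would."""
    return (offset + length) & MASK64


def range_wraps(offset: int, length: int) -> bool:
    """True if offset + length does not fit in 64 bits (wraps through zero)."""
    return offset + length > MASK64


if __name__ == "__main__":
    offset = (1 << 64) - 4096    # 4 KiB below the top of the 64-bit range
    length = 8192
    # The wrapped end offset lands below the start offset, which is the kind
    # of inconsistency a range check has to detect before using the sum.
    print(end_offset_wrapped(offset, length), range_wraps(offset, length))
```

Python integers do not overflow, so the mask makes the fixed-width behavior explicit. In a fixed-width language the same check is usually written without computing the sum, for example by comparing the length against the remaining headroom above the offset.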
Problems fixed in IBM Spectrum Scale 4.1.1.2 [September 10, 2015]
- Avoid buffer overrun risk in AFM multi-byte scratch file name generation.
- Fix the cause of a crash of the GPFS mmsnmpagentd daemon. The fix only applies to GPFS clusters where a node has been given the snmp_collector role, as seen in mmlscluster output.
- Fix mmbackup which could report success even when some designated files did not back up. The count of objects backed up can become inaccurate due to a persistent problem that the reported number of objects backed up can be inflated by "dsmc" when it chooses to back up additional items such as parent directories. Correct the count of objects backed up by carefully monitoring for any possible misrepresentation from the individual dsmc commands.
- Fix the mmrestorefs command failure at the attributes restore phase of the command.
- mmfsadm dump improvements: add a stricter loop restriction so that the loop exits after dumping the original number of cached record addresses, and improve SIGFPE support during dump.
- Fix rare case of deadlock in direct IO code path when flushing the stolen buffer.
- Fix memory fault (core dump), loop or hang in mmimgrestore during exit processing.
- This fix affects environments installing the Object protocol on an external Keystone where the administrator wants the install to automatically create the Swift entries in the Keystone server.
- When slab allocator creation fails, print a warning message with printk and then fail the mmfslinux.ko load instead of panicking the kernel.
- Fix a possible GPFS daemon crash when using the mmcharrier command to replace a disk in the P7 disk enclosure in which some of the disk slots were not populated. Fix is recommended for P7IH customers and not relevant to other systems.
- Re-enable quota limits automatically after "mmcrfs -Q yes" and "mmchfs -Q yes". It has been disabled wrongly since GPFS v4.
- Fix potential signal 11 encountered during dump of NSD IO buffers.
- Fix the daemon hang during handler cleanup in AFM environment.
- Fix an error when mmafmctl flushpending is invoked without fileset name.
- Fix the data restore problem for small files that have only a fragment block.
- This fix affects environments running the Object protocol with a locally-installed Keystone server with SSL support.
- Fix assert that might occur on systems configured with a small shared segment under stress workload that includes metadata updates and frequent buffer steals.
- Fix code to avoid removing wrong address during deletion of addresses from the cesiplist configuration file.
- Increased stability of the library used to retrieve keys used for file encryption from ISKLM.
- Fix a problem in which the RecLockModuleReset call to __posix_lock_file encounters a bad file pointer.
- Fix a deadlock caused by not releasing the DMAPI lock in failure path of AFM read.
- Fix a problem in which suspended disks are still marked as "tobeemptied" after a successful restripe.
- Fix a problem in which migrating files in an RO fileset causes SetXAttr to be queued at the gateway node.
- Fix the undefined symbols in 32-bit version of libgpfs.so.
- Fix null pointer dereferencing in AFM expiration code by limiting it to work only on valid and registered fileset handlers.
- The GSKit toolkit has been updated to version 8.0.50.47, which (1) fixes the vulnerability described in CVE-2015-1788 and (2) improves the performance of secure sends (cipherList is set to a cipher other than empty or AUTHONLY).
- Fix a problem in which the GSS/ESS component database information can appear out of sync.
- Optimize cifsProcess::isRegistered when the hash chain is empty.
- Fix a specific case where the remote cluster is removed before cleaning up the remote mount entries when using mmremotefs delete.
- Upgraded LROC to support new NSD disk layout.
- Drop the GNR track mutex when trying to acquire the log mutex.
- Fix signal 11 in saveInodePts when configured to use a localCache.
- Fix performance degradation under a workload accessing a large number of files, due to unnecessary atime refresh messages.
- Improve performance for workloads with large numbers of files on systems with fast metadata storage.
- Update code to ignore EINPROGRESS error from flush when setting up pipe for invoking external script from GPFS daemon.
- Fix signal 11 in daemon caused by removing a localCache device.
- When mmchfs is run with a rapid repair option this fix will check to see if the file system is unmounted before executing the command. An error is issued if the file system is mounted.
- Update the threshold to print 'memory usage approaching the limit' warning message that was triggered too early.
- Fix a problem in the AIX operating system, where some system calls like open() may set errno to EPERM, even if returning successfully, when run from non-root users. System calls like shmat() (when used to map a file) may fail with the same value of errno.
- Relax server license requirement for NSD disks in system.log pool.
- This update addresses the following APARs: IV75396 IV75999 IV76016 IV76017 IV76018 IV76019 IV76020 IV76383 IV76455 IV76457 IV76458 IV76461 IV76467 IV76471 IV76473 IV76475 IV76518 IV76759.
Problems fixed in IBM Spectrum Scale 4.1.1.1 [July 30, 2015]
- Fix a rare case that could cause mmsnmpagentd to consume up to 100% of CPU when GPFS daemon terminates. Only affects clusters where a node is given the SNMP collector role.
- Avoid rare kernel assert while deleting many snapshots concurrently on a sluggish system.
- The mmrpldisk command now reports a no space error instead of panicking GPFS file systems that have several almost full disks.
- Provide a default user exit for nodeLeave event for FPO clusters so that the disks could be marked as down and the data integrity is not compromised.
- Print accurate remaining redundancy in the log when rebuild fails due to insufficient disk space.
- Fix "No disk name found" error when all of the disks are either in "emptied" or "to be emptied" state.
- Fix an nsdperf hang that can occur when nsdperf is used on a large number of Linux nodes.
- Change gpfs_prealloc not to preallocate blocks when the requested preallocate size is within the last block of the file but less than the file size. The allocation blocks are rounded to GPFS block boundaries when the file has fragments.
- Fix a problem in which mmdeldisk (relocation of aclFile blocks) results in a lost ACL.
- Fix a rare kernel crash case in incompleteAioListRemove when doing AIO on Linux.
- Fix a deadlock resulting from running fsck and recovery in parallel.
- Fix a problem in which fsck reports false positive DA corruption.
- Fix a problem in which file blocks are not distributed in metablock units among nodes when the FPO file system does not have enough failure groups.
- Enhance FPO autorecovery log for clarity.
- Fix a problem encountered when dumping buffers with NSD checksum errors.
- When a vdisk I/O times out, fail over the recovery group to the backup node.
- Prevent an assert due to a race condition while both creating and deleting snapshots concurrently.
- Avoid truncation failures on clone child files by moving the truncation operation to the later delta restore phase.
- Extract the correct log file name from the "device" input of the mmrestorefs command, so that users no longer see an internal failure error when the restore process fails.
- Fix a deadlock during Ganesha queue clean up. Now, when the daemon crashes, the Ganesha queue is not cleaned by the Ganesha thread; it is cleaned later during SG cleanup.
- Fix slow performance of some administration commands when CCR (Cluster Configuration Repository) is enabled.
- Correct a small vulnerability in takeover after SG manager failure during a snapshot command.
- Fix secondary kernel exception (get_stcP) on Linux cNFS server.
- Enhance mmfsctl to work with topology vector failure group, NSD stanza file.
- Fix performance issues in ESS/GSS clusters under very high stress. This fix applies to customers with client nodes in an ESS/GSS cluster containing a Connect-IB adapter.
- Fix memory fault (core dump) in mmimgrestore during exit processing.
- Improve the performance of communication across daemons when the 'cipherList' configuration parameter is set to something other than empty or AUTHONLY.
- Ganesha: fix an assertion caused by a file descriptor being used after it was released. The release is now done at exit, after all references to the files are done.
- Provide inode number information to an assertion within the low-level file write operation.
- Fix assert "openInstCount >= 0" under stress workload that includes file deletions.
- Fix kxSendFlock to copy in user objects.
- Fix a problem in which the disk failed LED may not be lit when the disk state is set to failed.
- Fix a problem with AIO writes past the end of file, where a file size change may be lost if the GPFS daemon fails or the file system panics shortly after the write was completed.
- Fix a problem in which mmdf shows 0 free blocks for suspended disks.
- Fix a kernel panic due to NULL pointer dereference during hard reboot of the partner node in Ganesha environment.
- Fix a problem with DIRECT_IO writes that can cause data loss when the file system panics or a node fails after a write past end of file using DIRECT_IO increases the file size. The file size increase could be lost.
- Enhance the file system inconsistency state check during the restore process and exit gracefully if an inconsistency is detected.
- Fix a problem in which data was not written to the newly allocated data block when a file was expanded to a size larger than the old allocated data block.
- Fix a problem with the VMware NFS v3 client in Ganesha environment by providing an option to enable short_file_handle, which the VMware NFS client uses.
- Fix a replica mismatch problem that was caused by using the wrong block index in the indirect block.
- The GPFS Hadoop connector supports the Hadoop 2.7.x release.
- The GPFS Hadoop connector supports the hdfs:// scheme.
- This update addresses the following APARs: IV74661 IV74686 IV74697 IV74732 IV75108 IV75394.
Summary of changes for IBM Spectrum Scale version 4 release 1.1 as updated, June 2015
Changes to this release of the IBM Spectrum Scale licensed program and the IBM Spectrum Scale library include the following:
- Active file management asynchronous fileset-level data replication for disaster recovery (DR) Asynchronous replication of data at the file level enables you to create a primary (active)/secondary (passive) relationship at the fileset level. Data is asynchronously replicated to the secondary on a periodic basis. To enable this function, ensure that you run the following commands: * If you are migrating from a previous release, run the mmchconfig release=LATEST command. * Run the mmchfs -V full command.
- Cluster Configuration Repository (CCR) Enhancements were made to restore broken configuration and files to bring a cluster back online or a broken node to a working state. In the case of a disaster recovery setup, steps are provided to downgrade the quorum assignments when half or more of the quorum nodes are no longer available at one of the sites. Consult the "Establishing disaster recovery for your GPFS cluster" topic in the IBM Spectrum Scale: Advanced Administration Guide.
- Cygwin 64-bit version requirement for Windows nodes The 32-bit version of Cygwin is no longer supported for Windows nodes running GPFS. Users that are running GPFS 4.1 with the 32-bit version of Cygwin installed must upgrade to the 64-bit version of Cygwin before installing IBM Spectrum Scale 4.1.1. Users with SUA on GPFS releases prior to 4.1 should upgrade directly to the 64-bit version of Cygwin.
- Data collection for expelled nodes When a node is about to be expelled for unknown reasons, debug data is collected automatically to help find the root cause.
- Deadlock amelioration Deadlock breakup requests can be issued on demand at a time that is chosen by a system administrator. A user callback for the deadlockOverload event can be added to notify a system administrator to check the system and workload for an overload condition.
- File Placement Optimizer (FPO) FPO enhancements deliver the ability to change block allocation of an existing file with the mmrestripefile and mmchattr commands and efficient removal of disks when disks have already been emptied with the auto recovery process. Auto recovery has been optimized to handle multiple failure and recovery events more efficiently.
- Fileset-level integrated archive manager (IAM) modes Fileset-level integrated archive manager (IAM) modes give users the ability to set four different IAM modes at the fileset level, including the root fileset, so that users can modify the file-operation restrictions that normally apply to immutable files. For more information, see the following: * topic about immutability and appendOnly restrictions in the Information Lifecycle Management chapter of the IBM Spectrum Scale: Advanced Administration Guide * mmchfileset and mmlsfileset command descriptions in the IBM Spectrum Scale: Administration and Programming Reference To enable this function, ensure that you run the following commands: * If you are migrating from a previous release, run the mmchconfig release=LATEST command. * Run the mmchfs -V full command.
- GPFS Native RAID (GNR) and Elastic Storage Server (ESS) documentation The documentation for GNR and ESS was removed from the information units in the IBM Spectrum Scale library. This includes GNR commands, GNR callbacks available to the mmaddcallback command, vdisk performance monitoring with the mmpmon command, messages in the ranges 6027-1850 - 6027-1899 and 6027-3000 - 6027-3099, and the chapter in the IBM Spectrum Scale: Advanced Administration Guide titled GPFS Native RAID (GNR). For more information about GNR, see GPFS Native RAID: Administration. For more information about ESS, see Deploying the Elastic Storage Server.
- Hadoop support Hadoop support was expanded from FPO storage to shared storage. This allows data stored in current GPFS clusters using shared storage to be accessible to Hadoop applications. IBM Spectrum Scale Hadoop Connector has been enhanced to transparently support both FPO based storage pools to leverage data locality and shared storage where locality information is not applicable. This allows FPO and shared storage pool to be used within the same file system, which allows Hadoop applications to access data in the entire file systems transparently. IBM Spectrum Scale Hadoop Connector fully supports Hadoop version 2.5, and it can also be used with Hadoop version 2.6 in compatibility mode (Hadoop file system APIs in 2.6 are not yet implemented). The mmhadoopctl command was introduced to simplify IBM Spectrum Scale Hadoop Connector configuration and management.
- Inode expansion optimization In this release, inode expansion, which allows dynamic growth of inodes, is optimized to reduce the contention that can flare up during bursts of file creates. To enable this function, ensure that you run the following commands: * If you are migrating from a previous release, run the mmchconfig release=LATEST command. * Run the mmchfs -V full command to enable all of the new functionality that requires different on-disk data structures. For more information, see the topics on completing migration and use of disk storage and file structure in file systems in the IBM Spectrum Scale: Concepts, Planning, and Installation Guide.
- Installation toolkit The installation toolkit can be used to do the following: * Install and configure GPFS. * Add GPFS nodes to an existing cluster. * Deploy and configure SMB, NFS, OpenStack Swift, and performance monitoring tools on top of GPFS. * Configure authentication services for protocols. * Upgrade GPFS and protocols. For details, see the spectrumscale command description in the IBM Spectrum Scale: Administration and Programming Reference.
- Multi protocol data access Data access to a shared storage infrastructure through enhanced protocol support for NFS, SMB, and Swift Object. For more information, see the IBM Spectrum Scale: Advanced Administration Guide and the IBM Spectrum Scale: Administration and Programming Reference.
- Performance improvements for mmfsck The mmfsck command can now store information that is found during a scan of the file system into a patch file. The information in the patch file can then be used as input to repairing the file system. Using a patch file to repair the file system prevents an additional scan before starting the repair actions. For more information, see the mmfsck command description in the IBM Spectrum Scale: Administration and Programming Reference.
- PIT inode list The parallel inode traversal (PIT) scan used for the mmchdisk, mmdeldisk, mmrestripefs, and mmrpldisk commands has now been updated to produce a list of inodes with interesting attributes, for example: those having broken disk addresses or those being ill placed. While the mmfileid command can be used to list files with broken disk addresses, this can be a slow process. Two new optional parameters, --inode-criteria CriteriaFile and -o InodeResultFile have been added to the more commonly-used mmchdisk, mmdeldisk, mmrestripefs, and mmrpldisk commands. These parameters allow you to find files matching certain criteria without a separate invocation of mmfileid. With this new feature, you can easily find the interesting files and their inode numbers. The output file will contain a list of inode numbers that meet the specified flags along with the name of the flag and the file type. For more information about these commands and for a description of the optional parameters and flags, see the commands in the IBM Spectrum Scale: Administration and Programming Reference. To enable this function, ensure that you run the following commands: * If you are migrating from a previous release, run the mmchconfig release=LATEST command. * Run the mmchfs -V full command to enable all of the new functionality that requires different on-disk data structures.
- Policy improvements: This release includes the following policy improvements:
  mmapplypolicy --sort-command SortCommand: The mmapplypolicy --sort-command parameter allows you to specify an alternative sort command to be used, rather than the default sort command provided with the operating system.
  Implicit SET POOL 'first-data-pool' rule: For file systems that are at or have been upgraded to 4.1.1, the system recognizes that, even if no policy rules have been installed to a file system by mmchpolicy, data files should be stored in a non-system pool if available (rather than in the system pool, which is the default for earlier releases).
  For more information, see the following:
  * Information Lifecycle Management chapter in the IBM Spectrum Scale: Advanced Administration Guide
  * mmchpolicy command description in the IBM Spectrum Scale: Administration and Programming Reference
- Quota management: Quota management improvements for file system format 4.1.1 and higher include:
  * Allowing quota management to be enabled and disabled without unmounting the file system.
  To enable this function, ensure that you run the following commands:
  * If you are migrating from a previous release, run the mmchconfig release=LATEST command.
  * Run the mmchfs -V full command.
- Read replica policy: In a file system with replication, the replicas of each data block are stored on different disks in different failure groups. Using the readReplicaPolicy attribute of the mmchconfig command, you can now specify which replica is read: the first replica, the local or closest replica, or the fastest. For more information, see the mmchconfig command in the IBM Spectrum Scale: Administration and Programming Reference.
- Performance Monitoring Tool: The Performance Monitoring tool aims to provide performance information after collecting the metrics from GPFS and protocol nodes using the mmperfmon query command with an appropriate query. The tool helps in detecting performance issues and problems. The predefined queries and metrics help in investigating every node or any particular node that is collecting metrics.
  For more information, see the following:
  * "Performance Monitoring tool overview" topic in the IBM Spectrum Scale: Advanced Administration Guide
  * mmperfmon command description in the IBM Spectrum Scale: Administration and Programming Reference
- Documented commands, structures, and subroutines: The following lists the modifications to the documented commands, structures, and subroutines:
  New commands: The following commands are new:
  * mmces
  * mmdumpperfdata
  * mmhadoopctl
  * mmnfs
  * mmobj
  * mmperfmon
  * mmprotocoltrace
  * mmsmb
  * mmuserauth
  * spectrumscale
  New structures: There are no new structures.
  New subroutines: There are no new subroutines.
  Changed commands: The following commands were changed:
  * gpfs.snap
  * mmaddcallback
  * mmafmctl
  * mmafmlocal
  * mmapplypolicy
  * mmbackup
  * mmbuildgpl
  * mmchconfig
  * mmchdisk
  * mmcheckquota
  * mmchfileset
  * mmchnode
  * mmchpool
  * mmchpolicy
  * mmcrcluster
  * mmcrfileset
  * mmdeldisk
  * mmedquota
  * mmfsck
  * mmlscluster
  * mmlsfileset
  * mmlsfs
  * mmlspolicy
  * mmlsquota
  * mmpsnap
  * mmrepquota
  * mmrestorefs
  * mmrestripefile
  * mmrestripefs
  * mmrpldisk
  Changed structures: There are no changed structures.
  Changed subroutines: There are no changed subroutines.
  Deleted commands: There are no deleted commands.
  Deleted structures: There are no deleted structures.
  Deleted subroutines: There are no deleted subroutines.
- Messages: The following lists the new, changed, and deleted messages:
  New messages: 6027-962, 6027-2145, 6027-2230, 6027-2234, 6027-2235, 6027-2238, 6027-2240, 6027-2241, 6027-2242, 6027-2245, 6027-2246, 6027-2247, 6027-2248, 6027-2249, 6027-2250, 6027-2251, 6027-2252, 6027-2253, 6027-2254, 6027-2255, 6027-2256, 6027-2257, 6027-2258, 6027-2259, 6027-2260, 6027-2261, 6027-2262, 6027-2263, 6027-2264, 6027-2265, 6027-2266, 6027-2267, 6027-2268, 6027-2269, 6027-2270, 6027-2271, 6027-2272, 6027-2273, 6027-2274, 6027-2281, 6027-2282, 6027-2283, 6027-2284, 6027-2285, 6027-2286, 6027-2287, 6027-2288, 6027-2289, 6027-2290, 6027-2291, 6027-2292, 6027-2293, 6027-2294, 6027-2295, 6027-2296, 6027-2297, 6027-2298, 6027-2299, 6027-2300, 6027-2301, 6027-2302, 6027-2303, 6027-2304, 6027-2305, 6027-2306, 6027-2307, 6027-2308, 6027-2309, 6027-2310, 6027-2311, 6027-2312, 6027-2313, 6027-2314, 6027-2315, 6027-2316, 6027-2317, 6027-2318, 6027-2319, 6027-2320, 6027-2321, 6027-2322, 6027-2323, 6027-2324, 6027-2325, 6027-2326, 6027-2327, 6027-2329, 6027-2330, 6027-2331, 6027-2332, 6027-2333, 6027-2334, 6027-2335, 6027-2336, 6027-2337, 6027-2338, 6027-2339, 6027-2340, 6027-2341, 6027-2342, 6027-2343, 6027-2344, 6027-2345, 6027-2346, 6027-2347, 6027-2348, 6027-2349, 6027-2350, 6027-2351, 6027-2352, 6027-3255, 6027-3256, 6027-3257, 6027-3306, 6027-3307, 6027-3308, 6027-3309, 6027-3310, 6027-3311, 6027-3312, 6027-3313, 6027-3314, 6027-3315, 6027-3316, 6027-3551, 6027-3552, 6027-3553, 6027-3554, 6027-3579, 6027-3580, 6027-3581, 6027-3708, 6027-3709, 6027-3710, 6027-3711, 6027-3712, 6027-3713, 6027-3714, 6027-3715, 6027-3716, 6027-3717, 6027-3718, 6027-3719, 6027-3900, 6027-3901, 6027-3902, 6027-3903, 6027-3904, 6027-3905, 6027-3906, 6027-3907, 6027-3908, 6027-3909, 6027-3910, 6027-3911, 6027-3912, 6027-4000, 6027-4001, 6027-4002, 6027-4003, 6027-4004, 6027-4005, 6027-4006, 6027-4007, 6027-4008, 6027-4009, 6027-4010, 6027-4011, 6027-4012, 6027-4013, 6027-4014, 6027-4015
  Changed messages: 6027-625, 6027-872, 6027-1305, 6027-2181, 6027-2183, 6027-2229, 6027-2714, 6027-2715, 6027-2758, 6027-3248, 6027-3249
  Deleted messages: 6027-2622, 6027-2632, 6027-3511, 6027-3514, 6027-3515, 6027-3516, 6027-3536, 6027-3544
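The enablement steps listed above for the PIT inode list and quota management items (run mmchconfig release=LATEST when migrating, then mmchfs -V full) can be sketched as a small dry-run helper. This is an illustrative sketch, not an IBM-provided tool: the function name `enablement_commands`, the `migrating` flag, and the device name `fs1` are assumptions, and the sketch assumes that mmchfs takes the file system device as its first argument. It only builds the command lines; it does not run them.

```python
from typing import List


def enablement_commands(device: str, migrating: bool = True) -> List[List[str]]:
    """Return the command lines to run, in order, without executing them.

    device    -- file system device name (hypothetical example: "fs1")
    migrating -- True if the cluster was migrated from a previous release
    """
    commands: List[List[str]] = []
    if migrating:
        # Step 1 (migrated clusters only): move the cluster to the latest release level.
        commands.append(["mmchconfig", "release=LATEST"])
    # Step 2: enable functionality that requires the new on-disk data structures.
    commands.append(["mmchfs", device, "-V", "full"])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in enablement_commands("fs1"):
        print(" ".join(argv))
```

Keeping the helper as a dry run lets an administrator review the order of the two steps before running them by hand on a node with the required authority.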
Problems fixed in GPFS 4.1.0.8 [May 26, 2015]
- Correct a small vulnerability in takeover after file system manager failure during a snapshot command.
- Ensure that the online replica compare tool does not report false-positive mismatches when the file system has suspended disks.
- Fix an AFM recovery issue during the fileset unlink.
- Fix a problem when determining whether copy-on-write is needed or not in the presence of snapshots. Sometimes this problem may result in spurious write operation failures (especially, but not limited to file/directory creation).
- Fix a hang in mmrestripefs, which may also result in waiters for "PIT_Start_MultiJob". The problem may happen if the set of nodes specified in the '-N' option to the command includes nodes which are still in the process of being started (or restarted).
- The mmcrsnapshot, mmdelsnapshot, and mmfileset commands quiesce the file system before they start their actual work. If, during that quiesce, a thread deleting an HSM-migrated file is stuck waiting for a recall, the commands could time out, because the recall can take a long time (for example, because of slow tapes). This fix allows those commands to proceed while a deletion is waiting for recall.
- Close a very small deadlock window caused by releasing the kssLock and calling cxiWaitEventWakeupOne, where a thread that is not waiting for the exclusive lock is woken up while the thread that is actually waiting for the lock is left sleeping.
- Avoid a GPFS crash when running mmrestorefs or mmbackup where there are deleted filesets.
- Enable offline fsck to validate the extended attribute file.
- Fix a problem with directory lookup code that can cause an FSErrInodeCorrupted error to be incorrectly issued. This could occur when a lookup on the '..' entry of a directory occurs at the same time as its parent is being deleted.
- Ensure that EA migration to enable FastEA support for a file system does not assert in the 'Data-in-Inode' case under certain conditions.
- Enable online fsck to fix AFM pre-destroyed inodes. Use PIT to clean up unlinked inodes in an AFM-disabled fileset.
- Update allocation code to close a small timing window that could lead to file system corruption. The problem could only occur when a GPFS client has a file system panic at the same time as the new file system manager is performing a take over after the old manager resigned.
- Fix a signal 11 problem in a multi-cluster environment that occurs when the GPFS daemon relays the fsync request through the metanode but the OpenFile is stolen on the metanode in the middle of the operation.
- Remove confusing trace stop failed error messages on Windows.
- The privateSubnetOverride configuration parameter may be used to allow multiple clusters on the same private subnet to communicate even when cluster names are not specified in the 'subnets' configuration parameter.
- Indicate that the mmfileid command does not work when only the GPFS Express Edition is installed.
- Fix a workload counter used for NVRAM log tip I/O processing queues. Recommended if NVRAM log tip is in use.
- Potentially avoid crash on normal OS shutdown of CNFS node.
- Fix issue where file create performance optimization was sometimes disabled unnecessarily.
- In a cluster configured with node quorum, fix a problem where, if the cluster manager fails and the cluster is left with only the bare-minimum number of nodes to maintain node quorum, the cluster may still lose quorum.
- Enable offline fsck to fix AFM orphan directory entries in a single run.
- Fix a problem where the number of nodes allowed in a cluster is reset from 16384 to 8192.
- This affects GSS/ESS customers who are using chdrawer to prepare to replace a failed storage enclosure drawer on an active system.
- Correct a problem in the v4.1 release with directory listings in file systems created prior to v3.2.
- Fix a problem where a race between log wrap and repair threads caused a checksum mismatch in indirect blocks.
- Fix a daemon crash in AFM by ensuring that the setInFlight() method has a positive 'numExecuted' value when calculating the average wait time of the messages.
- Fix a problem on a GPFS CCR cluster where GPFS commands may not work on inactive configuration servers after a new security key is generated.
- Fix poor command performance on a cluster that has no security key.
- Fix a problem with DIRECT_IO writes that can cause data loss when the file system panics or a node fails after a DIRECT_IO write passes the end of the file and increases the file size. The file size increase could be lost.
- Fix a problem where the file cache filled up with deleted objects (Linux NFS).
- Fix a hardlink creation issue by handling the E_NEEDS_COPIED error in the SFSLinkFile function for AFM files.
- Fix handling of policy rules like ... MIGRATE ... TO some-group-pool THRESHOLD (hi,lo) ...
- The /var/mmfs/etc/RKM.conf configuration file used to configure file encryption now supports a wider set of characters.
- Trigger a back-off when 90% of the configured hard memory limit is hit during queuing of AFM recovery operations.
- ESS customers, using zimon, may see GPFS daemon crashes in the performance monitoring code.
- Add support for multiple RDMA completion threads and completion queues.
- Fix signal 11 in verbs::verbsCheckConn_i.
- Fix signal 11 in runTSPcache caused by an uninitialized variable in error paths.
- Fix a problem where mmauth inadvertently changes cipherList to an invalid string.
- Changed Externals:
  New messages:
  * GPFS: 6027-3708 [E] Incorrect passphrase for backend '%s'.
  * GPFS: 6027-3709 [E] Error encountered when parsing line %d: expected a new RKM backend stanza.
  * GPFS: 6027-3710 [E] Error encountered when parsing line %d: invalid key '%s'.
  * GPFS: 6027-3711 [E] Error encountered when parsing line %d: invalid key-value pair.
  * GPFS: 6027-3712 [E] Error encountered when parsing line %d: incomplete RKM backend stanza '%s'.
  * GPFS: 6027-3713 [E] An error was encountered when parsing line %d: duplicate key '%s'.
  * GPFS: 6027-3714 [E] Incorrect permissions for the /var/mmfs/etc/RKM.conf configuration file.
  Deleted messages:
  * GPFS: 6027-3536 [E] Incorrect passphrase '%s' for backend '%s'.
  * GPFS: 6027-3511 [E] Error encountered when parsing '%s': expected a new RKM backend stanza.
  * GPFS: 6027-3515 [E] Error encountered when parsing '%s': invalid key-value pair.
  * GPFS: 6027-3514 [E] Error encountered when parsing '%s': invalid key '%s'.
  * GPFS: 6027-3516 [E] Error encountered when parsing '%s': incomplete RKM backend stanza '%s'.
  * GPFS: 6027-3544 [E] An error was encountered when parsing '%s': duplicate key '%s'.
- This update addresses the following APARs: IV71419 IV71569 IV71601 IV71607 IV71613 IV71616 IV71628 IV71633 IV71634 IV71636 IV71648 IV71692 IV71815 IV72029 IV72033 IV72039 IV72042 IV72048 IV72684 IV72687 IV72688 IV72694 IV72695 IV72698 IV72700 IV72890.
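The node quorum fix in the list above concerns a cluster that is left with exactly the minimum number of nodes needed for quorum. The boundary can be illustrated with a little arithmetic. This is a rough sketch, not GPFS code: it assumes the usual majority rule in which node quorum is one plus half of the defined quorum nodes, and the function names are hypothetical.

```python
def min_quorum_nodes(quorum_nodes: int) -> int:
    """Smallest number of quorum nodes that must be up (assumed rule: half plus one)."""
    if quorum_nodes < 1:
        raise ValueError("a cluster needs at least one quorum node")
    return quorum_nodes // 2 + 1


def has_quorum(quorum_nodes: int, nodes_up: int) -> bool:
    """True if enough quorum nodes are up to maintain node quorum."""
    return nodes_up >= min_quorum_nodes(quorum_nodes)


def at_bare_minimum(quorum_nodes: int, nodes_up: int) -> bool:
    """True if the cluster has quorum with no spare nodes left."""
    return nodes_up == min_quorum_nodes(quorum_nodes)


if __name__ == "__main__":
    # Five quorum nodes; the cluster manager and one other node fail, leaving three.
    print(min_quorum_nodes(5), has_quorum(5, 3), at_bare_minimum(5, 3))
```

With five quorum nodes, three surviving nodes still satisfy the rule, which is the situation in which the cluster is expected to keep quorum.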