During internal testing, a rare but potentially serious problem has been discovered in GPFS. Under certain conditions, a read from a cached block in the GPFS pagepool may return incorrect data which is not detected by GPFS. The issue is corrected in GPFS 3.3.0.5 (APAR IZ70396). All prior versions of GPFS are affected.

The issue was discovered during internal testing, where an MPI-IO application was used to generate a synthetic workload. IBM is not aware of any occurrences of this issue in customer environments or under any other circumstances. Since the issue is specific to accessing cached data, it does not affect applications using DirectIO (the IO mechanism that bypasses the file system cache, used primarily by databases such as DB2 or Oracle).

This issue is limited to the following conditions:

The workload consists of a mixture of writes and reads, to file offsets that do not fall on the GPFS file system block boundaries;
The IO pattern is a mixture of sequential and random accesses to the same set of blocks, with the random accesses occurring on offsets not aligned on the file system block boundaries; and
The active set of data blocks is small enough to fit entirely in the GPFS pagepool.
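To judge whether a workload matches the first two conditions, it can help to check whether the offsets it uses fall on file system block boundaries. The following is a minimal sketch, not part of GPFS: the helper name is_block_aligned and the 256 KiB block size are assumptions for illustration. The actual block size of a file system can be read with mmlsfs (the -B attribute).

```sh
#!/bin/sh
# Hypothetical helper (not a GPFS command): succeeds when OFFSET is a
# multiple of BLOCKSIZE. Both values are byte counts.
is_block_aligned() {
    offset=$1
    blocksize=$2
    [ $((offset % blocksize)) -eq 0 ]
}

# Example with an assumed 256 KiB (262144-byte) block size.
for off in 0 262144 4097; do
    if is_block_aligned "$off" 262144; then
        echo "$off: on a block boundary"
    else
        echo "$off: not on a block boundary"
    fi
done
```

Only accesses of the second kind (offsets that are not multiples of the block size) are relevant to the conditions listed above.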

The issue is caused by a race between an application IO thread doing a read from a partially filled block (such a block may be created by an earlier write to an odd offset within the block), and a GPFS prefetch thread trying to convert the same block into a fully filled one, by reading in the missing data, in anticipation of a future full-block read. Due to insufficient synchronization between the two threads, the application reader thread may read data that had been partially overwritten with the content found at a different offset within the same block. The issue is transient in nature: the next read from the same location will return correct data. The issue is limited to a single node; other nodes reading from the same file would be unaffected.

GPFS requires a certain level of the Korn shell on some Linux distributions to avoid memory leaks and various other problems.

RHEL 5 should be at ksh-20100202-1.el5_6.3, or later
SLES10 should be at ksh-93t-13.17.19 (shipped in SLES 10.4), or later
SLES11 should be at ksh-93t-9.9.8 (shipped in SLES11 SP1)
SLES11 SP2 should be at ksh-93t-9.9.8 (shipped in SLES11 SP1)
RHEL6.1 and RHEL6.2 should be at ksh-20100621-12.el6_2.1 or later
GPFS Debian support is available only on Debian 6.
All other issues are documented in the GPFS FAQ, located here: http://publib.boulder.ibm.com/infocenter/clresctr/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfs_faqs.html
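The ksh levels above can be compared against the installed package with a small version check. This is a rough sketch under stated assumptions: the helper name version_ge is hypothetical, and it relies on the version ordering of GNU sort (sort -V), which only approximates the comparison rules used by RPM itself. The installed value shown is a placeholder.

```sh
#!/bin/sh
# Hypothetical helper: succeeds when version $1 is greater than or equal to
# version $2, using GNU sort's version ordering (an approximation of RPM's
# own comparison rules).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required="20100202-1.el5_6.3"   # RHEL 5 minimum from the list above
installed="20100202-1.el5_6.3"  # placeholder value for illustration
# On a real RPM-based node the installed level could be queried with:
#   rpm -q --qf '%{VERSION}-%{RELEASE}\n' ksh

if version_ge "$installed" "$required"; then
    echo "ksh level is sufficient"
else
    echo "ksh level is older than $required"
fi
```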

Complete the following steps to install the fix package:

Unzip and extract the update package (<filename>.tar.gz file) with one of the following commands:

gzip -d -c <filename>.tar.gz | tar -xvf -

or

tar -xzvf <filename>.tar.gz

Verify the update's RPM or DEB images in the directory. Normally, the list of images in this directory would be similar to one of the following:

GPFS update
gpfs.base-<update_version>.<arch>.update.rpm
gpfs.docs-<update_version>.noarch.rpm
gpfs.gpl-<update_version>.noarch.rpm
gpfs.msg.en_US-<update_version>.noarch.rpm

GPFS update with GPL licensed kernel module
gpfs.base-<update_version>.<arch>.update.rpm
gpfs.docs-<update_version>.noarch.rpm
gpfs.gpl-<update_version>.noarch.gpl.rpm
gpfs.msg.en_US-<update_version>.noarch.rpm

Debian packages
gpfs.base_<update_version>_<arch>_update.deb
gpfs.docs_<update_version>_all.deb
gpfs.gpl_<update_version>_all.deb
gpfs.msg.en-us_<update_version>_all.deb

where
<update_version> specifies the version number of the update you downloaded, for example, 3.5.0-1.
and
<arch> specifies the system architecture, for example x86_64 for 64-bit System x.

For specific filenames, check the Readme for the GPFS update by clicking the "View" link for the update on the Download tab.
Follow the installation and migration instructions in your GPFS Concepts, Planning and Installation Guide.
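The verification step above can be sketched as a script that builds the expected RPM names from the update version and architecture and reports any that are missing from the extraction directory. The helper names expected_rpms and missing_rpms are hypothetical, and the name pattern follows the hyphenated form used in the rpm commands and in the tar file contents list of this README. The Readme for the specific update remains the authority on file names.

```sh
#!/bin/sh
# Hypothetical helper: print the RPM names expected for a plain GPFS update.
#   $1 = update version (for example 3.5.0-4), $2 = architecture (for example x86_64)
expected_rpms() {
    ver=$1
    arch=$2
    printf '%s\n' \
        "gpfs.base-$ver.$arch.update.rpm" \
        "gpfs.docs-$ver.noarch.rpm" \
        "gpfs.gpl-$ver.noarch.rpm" \
        "gpfs.msg.en_US-$ver.noarch.rpm"
}

# Hypothetical helper: print each expected RPM that is absent from a directory.
#   $1 = directory, $2 = update version, $3 = architecture
missing_rpms() {
    dir=$1
    shift
    expected_rpms "$@" | while read -r f; do
        [ -e "$dir/$f" ] || echo "$f"
    done
}
```

For example, running missing_rpms . 3.5.0-4 x86_64 in the extraction directory prints nothing when all four images are present.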

Note that a node-by-node upgrade cannot be used to migrate from GPFS 3.2 or 3.3 to 3.5. For example, upgrading from 3.3.x.x to 3.5.y.y requires a complete cluster shutdown, an upgrade install on all nodes, and then a cluster startup.

Upgrading GPFS may be accomplished either by upgrading one node in the cluster at a time or by upgrading all nodes in the cluster at once. When upgrading GPFS one node at a time, the steps below are performed on each node in the cluster in sequence. When upgrading the entire cluster at once, GPFS must be shut down on all nodes in the cluster prior to upgrading.

When upgrading nodes one at a time, you may need to plan the order of nodes to upgrade. Verify that stopping a given machine does not cause quorum to be lost and does not take down the last available NSD server for some disks. Upgrade the quorum and manager nodes first. When upgrading the quorum nodes, upgrade the cluster manager last to avoid unnecessary cluster failover and election of new cluster managers.

Prior to upgrading GPFS on a node, all applications that depend on GPFS (e.g. Oracle) must be stopped. Any GPFS file systems that are NFS exported must be unexported prior to unmounting GPFS file systems. If tracing was turned on, then tracing must be turned off before shutting down GPFS as well.
Stop GPFS on the node. Verify that the GPFS daemon has terminated and that the kernel extensions have been unloaded (mmfsenv -u). If the command mmfsenv -u reports that it cannot unload the kernel extensions because they are "busy", the install can proceed, but the node must be rebooted after the install. "Busy" means that some process has its current directory in a GPFS file system or holds an open file descriptor on one. The freeware program lsof can identify the process, which can then be killed. Retry mmfsenv -u; if it succeeds, a reboot of the node can be avoided.
Upgrade GPFS using the RPM command as follows:

For SLES or RHEL systems:

GPFS update
rpm -U gpfs.base-<update_version>.<arch>.update.rpm
rpm -U gpfs.docs-<update_version>.noarch.rpm
rpm -U gpfs.gpl-<update_version>.noarch.rpm
rpm -U gpfs.msg.en_US-<update_version>.noarch.rpm

GPFS update with GPL licensed kernel module
rpm -U gpfs.base-<update_version>.<arch>.update.rpm
rpm -U gpfs.docs-<update_version>.noarch.rpm
rpm -U gpfs.gpl-<update_version>.noarch.gpl.rpm
rpm -U gpfs.msg.en_US-<update_version>.noarch.rpm

For Debian systems:

dpkg -i gpfs.base_<update_version>_<arch>_update.deb
dpkg -i gpfs.docs_<update_version>_all.deb
dpkg -i gpfs.gpl_<update_version>_all.deb
dpkg -i gpfs.msg.en-us_<update_version>_all.deb
Check the GPFS FAQ to see if any additional images or patches are required for your Linux installation: General Parallel File System FAQs (GPFS FAQs)
Recompile any GPFS portability layer modules you may have previously compiled. The recompilation and installation procedure is outlined in the following file:
/usr/lpp/mmfs/src/README
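The per-node steps above (stop GPFS, unload the kernel extensions, upgrade the packages, rebuild the portability layer) can be sketched as a dry-run script for an RPM-based node. This is an illustration under several assumptions, not an IBM-supplied procedure: the function names run and upgrade_node are hypothetical, mmfsenv -u is assumed to exit non-zero when the kernel extensions are busy, and the make targets (Autoconfig, World, InstallImages) are the ones commonly documented for GPFS 3.x and should be checked against /usr/lpp/mmfs/src/README. With DRY_RUN left at its default of 1, the commands are only printed.

```sh
#!/bin/sh
# Dry-run sketch of the per-node upgrade on an RPM-based system.
# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it.

GPFS_BIN=/usr/lpp/mmfs/bin
GPFS_SRC=/usr/lpp/mmfs/src

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper.
#   $1 = directory with the extracted RPMs
#   $2 = update version (for example 3.5.0-4)
#   $3 = architecture (for example x86_64)
upgrade_node() {
    pkgdir=$1
    version=$2
    arch=$3

    # Stop GPFS on this node.
    run "$GPFS_BIN/mmshutdown" || return 1

    # Assumption: a non-zero exit status means the extensions are "busy".
    if ! run "$GPFS_BIN/mmfsenv" -u; then
        echo "Kernel extensions busy: reboot this node after the install." >&2
    fi

    # Upgrade the four update packages.
    run rpm -U "$pkgdir/gpfs.base-$version.$arch.update.rpm" || return 1
    run rpm -U "$pkgdir/gpfs.docs-$version.noarch.rpm" || return 1
    run rpm -U "$pkgdir/gpfs.gpl-$version.noarch.rpm" || return 1
    run rpm -U "$pkgdir/gpfs.msg.en_US-$version.noarch.rpm" || return 1

    # Rebuild the portability layer (targets assumed; see the src README).
    run make -C "$GPFS_SRC" Autoconfig || return 1
    run make -C "$GPFS_SRC" World || return 1
    run make -C "$GPFS_SRC" InstallImages || return 1
}
```

Calling upgrade_node /tmp/gpfs-update 3.5.0-4 x86_64 with the default settings prints the nine commands in order, which makes it easy to review the sequence before running anything on a production node. The directory /tmp/gpfs-update is a placeholder.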

[July 12, 2012]

GPFS file systems created with versions earlier than 3.4 should not be migrated using the mmmigratefs command with the --fastea option until a fix is provided from IBM. IBM plans to make the fix available in GPFS versions 3.5.0.3 (APAR IV24151) and 3.4.0.15 (APAR IV24150). An ifix will also be available from IBM Service.

[January 24, 2011]

A fix introduced in GPFS 3.3.0-11 and in GPFS 3.4.0-3 changed the returned buffer size for file attributes to include additional available information, affecting the TSM incremental backup process due to the selection criteria used by TSM. As a result of this buffer size change, TSM incremental backup will treat all previously backed up files as modified, causing the dsmc incremental backup process to initiate new backups of all previously backed up files. If the file system being backed up is HSM managed, this new backup can result in recall of all files which have been previously backed up. This effect is limited to files backed up using TSM incremental backup; there are no known effects on files backed up using either GPFS mmbackup or the TSM selective backup process.

This issue is resolved in GPFS 3.3.0-12 (APAR IZ92779) and GPFS 3.4.0-4 (APAR IZ90535). Customers using the TSM Backup/Archive client to do incremental backup (via dsmc incremental command) should not apply GPFS 3.3.0-11 or GPFS 3.4.0-3, but should wait to apply GPFS 3.3.0-12 or GPFS 3.4.0-4. Any customer using TSM incremental backup and needing fixes in GPFS 3.3.0-11 or 3.4.0-3 should apply an ifix containing the corresponding APAR before executing dsmc incremental backup using these PTF levels, to avoid the additional file backup overhead, and (in the case of HSM-managed file systems) the potential for large scale recalls caused by the backup. Please contact IBM service to obtain the ifix, or to discuss your individual situation.

[October 26, 2010]

The restriction below is no longer in effect. GPFS file systems with file system format version less than 9.00 as reported by mmlsfs (V2.3 and older) can now be mounted on a GPFS V3.4 cluster safely.

[July 30, 2010]

Restrictions: File systems that currently have file system format version less than 9.00, as reported by mmlsfs (this format version corresponds to GPFS V2.3 and older) cannot be mounted on a GPFS V3.4 cluster. This restriction will be lifted in a future GPFS update.

[April 1, 2010]

Click here for details.

[March 31, 2010]

Support for SLES 10 kernel beyond 2.6.16.60-0.58.1 has changed. GPFS 3.3 requires GPFS 3.3.0-5.

[November 9, 2009]

GPFS 3.3.0-1 does not correctly operate with file systems created with GPFS V2.2 (or older). Such file systems can be identified by running "mmlsfs all -u": if "no" is shown for any file system, this file system uses the old format, and the use of GPFS 3.3.0-1 is not possible. GPFS 3.3.0-2 corrects this issue.
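The check described above can be sketched as a filter over the output of "mmlsfs all -u". The function name old_format_filesystems is hypothetical, and the parsing assumes a layout in which each file system is introduced by a line of the form "File system attributes for <device>:" followed by a row whose first two fields are the flag (-u) and its value. Verify that assumption against the output on your own cluster before relying on it.

```sh
#!/bin/sh
# Hypothetical filter: read "mmlsfs all -u" output on standard input and
# print the device of every file system whose -u attribute is "no".
# The header and row layout matched here is an assumption.
old_format_filesystems() {
    awk '
        /^File system attributes for / { fs = $NF; sub(/:$/, "", fs) }
        $1 == "-u" && $2 == "no"       { print fs }
    '
}

# Intended use on a GPFS node (not executed here):
#   /usr/lpp/mmfs/bin/mmlsfs all -u | old_format_filesystems
```

Any device printed by the filter uses the old format and, per the notice above, cannot be used with GPFS 3.3.0-1.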

The update images listed below and contained in the tar image with this README are maintenance packages for GPFS. The update images are a mix of normal RPM or DEB images that can be directly applied to your system.

The update images require a prior level of GPFS. Thus, the usefulness of this update is limited to installations that already have the GPFS product. Contact your IBM representative if you desire to purchase a fully installable product that does not require a prior level of GPFS.

After all RPMs or DEBs are installed, you have successfully updated your GPFS product.

Update to Version:

3.5.0-4

Update from Version:

3.5.0-0 through 3.5.0-3
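Because this update applies only to the levels listed above, a simple pre-check can confirm that a node is at an eligible level. The following is a minimal sketch; the function name can_update_from is hypothetical, and the version string is expected in the version-release form used throughout this README (for example 3.5.0-2).

```sh
#!/bin/sh
# Hypothetical pre-check: succeeds when the given GPFS level is one that
# this update (3.5.0-4) can be applied to, that is 3.5.0-0 through 3.5.0-3.
can_update_from() {
    case "$1" in
        3.5.0-[0-3]) return 0 ;;
        *)           return 1 ;;
    esac
}

# On an RPM-based node the installed level could be obtained with:
#   rpm -q --qf '%{VERSION}-%{RELEASE}\n' gpfs.base
```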

Update (tar file) contents:

README

changelog
gpfs.base-3.5.0-4.x86_64.update.rpm
gpfs.docs-3.5.0-4.noarch.rpm
gpfs.gpl-3.5.0-4.noarch.rpm
gpfs.msg.en_US-3.5.0-4.noarch.rpm
gpfs.base_3.5.0-4_amd64_update.deb
gpfs.docs_3.5.0-4_all.deb
gpfs.gpl_3.5.0-4_all.deb
gpfs.msg.en-us_3.5.0-4_all.deb

Unless specifically noted otherwise, this history of problems fixed for GPFS 3.5.x applies for all supported platforms.

Problems fixed in GPFS 3.5.0.4 [September 14, 2012]

Fix code to prevent a rare condition where many inode expansion threads can be started by periodic sync. This can cause the GPFS daemon to run out of resources for starting new threads.
Fix a segfault problem that occurs after a node takes over as ccmgr and finishes the DMAPI session recovery process.
Fix the code that can cause a GPFS daemon assert when multiple threads working on the same file cause a race condition.
Add the environment variable MMBACKUP_RECORD_ROOT, which specifies an alternate directory in which to store shadow files, list files, temp files, etc.
Fix an assert encountered during opening of NSDs. This assert occurs due to a rare race condition which requires the device backing particular NSDs to completely disappear from the operating system while opening the NSD.
This fix applies only to customers who want SUBSTR interpreted sensibly for negative indices.
Fix null pointer dereference when an RDMA connection breaks during memory buffer adjustment and verbsRdmaSend is enabled.
Mask out ReadEA (which is the same as ReadNamed) from unallowed rights so that the lack of it is not interpreted as a denial. Only the presence of an explicit ACE can deny the ReadEA right.
Fix an issue in a mixed version cluster, where a node running GPFS 3.4 or older failing in a small window during mount could cause spurious log recovery errors.
Fix CNFS to recognize GPFS filesystem in RHEL6.3.
Fix an assert that happened in a trace statement after an xattr overflow block was copied to a snapshot.
This fix applies to any customer who needs to kill the scripts started by mmapplypolicy, or who is wondering why, on AIX, a faulty program started by mmapplypolicy "hangs" instead of aborting.
Fix assert "MSGTYPE == 34" that occurs in pre and post-3.4.0.7 mixed multicluster environment.
Fix a problem where the offline bit gets lost after CIFS calls gpfs_set_winattrs.
Fix a problem which occurs in GNR configurations with replicated file systems. Should an NSD checksum error occur between an NSD client and GNR server, the first such error on a transaction could be mistakenly ignored, resulting in no callback invocation or event generated for it. Additionally, if the checksum error is persistent on the same transaction, the code could attempt to retry the transaction one more time than allowed by the configuration settings.
Fix a sequential write performance problem and a deadlock present since 3.4.0.12 and 3.5.
This fix applies to any customer who has policy rules that reference the PATH_NAME variables AND who might encounter a path_name whose length exceeds 1024 bytes.
Fix segfault in dm_getall_disp() functions.
This update addresses the following APARs: IV27283 IV27287 IV27288 IV27290 IV27291.

Problems fixed in GPFS 3.5.0.3 [August 21, 2012]

Fixed potential live-lock in snapshot copy-on-write of the extended attribute overflow block when the next snapshot is being deleted. Problem occurred in rare cases after the inode file increases in size
mmbackup will check if session between remote TSM client node and TSM server is healthy and will remove the combination from transaction if non-healthy situation is detected
Prevent an assert accessing files via DIO
mmbackup will filter ANS1361E Session Rejected: The specified node name is currently locked error and will exit error
mmbackup will filter filename that contains unsupported characters by TSM
When a tiebreaker disk is being used, avoid quorum loss under heavy load when the tiebreaker disk is down but all quorum nodes are still up
Fix the file close code to prevent a daemon assert which can occur on AIX with a DMAPI-enabled filesystem
Fix an infinite wait during delsnapshot
Fix a problem where mmdf cannot return correct inode information in a mixed big-endian and little-endian cluster
Fix an assert when copying the inode block to the previous snapshot
Added logic to reduce the chance of failure for "mmfsadm dump cfgmgr"
When a tiebreaker disk is used, prevent situations where more than one cluster configuration manager is present simultaneously in the same cluster
Fixed old bug in getSpareLogFileDA due to a typo
Fix assertion failure when multiple threads use direct I/O to write to the same block of a file that has data replication enabled
Fix daemon crash, during log recovery, when log file becomes corrupted
Fix a problem that would cause mmadddisk failure
Fix assert "isValid()" that occurs during mmbackup of a snapshot
Fix an assertion caused by leftover "isBeingRestriped" bit after a failed restripe operation
Update mmrpldisk to issue a warning instead of an error when it cannot invalidate disk contents because the disk is in the down state
Fix regression introduced in 3.4.0.13 and 3.5.0.1 that could in some cases cause "mmchdisk ... start" to fail with spurious "Inconsistency in file system metadata" error
Avoid deadlock creating files under extreme stress conditions
Fix code to ensure that the E_ISDIR error is returned when the FWRITE flag is used to open a directory
Fix snapshot creation code to prevent a possible GPFS daemon assert when filesystem is very low on disk space
Fix problems with using mmapped files after a filesystem has been force unmounted by a panic or cluster membership loss
Fix regression where a race condition between restripe and unmount could cause the GPFS daemon to restart with error message "assert ... logInodeNum == descP->logDataP[i].logInodeNum" in the GPFS console log
mmbackup will report severe error if dsmc hit ANS1351E (Session rejected: All server sessions are currently in use)
Fix issue in multi-cluster environment, where nodes in different remote clusters updating the same set of files could cause deadlock under high load
mmbackup will filter filename with newline correctly
Improve error handling for completed tracks
Fix a bug that causes slowness during mmautoload/mmstartup on systems with automount file system. The performance hit is noticeable on large clusters
Prevent very rare race condition between fileset commands and mount
Fixed rare assert in log migration
Fix assert "writeNSectors == nSectors" that occurs during "mmchfs --enable-fastea"
Update mmlsquota -j Fileset usage message
Fix allocation message handler to prevent a GPFS daemon assert. The assert could happen when a filesystem is being used by more than one remote cluster
Block Linux NFS read of a file when CIFS holds a deny share lock
Speed-up recovery when multiple nodes fail, and multiple mmexpelnode commands are invoked with each failed node as target. Applies mostly to DB2 environments
Fix rare assert under workload with concurrent updates to a small directory from multiple nodes
Fix null pointer dereference in the case of an I/O failure on a gw node
Fixed hang problem when deleting HSM migrated file after creating a snapshot
Fix a GPFS API gpfs_next_inode issue that it doesn't scan the file whose inode number is the max inode number of file system or fileset
Fixed assertion when generating read or destroy events
Fix the mmcrfs command to handle the -n numNodes value greater than 8192
Extend mmbackup's tolerance of TSM failures listed in the audit log even when paths are duplicate or unrequested. TSM frequently logs in the audit log a number of unexpected path names. Sometimes the path name is a duplicate due to repeated errors or due to TSM trying to back up objects in a different order than presented in the list file. Other times the object simply was not requested and it tries to back it up anyway. Make mmbackup ignore these log messages during shadow database compensation. Log all uncompensated error messages to files in backupStore (root) in mmbackup.auditUnresolved. and mmbackup.auditBadPaths. Add new debug bit to DEBUGmmbackup: 0x08 to cause a pause before backup activities commence and a second pause before analysis of audit logs. Correct minor errors in close() handling of various temp files
Fixed sig 11 when background deletion is trying to access an OpenFile object that was removed from cache while waiting for quiesce to finish
Fixed race condition between FakeSync and RemoveOpenFile
Fix a kernel panic caused by a race between two NFS reads
Fix restripe code that could cause filesystem corruption. The problem only affects filesystems that were created without FASTEA enabled but were later upgraded to enable FASTEA via mmmigratefs with the --fastea option
Loss of access to files with ACLs can occur if independent filesets are, or have been, created in the filesystem
This fix only applies to customers running GPFS on Linux/PowerPC, using WEIGHT clauses in their policy rules
Fix mmdeldisk to ignore special files that do not have data in a pool
Close a hole that gpfs_ireadx/ireadx64 cannot find more than 128 delts. Close a hole that call gpfs_ireadx/ireadx64 for an overwritten file may get assert if the input offset is not 0
Fixed a problem where 'mmchmgr -c' fails on a cluster configured with a tiebreaker disk, resulting in quorum loss
EINVAL returned from gpfs_fputattrs when an empty NFSv4 ACL is included
FSErrBadAclRef reported when lockGetattr called RetrieveAcl with a zero aclRef
Fixed deadlock resulting from out-of-order aclFile/buffer locking
This fix only applies to customers who have set tscCmdPortRange, running mmapplypolicy, running a firewall that is preventing policy from exploiting multi-nodal operation
Fix code to avoid unavailable disks when there is no metadata replication
Fix rare race condition where a node failure while writing a replicated data block under certain workloads could lead to replica inconsistencies. A subsequent disk failure or disk recovery could cause reads to return stale data for the affected data block
Fix hung AIX IO when the disk transfer size is smaller than the GPFS blocksize
gpfs_i_unlink failed to release d_lock causing d_prune_aliases crash
This fix only applies to customers who are on AIX and have gotten "no enough space" errors when running mmapplypolicy
This update addresses the following APARs: IV21750 IV21756 IV21758 IV21760 IV23290 IV23810 IV23812 IV23814 IV23842 IV23855 IV23877 IV23879 IV24151 IV24382 IV24426 IV24942 IV25185 IV25484 IV25487 IV25488 IV25762 IV25763 IV25771
IV24937 is documented further at the URL: http://www.ibm.com/developerworks/forums/thread.jspa?threadID=448578&tstart=0

Problems fixed in GPFS 3.5.0.2 [May 30, 2012]

mmbackup will exit 1 when auditlog file is not available for result analysis after backup transaction is done.
Fix a problem stealing buffers in a large pagepool after installing 3.4.0.11.
When a backup partially fails, mmbackup continues to compensate the shadow file even though there are multiple failures reported for the same file in the auditlog file.
Fixed a bug in log recovery which could result in a "CmpMismatch" file system corruption problem.
Fix for the iP->i_count == 0 kernel assert in super.c. This problem only affects Linux 2.6.36 and later.
Fix a rare deadlock where a kernel process gets blocked waiting for a free mailbox to send to the GPFS daemon.
mmbackup will exit 1 when an incremental backup partially fails and shadow file compensation succeeds.
Correct mmlsfileset output for junctions of deleted filesets in some cases.
Fix a memory allocation problem when online mmfsck runs on a node with a heavy mmap workload.
mmbackup will not stop processing even though there's no auditlog file if only expiration processing is done.
mmbackup will display progress msg "Expiring files..." correctly if expiration transaction takes longer than 30 mins.
Prevent the cluster manager from being expelled as a consequence of some communication outage with another node.
mmbackup with multiple TSM clients will catch all error messages from dsmc command output.
Fixes problem where the 'expelnode' callback indicates that the chosen node had joined the cluster first.
Fix a problem with nBytesNonStealable accounting.
Fixed message handler for filesystem quiesce which caused a GPFS assert when filesystem manager failed while filesystem is being quiesced.
Fix printing of long fileset names in mmrepquota and mmlsquota commands.
Fix mmap operations to go through the NSD server when direct access to disks is no longer possible.
Fix mmsetquota to handle numerical fileset names.
mmbackup can backup files/directories with long pathname as long as GPFS and TSM support.
Fix an error message in mmchattr command with -M/R/m/r option.
Fix a problem where restripe fell into an infinite loop when the sg panicked on the busy node.
mmbackup will display backup/expiration progress message in every interval specified by MMBACKUP_PROGRESS_INTERVAL environment variable if specified. Otherwise, mmbackup will display backup/expiration progress message every 30 mins.
Fixed rare assert when deleting files in a fileset.
Fixed rare hang problem during sg or token recovery.
Fix deadlock when doing inode scan (mmapplypolicy/mmbackup) in small pagepool.
getxattr for ACLs may overwrite the kernel buffer if small buffer sizes (less than 8 bytes) are specified.
When mmbackup shadow file is rebuilt by --rebuild or -q option, mmbackup will get CTIME information from TSM server, hence files modified after previous backup but before shadow is rebuilt will be backed up by consequent incremental backup.
GNR: fix a problem where certain errors from a pdisk, like media errors, caused RecoveryGroup open to fail. Change code to continue attempting to open the RecoveryGroup and simply discount the pdisk(s) returning media errors (and unexpected error codes).
Prevent disks from being marked as 'down' when a node with the configuration option unmountOnDiskFail=yes receives an I/O error or loses connectivity to a disk.
When mmbackup can't backup files, the message is more informational.
Fixed mailbox calls which can lead to deadlock during filesystem quiesce. The deadlock is most likely to happen on an extremely overloaded system.
When a backup fails (partially or entirely) due to an error from the TSM client, mmbackup will display the error message from the TSM client for easy problem detection. However, mmbackup will display the message only once for the same error, even if the error occurs multiple times.
Make GPFS more resilient to sporadic errors during disk access. Upon an unexpected error during disk open, such as ERROR_INSUFFICIENT_BUFFER, GPFS now retries the call after a brief pause.
When compensating the shadow file takes a long time because a backup partially failed, mmbackup will show a progress message.
Fixed the backward compatibility error in reading data across nodes on different versions. This is needed if you are upgrading from 3.4.0.6 or a lower version number to 3.4.0.12 or a higher GPFS version.
mmbackup Device -t incremental and --rebuild is valid syntax and will work properly.
Fix the problem that deldisk returned success even though it failed.
handleRevokeM loops when a lease callback is not responded to.
This update addresses the following APARs: IV19037 IV19165 IV20350 IV20610 IV20613 IV20615 IV20618 IV20619 IV20625 IV20627 IV20630 IV20634.

Product/Component Name:	Platform:	Fix:
General Parallel File System	Linux 64-bit,x86_64 RHEL Linux 64-bit,x86_64 SLES	GPFS-3.5.0.4-x86_64-Linux

Readme and Release notes for release 3.5.0.4 General Parallel File System 3.5.0.4 GPFS-3.5.0.4-x86_64-Linux Readme